IIUM Engineering Journal, Vol. 5, No. 1, 2004

COMPREHENSIVE DOWNTIME PREDICTION IN NEXT-GENERATION INTERNET

W. F. AL-KHATEEB, S. AL-IRHAYIM, K. A. AL-KHATEEB

Department of Electrical and Computer Engineering, Faculty of Engineering, International Islamic University Malaysia, 53100 Kuala Lumpur, Malaysia. e-mail: wajdi@iiu.edu.my

Abstract: The benchmark for the reliability quality of networks depends mainly on the accuracy and comprehensiveness of the reliability parameters. Downtime prediction for a communication system is crucial for the quality of service (QoS) offered to the end user. The Markov model enables analytical calculation of a single-figure average cumulative downtime over one year. Because failures are random, this single average generally does not adequately describe the wide range of service performance likely to be experienced in communications systems. It is therefore more appropriate to add the downtime distribution obtained from network availability models in order to predict the expected cumulative downtime and other performance parameters across a large population of systems. The distribution approach provides more comprehensive information about the behavior of the individual systems. The Laplace-Stieltjes transform enables analytical solutions for simple network architectures, i.e. the simplex system and the parallel system. This paper uses simulations to determine reliability parameters for complex architectures such as the Multiprotocol Label Switching (MPLS) backbone planned for the next-generation Internet. In addition to the single-figure downtime, simulations provide other reliability parameters such as the probability of zero downtime. The paper also considers the downtime distribution among a population of equally designed systems.

Key Words: Reliability, availability, downtime, Multiprotocol Label Switching

1.
INTRODUCTION

The Internet is challenged to narrow the significant quality and reliability gaps that still exist between the circuit-switched, voice-based telecom networks and the packet-switched, IP-based Internet. QoS is a mechanism that distinguishes traffic types so that they can be classified and administered differently through the network; it focuses on parameters that depend on the application framework. Typical parameters are timeliness, accuracy, and reliability. QoS has become a key research area in the Internet Engineering Task Force (IETF). Several RFCs [1], [2] and Internet drafts [3], [4] have been published proposing differing technologies for defining and handling reservation and differentiated services, in short the "guarantees" needed by mission-critical and real-time services. At the same time, QoS technology over the Internet has not been comprehensively defined, and numerous tools still require definition. No doubt these gaps will one day be filled, or at least redefined, given sufficient marketing and economic impetus as well as widespread acceptance among the user community.

The focus of this paper, however, is on the platform that all those technologies need in order to become acceptable to critical users in the telecommunications community, i.e. the availability of an Internet backbone that is able to function in a similar way to the conventional telecom infrastructure, known for its high reliability since the 1960s.

1.1 INTERNET RELIABILITY BENCHMARKING

In order to appreciate the role of high availability in the next-generation Internet, it is helpful to highlight the difference in terminology before proceeding with any analysis.
While the "reliability" R(t) is the probability that a system will not fail during the time t, the availability A(t) of a system is the probability that the system is operating successfully at time t, or the ratio of the time that the system is available to the total time. It is common to express availability either as a percentage, i.e. a number of 9's, or as downtime per year. Although 99% availability may sound good, it still results in over three and a half days of downtime per year. It is therefore far from acceptable to the telecom industry, which is used to availabilities in the range of 99.99% to 99.999% [5], or four to five nines. Table 1 shows figures suggested by Intel [6] to reflect possible availability benchmarking.

Table 1: Typical Suggested Availability Benchmarking

| 9's | Availability | Downtime/yr | Examples |
|---|---|---|---|
| 1 | 90.0% | 36 day 12 hr | Personal clients |
| 2 | 99.0% | 87 hr 36 min | Entry-level businesses |
| 3 | 99.9% | 8 hr 46 min | ISPs, mainstream business |
| 4 | 99.99% | 52 min 33 s | Data centers |
| 5 | 99.999% | 5 min 15 s | Carrier-grade telco, banking |
| 6 | 99.9999% | 31.5 s | Military defense system |

Two notable reliability objectives are:

• Bell, 1964: telephone objective [5] of "no more than 2 hr downtime in 40 yr", which converts into "no more than 3 min/yr" when applied to voice communication. In terms of the number of 9's it is between five and six 9's.

• Bellcore, 1993: objective for end office switching system access to the SS7 Common Channel Signaling (CCS) network [7] of "two minutes per year", which is equal to an access unavailability of about $3.8 \times 10^{-6}$. This objective is also between five and six 9's.

1.2 MPLS RECOVERY SCHEMES

Recovery in the current IP-based Internet relies on instant re-computation of an alternative path following the failure of an active path. The time taken for such a calculation exceeds the maximum delay allowed in mission-critical and real-time applications.
MPLS-based recovery, on the other hand, relies on rerouting the traffic to predefined backup paths.

[Fig. 1: MPLS Recovery Schemes. Labels: protected route, backup route, ingress, egress, backbone with recovery; end-to-end rerouting, fast rerouting, dynamic rerouting.]

A widespread topology used in connection with the MPLS backbone consists of a main path, known as the "protected path", and one or more backup paths. Each path consists of a number of disjoint Label-Switch Routers (LSRs) and two common edge routers. The two edge routers shared by the two paths are the starting router, or "ingress", and the end router, or "egress" (Fig. 1). The backup path between the ingress and the egress is kept ready to carry the rerouted traffic following a failure in the main path. This arrangement resembles the 1+1 and 1:1 backup protection/switching in the infrastructure of conventional telecom backbones. This topology is effective in dealing with:

• Ingress-based protection/switching (pre-negotiated)
• Fast-rerouting restoration by a specific node detecting a failure

In order to allow such a topology to deal with dynamic routing, the protected and the backup paths are equipped with an equal number of LSRs. The routers in the protected path are connected to their peers in the backup path through cross-links, allowing for dynamic rerouting around a failed link. The behavior of the restoration mechanisms, i.e. end-to-end rerouting, fast rerouting, and dynamic rerouting, is shown in Fig. 1.

Assuming perfectly reliable nodes and links with a finite reliability p, a network model with protection for the MPLS backbone is shown in Fig. 2. The links A through H may assume any failure and repair rates.

[Fig. 2: Triple Bridge Network. Link labels shown in the figure: A, B, C, D, E, F, G, H, I, J, K.]

The size of such a network is determined by the number of cross-links. Because of its bridge structure, such a complex network may be designated as an n-tuple bridge.
A network with one, two, three, or n cross-link(s) is called a single-bridge, double-bridge, triple-bridge, or n-tuple bridge respectively. The scaling of the network structure, i.e. the number of cross-links, is one of the criteria in evaluating the availability.

2. DOWNTIME OBJECTIVES

Downtime objectives are intended to predict the amount of time that networks, either fully or partially, become unavailable, i.e. do not provide the intended service. These objectives are usually stated as a single number equal to the average unavailability among a population of systems designed to provide a given reliability/availability quality. Because failures in communications systems, generally in nodes and links, are random, a single average does not adequately describe the wide range of service performance that is likely to be experienced. It is therefore more appropriate to add the downtime distribution obtained from network availability models in order to predict the expected cumulative downtime and other performance parameters. The downtime distribution makes it easier to judge whether the range of a certain distribution is acceptable for the type of service being planned, and hence allows tradeoffs between equipment failure rate and average restoration time to be considered in connection with the architecture of the system.

2.1 DOWNTIME PREDICTION

The Markov model applied to a single-component repairable system with a two-state state-space diagram results in the differential equations

$$P_0'(t) = -\lambda P_0(t) + \mu P_1(t) \tag{1.a}$$

$$P_1'(t) = \lambda P_0(t) - \mu P_1(t) \tag{1.b}$$

where $P_0$ is the probability of being in state 0 (operative), $P_1$ is the probability of being in state 1 (failed), and $\lambda$ and $\mu$ are the failure and repair rates, both associated with exponential distributions.
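Eqs. 1.a and 1.b can be checked numerically. The following is a minimal sketch, not part of the original analysis: it integrates the two equations with a fixed-step fourth-order Runge-Kutta scheme, starting from an operative system, and compares the result with the well-known steady-state value $\mu/(\lambda+\mu)$. The function name `integrate_two_state`, the step size, and the integration horizon are illustrative assumptions; the rates correspond to MTTF = 2047 hr and MTTR = 4 hr, values used later in the paper.

```python
def integrate_two_state(lam, mu, t_end, dt=0.05):
    """Integrate Eqs. 1.a/1.b with classical RK4, starting from P0 = 1, P1 = 0.

    lam and mu are the failure and repair rates (per hour); t_end and dt are
    in hours. Returns (P0, P1) at time t_end.
    """
    def deriv(p0, p1):
        # Right-hand sides of Eq. 1.a and Eq. 1.b
        return (-lam * p0 + mu * p1, lam * p0 - mu * p1)

    p0, p1 = 1.0, 0.0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        k1 = deriv(p0, p1)
        k2 = deriv(p0 + 0.5 * dt * k1[0], p1 + 0.5 * dt * k1[1])
        k3 = deriv(p0 + 0.5 * dt * k2[0], p1 + 0.5 * dt * k2[1])
        k4 = deriv(p0 + dt * k3[0], p1 + dt * k3[1])
        p0 += dt / 6.0 * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0])
        p1 += dt / 6.0 * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1])
    return p0, p1


# Illustrative rates (assumption): MTTF = 2047 hr, MTTR = 4 hr
lam = 1.0 / 2047.0
mu = 1.0 / 4.0
p0, p1 = integrate_two_state(lam, mu, t_end=200.0)
steady = mu / (lam + mu)  # steady-state probability of being operative
print(p0, steady)
```

With a time constant of $1/(\lambda+\mu)$, roughly 4 hours here, the transient has died out long before 200 hours, so the integrated $P_0$ and the steady-state value should agree closely.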
The solution [12] leads to

$$A(t) = P_0(t) = \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu}\, e^{-(\lambda+\mu)t} \tag{2.a}$$

where A(t) is the availability. The steady-state availability A is

$$A = \lim_{t \to \infty} P_0(t) = \frac{\mu}{\lambda+\mu} \tag{2.b}$$

Hence for a simplex system without built-in redundancy (e.g., hot-standby), A can be rewritten in terms of MTTF (Mean Time to Fail) and MTTR (Mean Time to Repair) as

$$A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}$$

where MTTF = $1/\lambda$ and MTTR = $1/\mu$, and the unavailability is

$$\bar{A} = 1 - A = \frac{\lambda}{\lambda+\mu} \tag{3}$$

with $\bar{A}$ being the unavailability. Eq. 3 results from the state-space diagram for a single-component repairable system. The expected cumulative steady-state system downtime can be expressed as

$$E(\text{cumulative downtime in a steady-state period of length } T) = \frac{\lambda}{\lambda+\mu}\, T \tag{4}$$

Using 525,600 minutes per year as a base, the downtime objective for a simplex unit is

$$\frac{\lambda}{\lambda+\mu} \times 525{,}600 \ \text{minutes per year}$$

The unavailability of a complex system of two identical simplex units connected in parallel and operating in hot-standby mode is given by

$$\bar{A} = \left(\frac{\lambda}{\lambda+\mu}\right)^{2} \tag{5}$$

Eq. 5 results from the state-space diagram for a two-component repairable system. For a given MTTF, the availability is influenced solely by the MTTR, which is set according to the criticality of the system. Typical restoration times that represent a cross-section of systems treated with different degrees of urgency are:

• CCIR uses an MTTR of 10 hr, a relatively high figure based on the assumption of no technician on duty

• U.S.
Department of Defense may use figures as low as 20 min, based on the assumption that a technician and spare parts are available on site

• An average of 4 hr/failure is the basis for achieving an objective of two minutes of access downtime in connection with the SS7 Common Channel Signaling network (Bellcore, 1993)

For the particular case of the parallel A-link access to the SS7 Common Channel Signaling network, the downtime objective is

$$\left(\frac{\lambda}{\lambda+\mu}\right)^{2} \times 525{,}600 = 2 \ \text{minutes per year} \tag{6}$$

Assuming an average restoration time (MTTR) of 4 hours, corresponding to a repair rate of $\mu$ = 6/day, it follows that the downtime objective for the parallel system is met at an MTTF of 2047 hr/failure.

2.2 DOWNTIME DISTRIBUTION

Barlow [8] gives the exact distribution of the downtime of a simplex system with exponential failure distribution $F(t) = 1 - e^{-at}$ and exponential repair distribution $G(t) = 1 - e^{-bt}$, where a and b are the failure and repair rates respectively. With these distributions and the Laplace-Stieltjes transform, the distribution $\Omega(t,x)$ of a downtime x during an observed time t is given as

$$\Omega(t,x) = e^{-a(t-x)} \left[ 1 + \sqrt{ab(t-x)} \int_0^x e^{-by}\, y^{-1/2}\, I_1\!\left(2\sqrt{ab(t-x)\,y}\right) dy \right] \tag{7}$$

where $I_1(x)$ is the Bessel function of order 1 for the imaginary argument, defined by

$$I_1(x) = \sum_{j=0}^{\infty} \frac{(x/2)^{2j+1}}{j!\,(j+1)!}$$

A simplex system with a cumulative average downtime objective of 17.5 hr/yr and an average restoration time of 4 hr/failure has the selected distribution values among the population of systems shown in Table 2 [9].

Table 2: Downtime distribution of simplex system.
| Expected cumulative downtime | 0 | > 17.5 hr | > 40 hr | > 53 hr |
|---|---|---|---|---|
| Percent of units in the population | 1.2% | 43.2% | 5% | 1% |

While the parameter of interest for the simplex system is the downtime distribution among a population of systems having the same design objective, a parallel architecture brings an additional focus: its downtime objective is very small, so a large percentage of the population has zero downtime over one year. The analytical solution of the downtime distribution for the parallel system is more complex. The numerical solution is not straightforward; however, upper and lower bounds on the downtime distribution can be found that provide good estimates.

The model for the parallel system assumes a renewal process with negative exponential failure and restoration time distributions. The distributions are denoted by $F(t) = 1 - e^{-t/U}$ and $G(t) = 1 - e^{-t/D}$, where U is the expected time to unit failure and D is the expected unit restoration time. The failure and restoration rates used in the equations below are defined as reciprocals of U and D [the rate definitions are not legible in the source text; see [7]].

For a downtime $\alpha$ and its distribution $F_D$, the Laplace-Stieltjes transform of the steady-state cumulative downtime distribution over a time period of length T is given by [7]

[Eq. (8): Laplace-Stieltjes transform of $F_D(T,\alpha)$; not legible in the source text, see [7]]

where $w_1$ and $w_2$ are the solutions of

[Eq. (9): quadratic equation in w; not legible in the source text, see [7]]

An exact solution of Eq. 8 is not possible [7]; however, upper and lower bounds on the downtime distribution $F_D(T,\alpha)$ are found to be good estimates of the downtime distribution associated with a small downtime objective such as the two-minute objective. In the following analysis only the upper bound is considered, since it is the more conservative estimate. The upper bound can be found for all $0 \le \alpha \le T$:

[Eq. (10): upper bound on $F_D(T,\alpha)$ in terms of $H(T)$ and $K_1$; not legible in the source text, see [7]]

where H(T) is given by Eq.
13 and

$$K_1(T) = B \exp(w_1 T) + (1 - B) \exp(w_2 T) \tag{11}$$

where B is given by

[Eq. (12): definition of B in terms of $w_1$, $w_2$ and the unit rates; not legible in the source text, see [7]]

While the availability of spares is considered to be unlimited in [7-9], the analysis in [11] considers the effect of limited spares on the downtime. That work is a theoretical extension (with examples) of Barlow and Proschan's "Mathematical Theory of Reliability" [8]. It discusses a methodology that addresses the success probability of an enhanced mission time with selected repair actions and spares. Spares have to be kept limited in complex practical systems because of cost and other considerations. The work studies the effect of spare availability on the availability of the system, in addition to the failure and repair rates. Eq. 7 is rewritten there to include a "sparing factor" in addition to the other reliability parameters.

2.3 PROBABILITY OF ZERO DOWNTIME

End office switching system access to the SS7 Common Channel Signaling network is an example of a system designed to meet a very small downtime objective, such as 2 minutes of downtime over one year. This objective is small compared to a year of operating time and equals an ineffective attempt rate of 3.8 per million. As a consequence, several years of observation time are required in order to obtain a reliable estimate of the average. Another consequence is that, within a given population, the number of systems that experience zero downtime is expected to be very large.

According to Hamilton [7], the probability that a parallel system has zero downtime during a steady-state period of length T is derived using the steady-state downtime distribution given by Eq. 8. This probability is denoted as 1 - H(T):

[Eq. (13): expression for $1 - H(T)$ as a weighted sum of $\exp(w_1 T)$ and $\exp(w_2 T)$ with weight A; not legible in the source text, see [7]]

where

[Eq. (14): definition of A in terms of $w_1$, $w_2$ and the unit rates; not legible in the source text, see [7]]

and $w_1$ and $w_2$ are the solutions of Eq. 9.

3.
SIMULATION MODELING

As the complexity of system architectures grows beyond the simplex and the parallel model, simulation offers an alternative way of predicting system performance. Backbones of the next-generation Internet are among the complex networks that need to be analyzed in terms of availability and downtime distribution in order to provide a comprehensive prediction extending beyond the single-figure average downtime. As a first step, however, the simulation model must be validated, i.e. it must be determined whether it is an accurate representation of the actual system being analyzed and whether it faithfully simulates the real complex network. In other words, evidence is needed to support the model's credibility.

While the simulations in this work focus on the reliability of the MPLS backbone, the simulations in [10] deal with other performance issues. In [10] a simulator is proposed for MPLS path restoration that can simulate the mechanism of the Haskin scheme, or "fast restoration", the Makam scheme, or "end-to-end" restoration, and the dynamic scheme. It helps in evaluating the reliability of the MPLS restoration backbone as well as in comparing QoS traffic with best-effort traffic. The performance evaluation criteria are packet loss, reordering of packets, and resource utilization. That paper covers the 1:1 protection topology and promises to evaluate the more complex 1:N topology.

3.1 MODEL VALIDATION

The validation procedure compares the analytical results of the parallel architecture with the results obtained from its simulation model with respect to the following parameters:

1. The availability A of the system
2. The average expected downtime per year
3. The probability of zero downtime among a population of systems

The analytical solutions for the first two parameters follow from the solution of the Markov differential Eqs.
1.a and 1.b for the continuous process [9]. The solution for the third parameter is obtained by using the Laplace-Stieltjes transform, Eq. 13.

The simulator RAPTOR is used for the validation of the simple parallel architecture as well as for the subsequent simulations of the three complex network architectures of the n-tuple bridge type, with the following considerations:

1. An effective system population of 10,000 is assumed in the simulation processes.
2. The input data for the validation process, i.e. MTTF and MTTR, are based on an average downtime of two minutes per year. An assumed average restoration time of four hours, exponentially distributed, corresponds to an MTTF of 2047 hr, also exponentially distributed.

Results from the simulation model are compared to the analytical results by assigning each block in the Reliability Block Diagram (RBD) the MTTR and MTTF values assumed in the analytical model. The validations are as follows.

3.1.1. Validation of the Availability. The analytical availability figure based on exponentially distributed MTTR = 4 hr and MTTF = 2047 hr is $A_1$ = 0.9999962. The availability obtained from the RBD model with 10 replications is $A_2$ = 0.99999629 with a standard deviation $\sigma$ = 0.00000179. The hypothesis test statistic is

$$t = \frac{A_2 - A_1}{S/\sqrt{n}}$$

with the rejection region $|t| > t_{\alpha/2;\,(n-1)}$, where $t_{0.025;\,9}$ = 2.262:

$$t = \frac{0.99999629 - 0.9999962}{0.00000179/\sqrt{10}} = 0.159$$

Hence |t| = 0.159 is not greater than $t_{0.025;\,9}$ = 2.262 and is not in the rejection region. Therefore, at the $\alpha$ = 0.05 level of significance, it can be concluded that there is no significant difference between the two results.

3.1.2. Validation of the Average Downtime. The average downtime obtained from the RBD, $DT_2$ = 1.95 minutes with a standard deviation $\sigma$ = 0.946, is compared with the analytical average downtime over one year of $DT_1$ = 2 minutes:

$$|t| = \frac{|1.95 - 2|}{0.946/\sqrt{10}} = 0.167 < t_{0.025;\,9} = 2.262$$

Similarly, it can be concluded that there is no significant difference between the analytical and simulation results at the assumed significance level.

3.1.3. Validation of the Probability of Zero Downtime. The analytical probability of zero downtime in the system population at a system downtime objective of two minutes per year is $P_1 = 1 - H_1(1\,\text{yr})$ = 98.35%. The respective figure obtained from the simulation of the parallel architecture is $P_2 = 1 - H_2(1\,\text{yr})$ = 98.56%, with a standard deviation $\sigma$ = 0.435:

$$t = \frac{98.56 - 98.35}{0.435/\sqrt{10}} = 1.525 < t_{0.025;\,9} = 2.262$$

The same acceptance is therefore justified for the zero downtime probability.

Following the positive validation of the reliability parameters for the simple parallel architecture, simulations are carried out to analyze the behavior of complex architectures for an Internet-based backbone. The chosen models represent MPLS backbones with restoration and dynamic routing capabilities. The simulations are intended to analyze the backbone behavior between the two edge routers of an MPLS domain, i.e. the ingress and the egress, by producing two types of availability information:

1. Single-figure availability objectives: availability, downtime over one year, and probability of zero downtime over one year
2. Availability distribution information: complementary cumulative downtime distribution

3.2 NETWORK ARCHITECTURE

A scalable model is used to enable comparison between backbones of various sizes. Assuming links of finite reliability p and perfectly reliable nodes, a network model with protection as in Fig. 3 is used as the basis for the simulation analysis in this paper. The scaling elements are the links 1.1, 1.2, ..., 1.n+1 in the protected path and 2.1, 2.2, ..., 2.n+1 in the backup path. The cross links are numbered 1, 2, ..., n, with n being the scaling parameter of the network. This model represents a complex bridge of n cross links, referred to as an n-tuple bridge.
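The structure of the n-tuple bridge can also be explored with a steady-state calculation. The sketch below is not the RAPTOR simulation used in this paper; it is an independent illustration that assumes perfectly reliable nodes, statistically independent links, and the same steady-state unavailability MTTR/(MTTF + MTTR) for every link. It scans the two paths column by column and tracks which of the two current nodes can still be reached from the ingress. The function names are hypothetical.

```python
MINUTES_PER_YEAR = 525_600


def bridge_unavailability(n, q):
    """Probability that ingress and egress are disconnected in an n-tuple
    bridge whose links are independently down with probability q.

    n is the number of cross links; n = 0 reduces to two links in parallel.
    """
    p = 1.0 - q
    # state = (protected-path node reachable, backup-path node reachable)
    states = {(True, True): 1.0}  # both paths start at the ingress
    for _ in range(n):
        nxt = {}
        for (top, bot), prob in states.items():
            for top_up in (True, False):
                for bot_up in (True, False):
                    for cross_up in (True, False):
                        weight = prob
                        for up in (top_up, bot_up, cross_up):
                            weight *= p if up else q
                        t = top and top_up
                        b = bot and bot_up
                        if cross_up and (t or b):
                            t = b = True  # cross link joins the two nodes
                        nxt[(t, b)] = nxt.get((t, b), 0.0) + weight
        states = nxt
    # Last column: each reachable node has one final link to the egress.
    connected = 0.0
    for (top, bot), prob in states.items():
        usable = int(top) + int(bot)
        connected += prob * (1.0 - q ** usable)
    return 1.0 - connected


def downtime_minutes_per_year(n, mttf_hr, mttr_hr):
    """Expected downtime per year from the steady-state unavailability."""
    q = mttr_hr / (mttf_hr + mttr_hr)  # link unavailability (Eq. 3)
    return bridge_unavailability(n, q) * MINUTES_PER_YEAR


for n in (0, 1, 3, 6):
    print(n, round(downtime_minutes_per_year(n, 2047.0, 4.0), 2))
```

With MTTF = 2047 hr and MTTR = 4 hr, n = 0 gives about 2 minutes per year (the parallel system) and n = 1 about 4 minutes. For larger n the figure grows roughly in proportion to n + 1, the number of link pairs whose joint failure disconnects the two edge routers. Only forward reachability needs to be tracked because, with two paths, stepping backwards along one path never opens a new route to the egress.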
Three variations are considered in the simulation processes, namely the single, triple, and 6-tuple bridge.

[Fig. 3: n-tuple Bridge Network. Protected-path links 1.1, 1.2, 1.3, ..., 1.n+1; backup-path links 2.1, 2.2, 2.3, ..., 2.n+1; cross links 1, 2, ..., n.]

4. SIMULATION RESULTS

The parameters analyzed in the simulations are the availability, the average downtime over one year, the probability of zero downtime over one year, and the downtime distributions. The results obtained from the simulations are then compared to the analytical results of the parallel model.

4.1 THE AVAILABILITY

The availability figures of the simulated models depend directly on the MTTF and MTTR of the links as well as on the size of the models. The simulations are based on equal rates for all links. For the parallel system, the unit restoration time is kept constant at 4 hr/failure while the MTTF values are 2047 hr, 1445 hr, and 1179 hr. These figures result in an average downtime over one year of 2, 4, and 6 minutes respectively. Fig. 4 shows the availability of the three models obtained from simulation processes over a population of 10,000 samples, compared to the analytical availability figure of the parallel architecture. The links in the simulated models are configured with failure and restoration rates identical to those of the parallel system. The availability ranges between 0.9999962 and 0.9999886 in the parallel model, compared to between 0.999974 and 0.999917 in the 6-tuple model.

[Fig. 4: Effect of Failure-rate and Scaling on Availability. Availability versus downtime objective per year (mins) for the parallel, single-bridge, triple-bridge, and 6-tuple bridge models.]

4.2 AVERAGE DOWNTIME IN ONE YEAR

In the second simulation, the average cumulative downtime figures of the three complex models are compared to the steady-state analytical cumulative downtime values of the simple parallel model (Fig. 5). The comparisons show an increase in the cumulative downtime in two ways. First, due to the increased failure rate, shown as an increase of the downtime objective of the parallel model from two minutes to six minutes.
Second, due to architecture scaling from a single bridge to a triple bridge and finally to a 6-tuple bridge. The restoration time is kept constant at 4 hours/failure during the simulation processes. The availability and cumulative downtime figures for the complex models are related to each other, as are those from the analytical model, by

$$E(\text{cumulative downtime in a steady-state period of length } T) = (1 - A) \times T$$

where A is the availability, E is the expected cumulative downtime, and T is the total observed time, such as one year. The expected cumulative downtime over one year for the 6-tuple model from this relation is 13.6656 minutes, compared to the figure of 13.666 minutes obtained from the simulation. The error between the two results is 0.003%.

Fig. 5: Effect of Failure-rate and Scaling on Downtime.

Table 3: Cumulative Downtime/yr (minutes).

| parallel system | single bridge | triple bridge | 6-tuple bridge |
|---|---|---|---|
| 2 | 4.078 | 7.7 | 13.666 |
| 4 | 7.64 | 16.69 | 27.232 |
| 6 | 12.13 | 23.66 | 43.37 |

Table 3 compares the simulation results of the cumulative downtime per year for the three complex models with the analytical results of the simple parallel system. The links of the analytical parallel model and the simulation models are configured with the same failure and restoration rates.

4.3 PROBABILITY OF ZERO DOWNTIME

The third group of simulations is intended to determine the expected number of systems among a population that will experience zero downtime over one year. Since only little downtime may be observed under the considered downtime objectives of two to six minutes per year, a very large number of population samples is needed to obtain a reliable estimate. The simulation results are compared with the analytical results of the zero
downtime probability of the simple parallel system. The analytical availability model developed in [7] shows that 98.35% of the system population will experience zero downtime over one year under an average downtime objective of two minutes. This figure decreases to 96.73% at an average downtime objective of 4 minutes and to 95.13% at an objective of 6 minutes. The simulation model of the complex system closest to the parallel system, i.e. the single bridge system, results in zero downtime probability figures of 97.05%, 93.69%, and 90.44% respectively under the same failure and restoration rates as the parallel system.

Fig. 6 shows the behavior of the probability of zero downtime of the three complex models compared to the analytical simple parallel model. Table 4 summarizes the results of the simulation processes for the probability of zero downtime among system populations in the three complex simulation models compared to the analytical parallel model. The links of the analytical parallel model and the simulation models are configured with the same failure and restoration rates.

Table 4: Probability of Zero Downtime.
| Entry | parallel system | single bridge | triple bridge | 6-tuple bridge |
|---|---|---|---|---|
| Probability of zero downtime | 98.35% | 97.05% | 93.68% | 88.94% |
| Downtime objective (min/yr) | 2 | 4.08 | 7.7 | 13.67 |
| Probability of zero downtime | 96.73% | 93.69% | 87.10% | 79.14% |
| Downtime objective (min/yr) | 4 | 7.64 | 16.69 | 27.23 |
| Probability of zero downtime | 95.13% | 90.44% | 82.01% | 69.96% |
| Downtime objective (min/yr) | 6 | 12.13 | 23.66 | 43.37 |

The columns in Table 4 represent the three complex systems in addition to the parallel system. Each pair of rows gives the probability of zero downtime for a given system (the upper entry) associated with a given reliability objective (the lower entry). For example, the probability of zero downtime for the parallel system is 98.35% at a reliability objective of two minutes, which decreases to 95.13% at an objective of six minutes. Similarly, the probability of zero downtime for a 6-tuple bridge is 88.94% at an objective of 13.67 minutes (a figure resulting from configuring the components of the 6-tuple bridge with the same failure and repair rates as those of the parallel system for a two-minute objective), which decreases to 69.96% at an objective of 43.37 minutes.

[Fig. 6: Probability of Zero Downtime per Year. Probability of zero downtime versus downtime objective per year (mins) for the parallel, single-bridge, triple-bridge, and 6-tuple bridge models.]

4.4 DOWNTIME PROBABILITY DISTRIBUTION

The fourth type of simulation is the most detailed one, as it moves beyond the single-figure average to provide more comprehensive details on the behavior of the units among a population designed with an equal downtime objective. The probability of zero downtime in one year is a single figure giving the percentage of systems that will experience no downtime. The average downtime gives the mean downtime experienced across a large population of systems. But neither shows how the individual units, all expected to experience the same average downtime, will behave in detail. Here again, simulation results are compared to the analytical results of the simple parallel system.
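The analytical zero-downtime figures quoted for the parallel system (98.35%, 96.73%, and 95.13%) can be cross-checked with a small first-passage calculation. The sketch below is not the Laplace-Stieltjes derivation of [7]; it is an independent Markov-chain illustration that assumes two identical units with exponential failure and repair times, independent repair of each unit, and a system that starts in steady state. Zero downtime over the period means that the state "both units down" is never entered. The function name `prob_zero_downtime` is hypothetical.

```python
import math

HOURS_PER_YEAR = 8760.0


def prob_zero_downtime(mttf_hr, mttr_hr, horizon_hr=HOURS_PER_YEAR):
    """Probability that a two-unit parallel system is never fully down
    during a period of length horizon_hr, starting from steady state."""
    lam, mu = 1.0 / mttf_hr, 1.0 / mttr_hr
    # Up states: 0 = both units up, 1 = exactly one unit up.
    # Generator restricted to the up states; leaving them is system failure.
    m00, m01 = -2.0 * lam, 2.0 * lam
    m10, m11 = mu, -(lam + mu)
    trace = m00 + m11
    det = m00 * m11 - m01 * m10
    root = math.sqrt(trace * trace - 4.0 * det)
    w1, w2 = (trace + root) / 2.0, (trace - root) / 2.0  # eigenvalues
    # Steady-state probabilities of the two up states
    total = lam + mu
    pi0 = (mu / total) ** 2
    pi1 = 2.0 * lam * mu / total ** 2

    def term(w_keep, w_drop):
        # pi * (M - w_drop * I) * ones, scaled by exp(w_keep * horizon)
        row0 = (m00 - w_drop) + m01
        row1 = m10 + (m11 - w_drop)
        return math.exp(w_keep * horizon_hr) * (pi0 * row0 + pi1 * row1)

    # exp(M t) = [e^{w1 t}(M - w2 I) - e^{w2 t}(M - w1 I)] / (w1 - w2)
    return (term(w1, w2) - term(w2, w1)) / (w1 - w2)


for mttf in (2047.0, 1445.0, 1179.0):
    print(mttf, round(100.0 * prob_zero_downtime(mttf, 4.0), 2))
```

Under these assumptions the three MTTF values used for the parallel system, with MTTR = 4 hr, should land close to the analytical percentages quoted above. The complement of the first value corresponds to the share of systems expected to see any downtime at all within the year.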
In the parallel system with a two-minute downtime objective, only 1.65% of the systems are expected to fail over the duration of one year. But more than 60% of those expected to fail over one year, or one percent of the entire population, will have a cumulative downtime of over one hour. Furthermore, 0.6% and 0.2% of the entire population are expected to have a downtime of more than two hours and four hours respectively. Similarly, the simulation results of the 6-tuple bridge at a downtime objective of 13.67 minutes per year show that 1% of the population will have a downtime of over five hours. Furthermore, 1.58%, 2.62%, 4.18%, and 6.72% of the population will have a cumulative downtime over one year of more than 4, 3, 2, and 1 hours respectively.

Fig. 7 shows the complementary downtime distribution for the three complex models compared to the analytical parallel architecture. Link failure rates and restoration rates for all systems are configured identically with MTTR = 4 hr and MTTF = 2047 hr, corresponding to an average downtime objective of two minutes over one year for the parallel architecture.

[Fig. 7: Complementary Downtime Distribution over One Year. Fraction of the population exceeding a downtime threshold (mins) for the parallel, single-bridge, triple-bridge, and 6-tuple bridge models; MTTR = 4 hr, MTTF = 2047 hr.]

Table 5 summarizes the downtime distribution behavior of the three models compared to the analytical parallel architecture model.

Table 5: Complementary Downtime Distribution.

| Percent of population | parallel system | single bridge | triple bridge | 6-tuple bridge |
|---|---|---|---|---|
| Zero downtime | 98.35% | 97.05% | 93.68% | 88.94% |
| % > 1 hr | 1.00% | 1.85% | 4% | 6.72% |
| % > 2 hr | 0.62% | 1.18% | 2.43% | 4.18% |
| % > 3 hr | 0.38% | 0.89% | 1.45% | 2.62% |
| % > 4 hr | 0.23% | 0.56% | 0.91% | 1.58% |
| Downtime/yr (min) | 2 | 4.078 | 7.7 | 13.666 |

5.
CONCLUSIONS

This paper presents a simulation-based approach for comprehensive downtime prediction of the network backbones found in the MPLS-based next-generation Internet. Available analytical models for comprehensive downtime prediction handle only simple network architectures, namely the single-component simplex system and the simple backup (parallel) system. In this work a simulation model is used to analyze networks of complex structure with a tendency to scale up. Comprehensive downtime prediction adds more confidence to the reliability analysis of systems with random failure behavior than the more common single-figure average downtime prediction.

The average downtime approach predicts equal downtime for all members of a wide population of systems that are equally designed to deliver a specific downtime objective. The comprehensive downtime approach, on the other hand, provides information on the distribution of the expected downtime behavior across the system population. Hence, it makes it possible to determine the extreme cases as well as specific downtime behavior such as the percentage of systems with zero downtime over one year, the percentage within the designed downtime objective, or the percentages exceeding specific limits.

In summary, although the approach considered in this paper is aimed at the network infrastructure of the next-generation Internet, it is equally applicable to complex networks of similar structure.

REFERENCES

[1] [RFC2212] S. Shenker et al., "Specification of Guaranteed QoS", 1997.
[2] [RFC2475] S. Blake et al., "An Architecture for Differentiated Services", 1998.
[3] P. Ford and Y. Bernet, "Integrated Services-over-Differentiated Services", IETF Internet Draft, 1998.
[4] Y. Bernet et al., "A Framework for Differentiated Services", IETF Internet Draft, 1999.
[5] H. Malec, “Communications reliability: a historical perspective”, IEEE Transactions on Reliability, vol. 47, pp. 333-344, September 1998.
[6] “Economics of High Availability for Telecommunications Systems”, An Intel® Primer, Intel Corporation, 2001.
[7] C.M. Hamilton and N.A. Marlow, “Analyzing telecommunications network availability performance using the downtime probability distribution”, Globecom, December 1991.
[8] R.E. Barlow and F. Proschan, “Mathematical Theory of Reliability”, New York: John Wiley, 1965.
[9] E.A. Elsayed, “Reliability Engineering”, Addison Wesley Longman, Inc., 1996.
[10] G. Ahn and W. Chun, “Design and Implementation of MPLS Network Simulator Supporting LDP and CR-LDP”, IEEE International Conference on Networks (ICON 2000), Singapore, Sept. 2000.
[11] L.H. Crow, “A Method for Achieving an Enhanced Mission Capability”, 48th Reliability and Maintainability Symposium, RAMS 2002, The International Symposium on Product Quality & Integrity, 2002.
[12] R. Billinton and R.N. Allan, “Reliability Evaluation of Engineering Systems”, Plenum Press, 1992.

BIOGRAPHIES
Wajdi Fawzi M. Al-Khateeb received his M.Sc. degree from Berlin University of Technology, Germany in 1968. He joined Baghdad University of Technology in 1968 and later pursued a career in planning and consultancy engineering in the telecommunications industry from 1971 to 1993, with a focus on the reliability of telecommunications systems and disaster recovery. He joined IIUM as a lecturer in 1995. His research interests include reliability of telecommunications systems and computer networks, network simulation and modeling, traffic engineering and quality of service (QoS) in computer networks and the Internet.

Sufyan Al-Irhayim is currently Associate Professor at the Department of Computer Engineering, University of Bahrain. He received his B.Sc. degree from the University of Mosul, Iraq in 1977. He obtained his M.Sc.
in 1980 and his PhD in 1985 from the University of Bradford, England. He joined the University of Mosul, Iraq in 1987, IIUM in 1991, and the University of Bahrain, Bahrain in 2004. His research interests include performance and reliability of IP networking, mobility and multicasting in computer networking, network security, and intelligent reactive compensators.

Prof. Khalid al-Khateeb studied in the United Kingdom and graduated from the Royal College with an Honors B.Sc. degree in Electronics in 1966. He received his Master’s degree from Salford University in 1971 and his PhD from Manchester University in 1975. His research interests include electronics, communications, computer applications and engineering education. His academic career spans decades, during which he has worked at various universities in the UK, USA, Iraq, Algeria, Jordan and Malaysia.