

IIUM Engineering Journal, Vol. 5, No. 1, 2004 W. F. Al-Khateeb et al. 


COMPREHENSIVE DOWNTIME PREDICTION IN NEXT-GENERATION INTERNET

W. F. AL-KHATEEB, S. AL-IRHAYIM, K. A. AL-KHATEEB 

Department of Electrical and Computer Engineering, Faculty of Engineering, International 
Islamic University Malaysia, 53100 Kuala Lumpur, Malaysia. 

e-mail: wajdi@iiu.edu.my 

Abstract: The benchmark for the reliability quality of networks depends mainly on the
accuracy and comprehensiveness of the reliability parameters. Downtime prediction of a
communication system is crucial for the quality of service (QoS) offered to the end-user.
The Markov model enables analytical calculation of a single-figure average cumulative
downtime over one year. Because failures are random, this single average generally does
not adequately describe the wide range of service performance that is likely to be
experienced in communications systems. It is therefore more appropriate to add the
downtime distribution obtained from network availability models in order to predict the
expected cumulative downtime and other performance parameters across a large population
of systems. The distribution approach provides more comprehensive information about the
behavior of the individual systems. The Laplace-Stieltjes transform enables analytical
solutions for simple network architectures, i.e. the simplex system and the parallel
system. This paper uses simulations to determine reliability parameters for complex
architectures such as the Multiprotocol Label Switching (MPLS) backbone planned for the
next-generation Internet. In addition to the single-figure downtime, simulations provide
other reliability parameters such as the probability of zero downtime. The paper also
considers the downtime distribution among a population of equally designed systems.

Key Words: Reliability, availability, downtime, Multiprotocol label switching 

1. INTRODUCTION 

 The Internet is challenged to narrow the significant quality and reliability gaps that
still exist between the circuit-switched, voice-based telecom networks and the
packet-switched, IP-based Internet. QoS is a mechanism that distinguishes traffic types so
that they can be classified and administered differently through the network. It focuses
on parameters that depend on the application framework; typical parameters are timeliness,
accuracy and reliability.

QoS has become a key research area in the Internet Engineering Task Force (IETF).
Several RFCs [1], [2] and Internet drafts [3], [4] have been published, proposing differing
technologies for defining and handling reservation and differentiated services, in
short the “guarantees” needed by mission-critical and real-time services. At the same time,
QoS technology over the Internet has not been comprehensively defined, and numerous tools
still require definition. No doubt these gaps will be filled one day, or at least redefined,
given sufficient marketing and economic impetus as well as widespread acceptance among
the user community. The focus of this paper, however, is on the platform needed by all
those technologies in order to become acceptable to critical users of the
telecommunications community, i.e. the availability of an Internet backbone that is
able to function in a similar way to the conventional telecom infrastructure, which has been
known for its high reliability since the 1960s.

1.1 INTERNET RELIABILITY BENCHMARKING 

In order to appreciate the role of high availability in next generation Internet it is 
helpful to highlight the difference in the terminology before proceeding with any analysis. 
While “reliability” R(t) is the probability that a system will not fail during the time t, the 
availability A(t) of a system is the probability that the system is operating successfully at 
time t, or the ratio of the time that the system is available to the total time. It is common
to express availability either as a percentage, i.e. a number of 9’s, or as downtime per year.
Although 99% availability may sound good, it still results in over three and a half days of
downtime per year and is therefore far from acceptable to the telecom industry, which is
used to availabilities in the range of 99.99% to 99.999% [5], or four to five nines. Table 1
shows figures suggested by Intel [6] to reflect possible availability benchmarking.

Table 1: Typical Suggested Availability Benchmarking 

9’s   Availability   Downtime/yr      Examples
1     90.0%          36 day 12 hr     Personal clients
2     99.0%          87 hr 36 min     Entry-level businesses
3     99.9%          8 hr 46 min      ISPs, mainstream business
4     99.99%         52 min 33 s      Data centers
5     99.999%        5 min 15 s       Carrier-grade telco, banking
6     99.9999%       31.5 s           Military defense system
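
The conversion between a number of 9's and downtime per year can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the 525,600-minute year is the base the paper itself uses later.

```python
MINUTES_PER_YEAR = 525_600  # 365 days x 24 h x 60 min


def availability_for_nines(nines: int) -> float:
    """Availability with the given number of leading 9's, e.g. 3 -> 0.999."""
    return 1.0 - 10.0 ** (-nines)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year (minutes) implied by a steady-state availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for nines in range(1, 7):
    a = availability_for_nines(nines)
    print(f"{nines} nines: {downtime_minutes_per_year(a):.3f} min/yr of downtime")

# Bell 1964 objective: 2 hours of downtime in 40 years, expressed per year.
bell_minutes_per_year = 2 * 60 / 40
bell_unavailability = bell_minutes_per_year / MINUTES_PER_YEAR
print(f"Bell objective: {bell_minutes_per_year} min/yr, unavailability {bell_unavailability:.2e}")
```

The same helper applied to an unavailability between 10^-5 and 10^-6 shows why an objective of a few minutes per year falls between five and six 9's.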

 
Two notable reliability objectives are:

• The Bell telephone objective of 1964 [5], ‘no more than 2 hr downtime in 40 yr’, which
converts into ‘no more than 3 min/yr’ when applied to voice communication. In terms of the
number of 9s it lies between five and six 9s.

• The Bellcore objective of 1993 for end office switching system access to the SS7 Common
Channel Signaling (CCS) network [7], ‘two minutes per year’, which equals an access
unavailability of about 3.8 × 10^-6. This objective also lies between five and six 9s.



1.2 MPLS RECOVERY SCHEMES

Recovery in the current IP-based Internet relies on re-computing an alternative path
immediately after the active path fails. The time taken by this computation exceeds the
maximum delay allowed in mission-critical and real-time applications. MPLS-based
recovery, on the other hand, reroutes the traffic to predefined backup paths.

[Figure 1 panels: backbone with recovery (protected route, backup route, ingress, egress);
end-to-end rerouting; fast rerouting; dynamic rerouting]

Fig. 1: MPLS Recovery Schemes.

A widespread topology used in connection with the MPLS backbone consists of a main
path, known as the “protected path”, and one or more backup paths. Each path consists of
a number of disjoint Label Switching Routers (LSRs) and two common edge routers. The two
edge routers shared by the paths are the starting router, or “ingress”, and the end
router, or “egress” (Fig. 1). The backup path between the ingress and the egress is kept
ready to carry the rerouted traffic following a failure in the main path. This arrangement
resembles the 1+1 and 1:1 backup protection/switching in the infrastructures of
conventional telecom backbones. This topology is effective in dealing with:

• Ingress-based protection/switching (pre-negotiated)
• Fast-rerouting restoration by a specific node detecting a failure

In order to allow such a topology to deal with dynamic routing, the protected and the
backup paths are equipped with an equal number of LSRs. The routers in the protected path
are connected to their peers in the backup path through cross-links, allowing dynamic
rerouting around a failed link. The behavior of the restoration mechanisms, i.e. end-to-end
rerouting, fast rerouting, and dynamic rerouting, is shown in Fig. 1. Assuming perfectly
reliable nodes and links with a finite reliability p, a network model with protection for
the MPLS backbone is shown in Fig. 2. The links A through K may assume any failure and
repair rates.

 
Fig. 2: Triple Bridge Network (links labeled A through K).

 

The size of such a network is determined by the number of cross-links. Because of its
bridge structure, such a complex network may be designated an n-tuple bridge. A
network with one, two, three, or n cross-link(s) is called a single-bridge, double-bridge,
triple-bridge, or n-tuple bridge respectively. The scaling of the network structure, i.e. the
number of cross-links, is one of the criteria in evaluating the availability.

2. DOWNTIME OBJECTIVES 

Downtime objectives are intended to predict the amount of time that networks, either
fully or partially, become unavailable, i.e. do not function to provide the intended service.
These objectives are usually stated as a single number equal to the average unavailability
among a population of systems designed to provide a given reliability/availability quality.
Because failures in communications systems, generally of nodes and links, are random, a
single average does not adequately describe the wide range of service performance
that is likely to be experienced. It is therefore more appropriate to add the downtime
distribution obtained from network availability models in order to predict the expected
cumulative downtime and other performance parameters. The downtime distribution makes it
possible to judge whether the range of a given distribution is acceptable for
the type of service being planned, and hence allows consideration of tradeoffs between
equipment failure rate and average restoration time in connection with the architecture of
the system.

2.1 DOWNTIME PREDICTION 

The Markov model applied to a single-component repairable system with a two-state
state-space diagram results in the differential equations

$$P_0'(t) = -\lambda P_0(t) + \mu P_1(t) \tag{1.a}$$

$$P_1'(t) = \lambda P_0(t) - \mu P_1(t) \tag{1.b}$$

where P0 is the probability of being in state 0 (operative), P1 is the probability of
being in state 1 (failed), and λ and µ are the failure and repair rates, both associated
with exponential distributions. The solution [12] leads to

$$A(t) = P_0(t) = \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu}\, e^{-(\lambda+\mu)t} \tag{2.a}$$

where A(t) is the availability. The steady-state availability A is

$$A = \lim_{t\to\infty} P_0(t) = \frac{\mu}{\lambda+\mu} \tag{2.b}$$
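
As an illustration of how quickly the transient availability of a two-state repairable unit settles to its steady-state value, the following sketch evaluates the standard two-state Markov solution for a unit that starts in the operative state. The function names and the sample MTTF/MTTR values (2047 hr and 4 hr, taken from later sections of the paper) are choices made for this example only.

```python
import math


def availability(t_hours: float, mttf_hours: float, mttr_hours: float) -> float:
    """Transient availability of a two-state unit that is operative at t = 0."""
    lam = 1.0 / mttf_hours   # failure rate
    mu = 1.0 / mttr_hours    # repair rate
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t_hours)


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Limit of the transient availability for large t."""
    return mttf_hours / (mttf_hours + mttr_hours)


for t in (0.0, 1.0, 4.0, 24.0, 1000.0):
    print(f"t = {t:7.1f} h  A(t) = {availability(t, 2047.0, 4.0):.7f}")
print(f"steady state     A    = {steady_state_availability(2047.0, 4.0):.7f}")
```

With a restoration time of a few hours the exponential term decays within a day or so, which is why the steady-state figure is used for yearly downtime objectives.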

Hence for a simplex system without built-in redundancy (e.g., hot-standby) A can be 
rewritten in terms of MTTF (Mean Time to Fail) and MTTR (Mean Time to Repair) 



$$A = \frac{MTTF}{MTTF + MTTR}$$

where MTTF = 1/λ and MTTR = 1/μ, and the unavailability is

$$\bar{A} = 1 - A = \frac{\lambda}{\lambda+\mu} \tag{3}$$

With $\bar{A}$ being the unavailability, Eq. 3 results from the state-space diagram for a
single-component repairable system. The expected cumulative steady-state system
downtime can be expressed as

$$E(\text{cumulative downtime in a steady-state period of length } T) = \frac{\lambda}{\lambda+\mu}\, T \tag{4}$$

Using 525,600 minutes per year as a base, the downtime objective for a simplex unit is

$$\frac{\lambda}{\lambda+\mu} \times 525{,}600 \ \text{minutes per year}$$

The unavailability of a complex system of two identical simplex units connected in
parallel and operating in hot-standby mode is given by

$$\bar{A} = \left(\frac{\lambda}{\lambda+\mu}\right)^{2} \tag{5}$$

Equation 5 results from the state-space diagram for a two-component repairable
system.

For a given MTTF, the availability is influenced solely by the MTTR, which is set according
to the criticality of the system. Typical restoration times that represent a cross
section of systems treated with different degrees of urgency are:

• CCIR uses an MTTR of 10 hrs, a relatively high figure based on the assumption that no
technician is on duty.

• The U.S. Department of Defense may use figures as low as 20 min, based on the
assumption that a technician and spare parts are available on site.

• An average of 4 hr/failure is the basis for achieving the objective of two minutes of
access downtime in connection with the SS7 Common Channel Signaling network (Bellcore,
1993).

For the particular case of the parallel A-link access to the SS7 Common Channel
Signaling, the downtime objective is

$$\left(\frac{\lambda}{\lambda+\mu}\right)^{2} \times 525{,}600 = 2 \ \text{minutes per year} \tag{6}$$



Assuming an average restoration time MTTR of 4 hours, corresponding to a repair rate 
of µ = 6/day, it follows that the downtime objective for the parallel system would be met 
at an MTTF of 2047 hr/failure.   
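
The step from the two-minute objective to the required MTTF can be reproduced numerically. The sketch below assumes the steady-state unavailability of two identical, independently repaired units in parallel, i.e. the square of MTTR/(MTTF + MTTR), and solves it for the MTTF; the function names are hypothetical.

```python
import math

MINUTES_PER_YEAR = 525_600


def parallel_downtime_minutes(mttf_hours: float, mttr_hours: float) -> float:
    """Average yearly downtime (minutes) of two identical units in parallel."""
    unit_unavailability = mttr_hours / (mttf_hours + mttr_hours)
    return unit_unavailability ** 2 * MINUTES_PER_YEAR


def mttf_for_parallel_objective(objective_minutes: float, mttr_hours: float) -> float:
    """Unit MTTF (hours) at which the parallel system meets the yearly objective."""
    unit_unavailability = math.sqrt(objective_minutes / MINUTES_PER_YEAR)
    return mttr_hours * (1.0 / unit_unavailability - 1.0)


for objective in (2, 4, 6):
    mttf = mttf_for_parallel_objective(objective, mttr_hours=4.0)
    print(f"{objective} min/yr objective -> unit MTTF of about {mttf:.1f} hr")

print(f"check: MTTF = 2047 hr gives {parallel_downtime_minutes(2047.0, 4.0):.3f} min/yr")
```

For objectives of 2, 4 and 6 minutes per year the results land within about one hour of the MTTF values of 2047 hr, 1445 hr and 1179 hr used in the paper.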

2.2 DOWNTIME DISTRIBUTION 

 According to Barlow [8], the exact distribution of the downtime of a simplex system can
be obtained for an exponential failure distribution F(t) = 1 – e^(–at) and an exponential
repair distribution G(t) = 1 – e^(–bt), where a and b are the failure and repair rates
respectively. With these distributions and the Laplace-Stieltjes transform, the distribution
Ω(t,x) of a downtime x during an observed time t is given as

$$\Omega(t,x) = e^{-a(t-x)}\left[1 + \sqrt{ab(t-x)}\int_0^x e^{-by}\, y^{-1/2}\, I_1\!\left(2\sqrt{ab(t-x)\,y}\right) dy\right] \tag{7}$$

where I1(x) is the Bessel function of order 1 for the imaginary argument, defined by

$$I_1(x) = \sum_{j=0}^{\infty} \frac{(x/2)^{2j+1}}{j!\,(j+1)!}$$

For a simplex system with a cumulative average downtime objective of 17.5 hrs/yr and an
average restoration time of 4 hr/failure, selected points of the downtime distribution
among the population of systems are shown in Table 2 [9].

Table 2: Downtime distribution of simplex system. 

Expected cumulative downtime         0       > 17.5 hr   > 40 hr   > 53 hr
Percent of units in the population   1.2%    43.2%       5%        1%
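
The spread in Table 2 can be illustrated with a small Monte Carlo experiment on a population of identical simplex units. This is a sketch only, not the paper's analytical method: it assumes exponential failure and repair times, units that start the year in the operative state, a 8760-hour year, and an MTTF derived from the 17.5 hr/yr objective; the names and the seed are arbitrary.

```python
import random

HOURS_PER_YEAR = 8760.0
MTTR = 4.0                      # hours per failure
OBJECTIVE = 17.5                # average downtime objective, hours per year
# MTTF chosen so that MTTR / (MTTF + MTTR) equals the target unavailability.
MTTF = MTTR * (HOURS_PER_YEAR / OBJECTIVE - 1.0)


def yearly_downtime(rng: random.Random) -> float:
    """Cumulative downtime (hours) of one simplex unit over one year."""
    t, down = 0.0, 0.0
    while True:
        t += rng.expovariate(1.0 / MTTF)        # time to next failure
        if t >= HOURS_PER_YEAR:
            return down
        repair = rng.expovariate(1.0 / MTTR)    # restoration time
        down += min(repair, HOURS_PER_YEAR - t) # clip at the end of the year
        t += repair


rng = random.Random(2004)
population = [yearly_downtime(rng) for _ in range(20_000)]
size = len(population)

mean_downtime = sum(population) / size
zero_fraction = sum(d == 0.0 for d in population) / size
above_objective = sum(d > OBJECTIVE for d in population) / size
above_40 = sum(d > 40.0 for d in population) / size

print(f"mean downtime        : {mean_downtime:.2f} hr/yr")
print(f"zero downtime        : {zero_fraction:.1%}")
print(f"more than 17.5 hr/yr : {above_objective:.1%}")
print(f"more than 40 hr/yr   : {above_40:.1%}")
```

Although every unit is designed to the same average, only about one unit in a hundred sees no downtime at all, while a sizeable minority exceeds the objective, which is the point the table makes.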

 
While the parameter of interest for the simplex system is the downtime distribution
among a population of systems having the same design objective, a parallel architecture
has an additional focus: its very small downtime objective is associated with a large
percentage of the population having zero downtime over one year. The analytical solution
of the downtime distribution for the parallel system is more complex, and the numerical
solution is not straightforward. However, upper and lower bounds on the downtime
distribution can be found that provide good estimates. The model for the parallel system
assumes a renewal process with negative exponential failure and restoration time
distributions. The distributions are denoted by

$$F(t) = 1 - e^{-t/U} \quad \text{and} \quad G(t) = 1 - e^{-t/D}$$

where U is the expected time to unit failure and D is the expected unit restoration time.

The reciprocals of U and D are also defined as:

$$\lambda_{10} = \frac{1}{U}, \quad \lambda_{01} = \frac{1}{D}, \quad \text{and} \quad \lambda = \lambda_{10} + \lambda_{01}$$

For a downtime α and its distribution FD, the Laplace-Stieltjes transform of the
steady-state cumulative downtime distribution over a time period of length T is given by [7]

]
)()(
)2()(

[exp
)()(
)2()(

)/(),(),(
2121

2
10

2

0 wwww
www

wwww
ww

TFdewF DT
wT

D










 


       (8) 

where w1 and w2 are the solutions of

$$w^{2} + (\lambda + 2\lambda_{10})\,w + 2\lambda_{10}^{2} = 0 \tag{9}$$

An exact solution of Eq. 8 is not possible [7]; however, upper and lower bounds of the
downtime distribution FD(T,α) are found to be good estimates of the downtime
distribution associated with a small downtime objective such as the 2-minute objective. In
the following analysis only the upper bound is considered, since it is the more conservative
estimate. The upper bound can be found for all 0 ≤ α ≤ T,

)]}(1[2exp{)(),( 01   TKTHTFD          (10) 

where H(T) is given by Eq. 13 and  

)(exp)1()exp()(1 21 TwBTwBTK                                              (11) 

where     

10
12

102 



 B
ww

w
B


             (12) 

While the availability of spares is considered to be unlimited in [7-9], the analysis in
[11] considers the effect of limited spares on the downtime. That paper is a theoretical
extension (with examples) of the Barlow and Proschan “Mathematical Theory of
Reliability” [8]. It discusses a methodology that addresses the success probability of
enhanced mission time with selected repair actions and spares. Spares have to be kept
limited in complex practical systems because of cost and other considerations. The work
in [11] studies the effect of spare availability, in addition to the failure and repair
rates, on the availability of the system. Eq. 7 is rewritten there to include a “sparing
factor” in addition to the other reliability parameters.

2.3 PROBABILITY OF ZERO DOWNTIME 

End office switching system access to the SS7 Common Channel Signaling network is
an example of a system designed to meet a very small downtime objective, such as 2 minutes
of downtime over one year. This objective is small compared to a year of operating time and
equals an ineffective attempt rate of 3.8 per million. As a consequence, several years of
observation time are required in order to obtain a reliable estimate of the average. Another
consequence is that, within a given population, the number of systems that experience zero
downtime is expected to be very large.

According to Hamilton [7], the probability that a parallel system has zero downtime 
during a steady state period of length T is derived using the steady-state downtime 
distribution given by Eq. 8. This probability is denoted as 1–H(T). 

 


 ww expexp)/ [1( H(T)-1           (13) 

where 

          
)12(1

)12)(1(
2

)/10(1

2
)/10(

www

ww
A













 (14) 

and w1 and w2 are the solutions of Eq. 9. 
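
For orientation, the probability of zero downtime of a two-unit parallel system can also be obtained from a first-passage calculation on a small Markov chain. The sketch below is not a transcription of Eqs. 13 and 14. It assumes two identical units with independent exponential failures and repairs (one repair per failed unit), a system that is in steady state at the start of the period, and a year of 8760 hours; the function name is hypothetical. The two roots w1 and w2 in the code are the eigenvalues of the chain restricted to its "system up" states.

```python
import math

HOURS_PER_YEAR = 8760.0


def zero_downtime_probability(mttf: float, mttr: float,
                              period: float = HOURS_PER_YEAR) -> float:
    """P(no system outage during `period`) for two parallel repairable units."""
    lam, mu = 1.0 / mttf, 1.0 / mttr

    # Characteristic polynomial of the transient chain (both up, one up):
    # w^2 + (3*lam + mu) * w + 2*lam^2 = 0, with two negative real roots.
    b = 3.0 * lam + mu
    c = 2.0 * lam * lam
    root = math.sqrt(b * b - 4.0 * c)
    w2 = -(b + root) / 2.0
    w1 = c / w2                      # product of the roots is c; avoids cancellation
    e1, e2 = math.exp(w1 * period), math.exp(w2 * period)

    # Survival when starting with both units up: S(0) = 1, S'(0) = 0.
    s_both = (w2 * e1 - w1 * e2) / (w2 - w1)
    # Survival when starting with one unit up: S(0) = 1, S'(0) = -lam.
    k = (-lam - w2) / (w1 - w2)
    s_one = k * e1 + (1.0 - k) * e2

    # Steady-state probabilities of the two "up" states at the start.
    total = lam + mu
    pi_both = (mu / total) ** 2
    pi_one = 2.0 * lam * mu / total ** 2
    return pi_both * s_both + pi_one * s_one


for mttf in (2047.0, 1445.0, 1179.0):
    p = zero_downtime_probability(mttf, 4.0)
    print(f"MTTF = {mttf:6.0f} hr, MTTR = 4 hr -> P(zero downtime in 1 yr) = {p:.2%}")
```

Under these assumptions the values come out close to the 98.35 %, 96.73 % and 95.13 % quoted for downtime objectives of 2, 4 and 6 minutes per year.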

3. SIMULATION MODELING 

As the complexity of system architectures grows beyond the simplex and the parallel
model, simulation offers an alternative way of predicting system performance. Backbones
of the next-generation Internet are among the complex networks that need to be analyzed in
terms of availability and downtime distribution to provide a comprehensive prediction
extending beyond the single-figure average downtime. As a first step, however, the
simulation model has to be validated, i.e. it must be determined whether the model is an
accurate representation of the actual system being analyzed and whether it faithfully
simulates the real complex network. In other words, evidence is needed to support
the model’s credibility.

While the simulations in this work focus on the reliability of the MPLS backbone, the
simulations in [10] deal with other performance issues. In [10] a simulator is proposed for
MPLS path restoration that can simulate the mechanism of the Haskin scheme (“fast
restoration”), the Makam scheme (“end-to-end” restoration), and the dynamic scheme.
It helps in evaluating the reliability of the MPLS restoration backbone as well as in
comparing QoS traffic with best-effort traffic. The performance evaluation criteria are
packet loss, reordering of packets and resource utilization. That paper covers the 1:1
protection topology and promises to evaluate the more complex 1:N topology.

3.1 MODEL VALIDATION 

The validation procedure compares the analytical results of the parallel
architecture with the results obtained from its simulation model with respect to the
following parameters:

1. The availability A of the system 
2. The average expected downtime per year 
3. The probability of zero downtime among a  population of systems 



The analytical solutions for the first two parameters follow from the solution of the
Markov differential Eqs. 1.a and 1.b for the continuous process [9]. The solution for the
third parameter is obtained by using the Laplace-Stieltjes transform, Eq. 13.

The simulator RAPTOR is used for the validation of the simple parallel architecture
as well as for the subsequent simulations of the three complex network architectures of
the n-tuple bridge type, with the following considerations:

1. An effective system population of 10,000 has been assumed in the simulation
processes.

2. The input data for the validation process, i.e. MTTF and MTTR, are based on the
average downtime of two minutes per year. An assumed average restoration time of
four hours, exponentially distributed, results in a failure rate corresponding to
an MTTF of 2047 hr, also exponentially distributed.

Results from the simulation model are compared to the analytical results by assigning each
block in the Reliability Block Diagram (RBD) the assumed values of the MTTR and
MTTF of the analytical model.

The validations are as follows: 

3.1.1. Validation of the Availability. 

 The analytical availability figure A1, based on exponentially distributed MTTR = 4 hrs
and MTTF = 2047 hrs, is A1 = 0.9999962. The availability obtained from the RBD model with
10 replications is A2 = 0.99999629 with a standard deviation σ = 0.00000179.

The hypothesis test statistic is

$$t = \frac{A_2 - A_1}{S/\sqrt{n}}$$

with the rejection region | t | > t α/2 ; (n–1), where t 0.025 ; 9 = 2.262:

t = (0.99999629 – 0.9999962)/(0.00000179 / √10) = 0.159

Since | t | = 0.159 does not exceed t 0.025 ; 9 = 2.262, it is not in the rejection region.
Therefore, at the α = 0.05 level of significance, no difference between the two results
can be detected.

3.1.2. Validation of the Average Downtime. 

The average downtime obtained from the RBD model, DT2 = 1.95 minutes with a standard
deviation σ = 0.9406, is compared with the analytical average downtime over one year of
DT1 = 2 minutes.

| t | = |1.95 – 2|/ (0.946/√10) = 0.167 < t 0.025 ; 9 = 2.262 



Similarly, it can be concluded that there is no difference between the analytical and 
simulation result at the assumed significance level. 

3.1.3. Validation of the Probability of Zero Downtime.  

The analytical probability of zero downtime in the system population at a system
downtime objective of two minutes per year is P1 = 1–H1(1yr) = 98.35%. The respective
figure obtained from the simulation of the parallel architecture is P2 = 1–H2(1yr) = 98.56%,
with a standard deviation σ = 0.435.

t = (98.56 – 98.35)/(0.435/√10)= 1.525 < t 0.025 ; 9 = 2.262 

Similar acceptance is justified for the zero downtime probability. 
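
The three test statistics above can be recomputed with a few lines of code. The sketch takes the critical value 2.262 and the sample figures directly from the text (for the downtime check it uses σ = 0.946, the value that appears in the displayed calculation); the function and variable names are hypothetical.

```python
import math

T_CRITICAL = 2.262   # t(0.025; 9), as quoted in the text
REPLICATIONS = 10


def t_statistic(simulated: float, analytical: float, std_dev: float,
                n: int = REPLICATIONS) -> float:
    """One-sample t statistic comparing a simulated mean with an analytical value."""
    return (simulated - analytical) / (std_dev / math.sqrt(n))


checks = {
    "availability": (0.99999629, 0.9999962, 0.00000179),
    "average downtime (min/yr)": (1.95, 2.0, 0.946),
    "zero-downtime probability (%)": (98.56, 98.35, 0.435),
}

results = {}
for name, (simulated, analytical, std_dev) in checks.items():
    t = t_statistic(simulated, analytical, std_dev)
    results[name] = t
    verdict = "rejected" if abs(t) > T_CRITICAL else "not rejected"
    print(f"{name}: |t| = {abs(t):.3f} -> equality {verdict} at the 5% level")
```

Small differences in the last digit relative to the quoted statistics can arise from rounding of the inputs.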

Following the positive validation of the reliability parameters for the simple parallel
architecture, simulations are carried out to analyze the behavior of complex architectures
for an Internet-based backbone. The chosen models represent MPLS backbones with
restoration and dynamic routing capabilities. The simulations are intended to analyze the
backbone behavior between the two edge routers of an MPLS domain, i.e. the ingress and
the egress, by producing two types of availability information:

1. Single-figure availability objectives: availability, downtime over one year,
and probability of zero downtime over one year

2. Availability distribution information: complementary cumulative downtime
distribution

3.2 NETWORK ARCHITECTURE 

A scalable model is used to enable comparison between backbones of various sizes.
Assuming links of finite reliability p and perfectly reliable nodes, a network model with
protection as in Fig. 3 is set as the basis for the simulation analysis in this paper. The
scaling elements are the parallel links 1.1, 1.2, …, 1.n+1 in the protected path and 2.1,
2.2, …, 2.n+1 in the backup path. The cross links are numbered 1, 2, …, n, with n being the
scaling parameter of the network. This model represents a complex bridge of n cross links,
referred to as an n-tuple bridge. Three variations are considered in the simulation
processes, namely the single, triple, and 6-tuple bridge.

 
Fig. 3: n-tuple Bridge Network (protected-path links 1.1 to 1.n+1, backup-path links 2.1
to 2.n+1, cross links 1 to n).



4. SIMULATION RESULTS 
Parameters analyzed in the simulations are: the availability, the average downtime over
one year, the probability of zero downtime over one year, and the downtime distributions.
The results obtained from the simulations are then compared to the analytical results of the
parallel model.

4.1 THE AVAILABILITY 

Availability figures of the simulated models show a direct dependency on the MTTF and
MTTR of the links as well as on the size of the models. Simulations are based on equal rates
for all links. For the parallel system, the unit restoration time is kept constant at
4 hr/failure while the MTTF values are 2047 hr, 1445 hr, and 1179 hr. These figures result
in an average downtime over one year of 2, 4, and 6 minutes respectively. Fig. 4 shows the
availability of the three models obtained from simulation processes over a population of
10,000 samples compared to the analytical availability figure of the parallel architecture.
The links in the simulated models are configured with failure and restoration rates
identical to those of the parallel system. Availability ranges from 0.9999962 to 0.9999886
in the parallel model, compared with 0.999974 to 0.999917 in the 6-tuple model.

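Fig. 4: Effect of Failure-rate and Scaling on Availability. (Horizontal axis: downtime
objective per year in minutes; vertical axis: availability, 0.9999 to 1; curves: parallel,
single bridge, triple bridge, 6-tuple bridge.)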

4.2 AVERAGE DOWNTIME IN ONE YEAR 

In the second simulation, the average cumulative downtime figures of the three 
complex models are compared to the steady state analytical cumulative downtime values 
of the simple parallel model (Fig. 5). The comparisons show an increase in the cumulative 
downtime in two ways. First, due to the increased failure rate shown as increase of the 
downtime objective of the parallel model from two minutes to six minutes. Second, due to 
architecture scaling from a single bridge to a triple bridge and finally to a 6-tuple bridge. 
The restoration time is kept constant at 4 hours/failure during the simulation processes.
The availability and cumulative downtime figures for the complex models are related to each
other, as in the analytical case, by

$$E(\text{cumulative downtime in a steady-state period of length } T) = (1 - A)\,T$$

where A is the availability, E is the expected cumulative downtime, and T is the total
observed time, such as one year. The expected cumulative downtime over one year for the
6-tuple model from this relation is 13.6656 minutes, compared to the simulated figure of
13.666 minutes. The error between the two results is 0.003 %.

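Fig. 5: Effect of Failure-rate and Scaling on Downtime. (Horizontal axis: downtime
objective of the parallel system in minutes; vertical axis: average downtime per year in
minutes; curves: parallel, single bridge, triple bridge, 6-tuple bridge; average
restoration time = 4 hrs.)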

Table 3: Cumulative Downtime/yr (minutes). 

parallel system   single bridge   triple bridge   6-tuple bridge
2                 4.078           7.7             13.666
4                 7.64            16.69           27.232
6                 12.13           23.66           43.37

 
Table 3 compares the simulation results of the cumulative downtime per year for the 
three complex models with the analytical results of the simple parallel system. The links of 
the analytical parallel model and the simulation models are configured with same failure 
and restoration rates.  
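
As a rough cross-check of the simulated averages in Table 3, the steady-state unavailability of an n-tuple bridge can be computed exactly under simplifying assumptions: perfectly reliable nodes, statistically independent links, each link with steady-state availability MTTF/(MTTF + MTTR), and traffic able to use any surviving route between ingress and egress. The sketch walks through the ladder stage by stage, tracking which of the two rail nodes is still connected to the ingress. All names are hypothetical, and the calculation is not the RAPTOR model used in the paper.

```python
MINUTES_PER_YEAR = 525_600


def bridge_unavailability(n_cross_links: int, link_availability: float) -> float:
    """Steady-state probability that ingress and egress are disconnected."""
    a = link_availability
    q = 1.0 - a
    outcomes = ((True, a), (False, q))
    # state[(top, bottom)]: probability that the top / bottom rail node of the
    # current stage is connected to the ingress. The ingress feeds both rails.
    state = {(True, True): 1.0}
    for _ in range(n_cross_links):
        nxt = {}
        for (top, bottom), prob in state.items():
            for top_up, p_top in outcomes:            # next protected-path link
                for bot_up, p_bot in outcomes:        # next backup-path link
                    for cross_up, p_cross in outcomes:  # cross link of this stage
                        t = top and top_up
                        b = bottom and bot_up
                        if cross_up and (t or b):
                            t = b = True              # cross link joins the rails
                        weight = prob * p_top * p_bot * p_cross
                        nxt[(t, b)] = nxt.get((t, b), 0.0) + weight
        state = nxt
    # Final pair of links into the egress: disconnected if every link leaving a
    # still-connected rail node is down.
    unavailability = 0.0
    for (top, bottom), prob in state.items():
        unavailability += prob * (q if top else 1.0) * (q if bottom else 1.0)
    return unavailability


def downtime_minutes(n_cross_links: int, mttf: float = 2047.0, mttr: float = 4.0) -> float:
    """Average yearly downtime (minutes) for the given bridge size."""
    return bridge_unavailability(n_cross_links, mttf / (mttf + mttr)) * MINUTES_PER_YEAR


for n, label in ((0, "parallel"), (1, "single bridge"), (3, "triple bridge"), (6, "6-tuple bridge")):
    print(f"{label:15s}: {downtime_minutes(n):6.2f} min/yr")
```

With MTTF = 2047 hr and MTTR = 4 hr the figures come out near, though not exactly on, the simulated values in the first row of Table 3, and the roughly linear growth with the number of link pairs is visible directly.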

4.3 PROBABILITY OF ZERO DOWNTIME 

The third group of simulations is intended to determine the expected number of systems
among a population that will experience zero downtime over one year. Since only little
downtime may be observed under the considered downtime objectives of two to six minutes
per year, a very large number of population samples is needed to obtain a reliable
estimate. The simulation results are compared with the analytical results for the
zero-downtime probability of the simple parallel system. The analytical availability model
developed in [7] shows that 98.35 % of the system population will experience zero
downtime over one year under an average downtime objective of two minutes. This figure
decreases to 96.73 % at an average downtime objective of 4 minutes and finally to
95.13 % at an objective of 6 minutes. The simulation model of the complex system closest to
the parallel system, i.e. the single bridge, results in zero-downtime probabilities
of 97.05 %, 93.69 %, and 90.44 % respectively under the same failure and restoration rates
as the parallel system. Fig. 6 shows the behavior of the probability of zero downtime of
the three complex models as compared to the analytical simple parallel model.

Table 4 summarizes the results of the simulation processes for the probability of zero 
downtime among system populations in the three complex simulation models as compared 
to the analytical parallel model. The links of the analytical parallel model and the 
simulation models are configured with same failure and restoration rates. 

Table 4:  Probability of Zero Downtime. 

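                               parallel system   single bridge   triple bridge   6-tuple bridge
Probability of zero downtime   98.35%            97.05%          93.68%          88.94%
Downtime/yr (mins)             2                 4.08            7.7             13.67
Probability of zero downtime   96.73%            93.69%          87.10%          79.14%
Downtime/yr (mins)             4                 7.64            16.69           27.23
Probability of zero downtime   95.13%            90.44%          82.01%          69.96%
Downtime/yr (mins)             6                 12.13           23.66           43.37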

 
The columns in Table 4 represent the three complex systems in addition to the parallel 
system while the rows include the probability figures of zero downtime for a given  system 
(the upper entry in the row) associated with a given reliability objective (the lower entry in 
the same row). For example, the probability of zero downtime for the parallel system is 
98.35 % at a reliability objective of two minutes, which decreases to 95.13 % at an 
objective of six minutes. Similarly the probability of zero downtime for a 6-tuple bridge is 
88.94 % at an objective of 13.67 minutes (a figure resulting from configuring the 
components of the 6-tuple bridge with same failure and repair rates as that of the parallel 
system for a two minutes objective), which decreases to 69.96 % at an objective of 43.37 
minutes. 



 
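Fig. 6: Probability of Zero Downtime per Year. (Horizontal axis: downtime objective per
year in minutes; vertical axis: 0.7 to 1; curves: parallel, single bridge, triple bridge,
6-tuple bridge.)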

4.4 DOWNTIME PROBABILITY DISTRIBUTION 

The fourth type of simulation is the most detailed one, as it moves beyond the
single-figure average to provide more comprehensive details on the behavior of the units
among a population designed with an equal downtime objective. The probability of zero
downtime in one year provides single-figure information about the percentage of the
systems that will experience no downtime. The average downtime provides information
about the average downtime experienced by a large population of systems. But neither
tells us in detail how units that are expected to experience the same average downtime
will behave. Here again, simulation results are compared to the analytical results of the
simple parallel system.

In the parallel system with a two-minute downtime objective, only 1.65 % of the systems
are expected to fail over the duration of one year. But more than 60 % of those expected
to fail, or one percent of the entire population, will have a cumulative downtime of over
one hour. Furthermore, 0.6 % and 0.2 % of the entire population are expected to have a
downtime of over two hours and over four hours respectively. Similarly, the simulation
results of the 6-tuple bridge at a downtime objective of 13.67 minutes per year show that
1 % of the population will have a downtime of over five hours. Furthermore, 1.58 %,
2.62 %, 4.18 %, and 6.72 % of the population will have a cumulative downtime over one year
of more than 4, 3, 2, and 1 hours respectively.

Fig. 7 shows the complementary downtime distribution for the three complex models as
compared to the analytical parallel architecture. Link failure rates and restoration rates
for all systems are configured identically with MTTR = 4 hrs and MTTF = 2047 hrs,
corresponding to an average downtime objective of two minutes over one year for the
parallel architecture.


 
Fig. 7:  Complementary Downtime Distribution over One Year. 

Table 5 summarizes the downtime distribution behavior of the three models as 
compared to the analytical parallel architecture model. 

Table 5: Complementary Downtime Distribution. 

Percent of Population   Parallel system   Single bridge   Triple bridge   6-tuple bridge
Zero downtime           98.35%            97.05%          93.68%          88.94%
> 1 hr                  1.00%             1.85%           4%              6.72%
> 2 hrs                 0.62%             1.18%           2.43%           4.18%
> 3 hrs                 0.38%             0.89%           1.45%           2.62%
> 4 hrs                 0.23%             0.56%           0.91%           1.58%
Downtime/yr (min)       2                 4.078           7.7             13.666
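
The statement that more than 60 % of the failing parallel systems exceed one hour of downtime follows directly from the Table 5 figures. The short sketch below repeats that arithmetic for all four architectures. The dictionary layout and variable names are hypothetical. The numbers are copied from the table.

```python
# Zero-downtime share and share exceeding one hour, in percent (Table 5).
table5 = {
    "parallel":       {"zero": 98.35, "over_1h": 1.00},
    "single bridge":  {"zero": 97.05, "over_1h": 1.85},
    "triple bridge":  {"zero": 93.68, "over_1h": 4.00},
    "6-tuple bridge": {"zero": 88.94, "over_1h": 6.72},
}

share_over_1h = {}
for name, row in table5.items():
    failing = 100.0 - row["zero"]            # percent of systems that fail at all
    # Share of the failing systems whose yearly downtime exceeds one hour.
    share_over_1h[name] = 100.0 * row["over_1h"] / failing
    print(f"{name:15s} failing: {failing:5.2f} %   "
          f"of which > 1 hr: {share_over_1h[name]:.1f} %")
```

In each architecture, roughly three out of five systems that fail at all accumulate more than one hour of downtime, which is the behavior that a single average figure hides.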

5. CONCLUSIONS 
This paper presents a simulation-based approach for comprehensive downtime 
prediction of the network backbones found in the MPLS-based next-generation Internet. 
Available analytical models for comprehensive downtime prediction handle only simple 
network architectures: the single-component simplex system and the simple backup 
(parallel) system. In this work a simulation model is used to analyze networks of complex 
structure with a tendency to scale up. 

Comprehensive downtime prediction adds more confidence to the reliability analysis of 
systems with random failure behavior as compared to the more common single figure 
average downtime prediction approach. 

[Plot for Fig. 7: curves for the parallel, single-bridge, triple-bridge, and 6-tuple bridge systems; x-axis: Downtime Threshold (mins), 0 to 320; y-axis (logarithmic): 0.001 to 0.1; MTTR = 4 hrs, MTTF = 2047 hrs.]



The average downtime approach predicts equal downtime for all members of a wide 
population of systems that are equally designed to deliver a specific downtime objective. 
The comprehensive downtime approach, on the other hand, provides information on the 
distribution of the expected downtime behavior across the system population. Hence, it 
enables us to determine the extreme cases as well as specific downtime behavior, such as 
the percentage of systems with zero downtime over one year, the percentage within the 
designed downtime objective, or the percentages exceeding specific limits. 

In summary, although the approach considered in this paper is aimed at the network 
infrastructure of the next-generation Internet, it is equally applicable to complex 
networks of similar structure. 

REFERENCES 

[1] [RFC2212] S. Shenker et al, “Specification of Guaranteed QoS”, 1997. 

[2] [RFC2475] S. Blake et al, “An Architecture for Differentiated Services”, 1999. 

[3] P. Ford and Y. Bernet, “Integrated Services-over-Differentiated Services”, IETF Internet 
Draft, 1998 

[4] Y. Bernet et al, “A Framework for Differentiated Services”, IETF Internet Draft, 1999. 

[5] H. Malec, “Communications reliability: a historical perspective,” IEEE Transaction on 
Reliability, vol. 47,  pp. 333-344,  September 1998. 

[6] “Economics of High availability for Telecommunications Systems”, An Intel® Primer, Intel 
Corporation, 2001. 

[7] C.M. Hamilton, N.A. Marlow, “Analyzing telecommunications network availability 
performance using the downtime probability distribution.” Globcom, December1991. 

[8] R.E. Barlow, F. Proschan, “Mathematical Theory of Reliability.” New York: John Wiley, 1965 

[9] E.A. Elsayed, “Reliability Engineering”, Addison Wesley Longman, Inc., 1996. 

[10] G. Ahn and W. Chun, “Design and Implementation of  MPLS Network Simulator Supporting   
LDP and CR-LDP,” IEEE International Conference on Networks (ICON2000), Singapore 
Sept. 2000 

[11] L.H. Crow, “A method for Achieving an Enhanced Mission Capability”, 48th Reliability and  
Maintainability Symposium, RAMS 2002, The International  Symposium on Product Quality 
& Integrity, 2002. 

[12] R. Billinton, and N.R Allan, “Reliability Evaluation of Engineering System”, Plenum Press,  
1992. 

 

BIOGRAPHIES 

Wajdi Fawzi M. Al-Khateeb received his M.Sc. degree from Berlin University of 
Technology, Germany in 1968. He joined Baghdad University of Technology in 1968 and 
later pursued a planning and consultancy engineering career in the telecommunications 
industry from 1971 to 1993, with a focus on the reliability of telecommunications systems 
and disaster recovery. He joined IIUM as a lecturer in 1995. His research interests include 
reliability of telecommunications systems and computer networks, network simulation 
and modeling, traffic engineering, and quality of service (QoS) in computer networks and 
the Internet. 

Sufyan Al-Irhayim is currently Associate Professor at the Department of Computer 
Engineering, University of Bahrain. He received his B.Sc. degree from the University of 
Mosul, Iraq in 1977. He obtained his M.Sc. in 1980 and his PhD in 1985 from the University 
of Bradford, England. He joined the University of Mosul, Iraq in 1987, IIUM in 1991, and 
the University of Bahrain, Bahrain in 2004. His research interests include performance and 
reliability of IP networking, mobility and multicasting in computer networking, network 
security, and intelligent reactive compensators. 

Prof. Khalid al-Khateeb studied in the United Kingdom and graduated from the Royal 
College with an Honors B.Sc. degree in Electronics in 1966, followed by a Master's degree 
in 1971 at Salford University and a PhD in 1975 at Manchester University. His research 
interests include electronics, communications, computer applications, and engineering 
education. His academic life stretches over decades, during which he worked at various 
universities in the UK, USA, Iraq, Algeria, Jordan and Malaysia.