Cubo_Reliability_versija5.dvi CUBO A Mathematical Journal Vol.12, No¯ 01, (7–13). March 2010 Analysis of the Component-Based Reliability in Computer Networks Saulius Minkevičius Vilnius University, Faculty of Mathematics and Informatics Naugarduko 24, 03225 Vilnius, Lithuania email : stst@ktl.mii.lt ABSTRACT Performance in terms of reliability of computer networks motivates this paper. Limit theorems on the extreme queue length and extreme virtual waiting time in open queueing networks in heavy traffic are derived and applied to a reliability model for computer networks where we relate the time of failure of a computer network to the system parameters. RESUMEN Desempeño en términos de fiabilidad de redes de computador notiva este art́ıculo. Teoremas ĺımite sobre la duración extrema de cola y el tiempo de espera virtual extremo en redes de cola abierta en trafico pesado sao derivados y aplicados a un modelo de fiabilidad para redes de computador donde relacionamos el tiempo de falha de una red de computador al sistema de parámetros. Key words and phrases: Performance evaluation, reliability theory, queueing theory, mathemat- ical models of technical systems, open queueing network, heavy traffic, the probability limit theorem, extreme values, queue length of customers, virtual waiting time of a customer. Math. Subj. Class.: 60K25, 60G70, 60F17. 8 Saulius Minkevičius CUBO 12, 1 (2010) 1 Introduction Probabilistic models and queueing networks have long been used to study the performance and reliability of computer systems [1, 2] and to analyse the performance and reliability of computer networks and of distributed information systems [3, 4]. In this paper, we will first briefly review the works related to using the queueing theory of computer system reliability, and then present some new results on the estimation of the time of failure of a computer network. In one of the first papers of this kind [6], the reliability of execution of programs in a distributed computing system is considered, showing that a program, which runs on multiple processing el- ements that have to communicate with other processing elements for remote data files, can be executed successfully despite that certain system components may be unreliable. In order to anal- yse the performance of multimedia service systems which have unreliable resources and to estimate their capacity requirements, a capacity planning model using an open queueing network is pre- sented in [9], and in [5] a novel model for a reliable system composed of N unreliable systems, which can hinder or enhance one anothers reliability, is discussed. In [10], the management pol- icy of an M /G/1 queue with a single removable and non-reliable server is discussed and analytic results are explored, using an efficient Matlab program to calculate the optimal threshold of the management policy and to evaluate system performance. In [11], the authors consider a single machine subject to breakdown and employ a fluid queue model with repair. In [12], the behaviour of a heterogeneous finite-source system with a single server is considered and applications in the field of telecommunications and reliability theory are treated. In this paper, first we present probability limit theorems on the extreme queue length of customers and the extreme virtual waiting time of a customer in heavy traffic for open queueing networks. We consider open networks with the “first come, first served” service discipline at each station and general distributions of interarrival and service times. The basic components of the queueing network are arrival processes, service processes, and routing processes among the different queues. The queueing network that we study has k single server stations, each of which has an associated infinite capacity waiting room. Every station has an arrival stream from outside the network, and the arrival streams are assumed to be mutually independent renewal processes. Customers are served in the order of arrival and after service they are randomly routed to either another station in the network, or out of the network entirely. Service times and routing decisions form mutually independent sequences of independent identically distributed random variables. 2 The Mathematical Model Let us consider the mutually independent sequences of independent identically distributed ran- dom variables { z (j) n , n ≥ 1 } , { S (j) n , n ≥ 1 } and { Φ (j) n , n ≥ 1 } for j = 1, 2, . . . , k; defined on the probability space. The random variables z (j) n and S (j) n are strictly positive, and Φ (j) n have sup- CUBO 12, 1 (2010) Analysis of the Component-Based Reliability ... 9 port in {0, 1, 2, . . . , k}. We define µj = ( M [ S (j) n ])−1 > 0, σj = D ( S (j) n ) > 0 and λj = ( M [ z (j) n ])−1 > 0, aj = D ( z (j) n ) > 0, j = 1, 2, ..., k; with all of these terms assumed finite. Denote pij = P ( Φ (i) n = j ) > 0, j = 1, 2, . . . , k. In the context of the queueing network being considered, the random variables z (j) n represent an interarrival time from outside the network at the station j, while S (j) n is the nth service time at the station j, and Φ (j) n is a routing indicator for the nth customer served at the station j. If Φ (i) n = j (which occurs with probability pij ), then the nth customer served at the station i is routed to the station j. When Φ (i) n = 0, the associated customer leaves the network. The matrix P is called a routing matrix. We observe that this system is quite general, encompassing tandem systems, acyclic networks of GI/G/1 queues, networks of GI/G/1 queues with feedback and open queueing networks. Let us define Qj(t) as the queue length of customers at the jth station of the queueing network at time t, β̂j = λj + k ∑ i=1 µi · pij − µj > 0, σ̂2j = (λj ) 3·Dz(j)n + k ∑ i=1 (µi) 3 ·DS(i)n ·(pij )2 +(µj )3·DS(j)n > 0, j = 1, 2, . . . , k and t > 0. Also, let us define Wj (t) as the virtual waiting time of a customer at the jth station of a queueing network at time t (one must wait until a customer arrives at the jth station of the queueing network to be served at time t), β̃j = λj + k ∑ i=1 µi · pij µj − 1, σ̃2j = k ∑ i=1 p2ij · µi · ( σj + ( µi µj )2 · σi ) + λj · ( σj + ( λj µj )2 · aj ) , j = 1, 2, · · · , k and t > 0. We will also assume that the following “overload conditions” are fulfilled: λj + k ∑ i=1 µi · pij > µj , j = 1, 2, . . . , k. (1) Note that these conditions guarantee that the length of all the queues will grow indefinitely with probability one. The results of the present paper are based on the following theorems: Theorem 1. If conditions (1) are fulfilled, then lim n→∞ P    sup 0≤s≤t Qj (ns) − β̂j · n · t σ̂j · √ n < x    = ∫ x −∞ exp ( − y 2 2t ) dy, 0 ≤ t ≤ 1 and j = 1, 2, . . . , k. and 10 Saulius Minkevičius CUBO 12, 1 (2010) Theorem 2. If conditions (1) are fulfilled, then lim n→∞ P    sup 0≤s≤t Vj (ns) − β̃j · n · t σ̃j · √ n < x    = ∫ x −∞ exp ( − y 2 2t ) dy, 0 ≤ t ≤ 1 and j = 1, 2, . . . , k. Proof. These theorems are proved in [7], and the proof is therefore omitted here so as not to lengthen this short paper. 3 The Reliability of a Computer Network Now we present a technical example from the computer network practice. Assume that queues arrive at a computer vj at the rate λj per hour during business hours, j = 1, 2, . . . ,p+r. These queues are served at the rate µj per hour in the computer vj , j = 1, 2, . . . ,p+r. After service in the computer vj , with probability pj (usually pj ≥ 0.9), they leave the network and with probability pji, i 6= j, 1 ≤ i ≤p+r (usually 0 < pji ≤ 0.1) arrive at the computer vi, i = 1, 2, . . . ,p+r. Also, we assume the computer vi fails when the virtual waiting time of queues is more than kj , j = 1, 2, . . . , p, and the computer vj fails when the queue length of queues is more than γi, i = 1, 2, . . . , r. In this section, we prove the following theorem on the probability that a computer network fails due to overload. Theorem 3. If t ≥ max ( max 1≤j≤p kj β̂j , max 1≤i≤r γi β̃i ) and conditions (1) are fulfilled, the computer network becomes unreliable (all computers fail). Proof. At first, using Theorem 1 and Theorem 2, we get that for x > 0 lim n→∞ P    sup 0≤s≤t Qj (ns) − β̂j · n · t σ̂j · √ n < x    = ∫ x −∞ exp ( − y 2 2t ) dy, j = 1, 2, . . . , p. (2) and lim n→∞ P    sup 0≤s≤t Vi(ns) − β̃i · n · t σ̃i · √ n < x    = ∫ x −∞ exp ( − y 2 2t ) dy, i = 1, 2, . . . , r. (3) CUBO 12, 1 (2010) Analysis of the Component-Based Reliability ... 11 Let us investigate a computer network which consists of the elements (computers) αj that are indicators of stations Xj , j = 1, 2, . . . , p and elements (computers) νi that are indicators of stations Yi, i = 1, 2, . . . , r Denote Xj = { 1, if the element αj is reliable 0, if the element αj is not reliable, j = 1, 2, . . . , p and Yi = { 1, if the element νi is reliable 0, if the element νi is not reliable, i = 1, 2, . . . , r. Note that {Xj = 1} = { sup 0≤s≤t Qj(ns) < kj}, j = 1, 2, . . . , p and {Yi = 1} = { sup 0≤s≤t Vi(ns) < γi}, i = 1, 2, . . . , r. Denote the structural function of the system of elements, connected by scheme 1 from p + r (see, for example, [8]), as follows: φ(X1, X2, . . . , Xp, Y1, Y2, . . . , Yr, t) = { 1, ∑p j=1 Xj + ∑r i=1 Yi ≥ 1 0, ∑p j=1 Xj + ∑r i=1 Yi < 1. Assume y = ∑p j=2 Xi + ∑r i=1 Yi. Estimate the reliability function of the system (computer network) using the formula of conditional probability h(X1, X2, . . . , Xp, Y1, Y2, . . . , Yr, t) = Eφ(X1, X2, . . . , Xp, Y1, Y2, . . . , Yr, t) = P (φ(X1, X2, . . . , Xp, Y1, Y2, . . . , Yr, t) = 1) = P ( p ∑ j=1 Xj + r ∑ i=1 Yi ≥ 1) = P (X1 + y ≥ 1) = P (X1 + y ≥ 1|y = 1) · P (y = 1) + P (X1 + y ≥ 1|y = 0) · P (y = 0) = P (X1 ≥ 0) · P (y = 1) + P (X1 ≥ 1) · P (y = 0) ≤ P (y = 1) + P (X1 ≥ 1) = P (y = 1) + P (X1 = 1) ≤ P (y ≥ 1) + P (X1 = 1) = P ( p ∑ j=2 Xj + r ∑ i=1 Yi ≥ 1) + P (X1 = 1) ≤ · · · ≤ p ∑ j=1 P (Xj = 1) + r ∑ i=1 P (Yi ≥ 1) ≤ · · · ≤ p ∑ j=1 P (Xj = 1) + r ∑ i=1 P (Yi = 1) = p ∑ j=1 P ( sup 0≤s≤t Qj (ns) ≤ kj ) + r ∑ i=1 P ( sup 0≤s≤t Vi(ns) ≤ γi). Thus, 0 ≤ h(X1, X2, . . . , Xp, Y1, Y2, . . . , Yr, t) ≤ p ∑ j=1 P ( sup 0≤s≤t Qj (ns) ≤ kj ) + r ∑ i=1 P ( sup 0≤s≤t Vi(ns) ≤ γi). (4) 12 Saulius Minkevičius CUBO 12, 1 (2010) Applying Theorem 1, we obtain that 0 ≤ lim n→∞ P ( sup 0≤s≤t Qj (ns) < kj ) = lim n→∞ P    sup 0≤s≤t Qj(ns) − β̂j · n · t σ̂j · √ n < kj − β̂j · n · t σ̂j · √ n    = ∫ −∞ −∞ exp ( − y 2 2t ) dy = 0. (5) j = 1, 2, . . . , p. From (5) it follows that for kj < ∞, lim n→∞ P ( sup 0≤s≤t Qj(ns) < kj ) = 0, j = 1, 2, . . . , p. (6) Similarly as in (5) - (6) we prove that for γi < ∞ lim n→∞ P ( sup 0≤s≤t Vi(ns) < γi ) = 0, i = 1, 2, . . . , r. (7) Consequently, lim n→∞ h(X1, X2, . . . , Xp, Y1, Y2, . . . , Yr, t) = 0 (see (5)-(7)), which completes the proof. Finally, we provide an expression for h(X1, X2, . . . , Xp, Y1, Y2, . . . , Yr, t), t > 0 by proving the following theorem. Theorem 4. h(X1, X2, . . . , Xp, Y1, Y2, . . . , Yr, t) is equal to exp(− ∑p j=1 P ( sup 0≤s≤t Qj(ns) < kj ) − ∑r i=1 P ( sup 0≤s≤t Vi(ns) < γi)). Proof. Let λj , j = 1, 2, . . . , p and qi, i = 1, 2, . . . , r be the traffic intensities related to each server. Then the probability of stopping this system is equal to e− ∑p j=1 λj − ∑ r i=1 qj (see, for example, [13]). But λj = M Xj = P (Xj = 1) = P ( sup 0≤s≤t Qj (ns) < kj ), j = 1, 2, . . . , p. (8) and qi = M Yi = P (Yi = 1) = P ( sup 0≤s≤t Vi(ns) < γi), j = 1, 2, . . . , .r. (9) Applying (8) and (9), we obtain that h(X1, X2, . . . , Xp, Y1, Y2, . . . , Yr, t) is equal to e − p ∑ j=1 λj − r ∑ i=1 qj = e − ∑p j=1 P ( sup 0≤s≤t Qj (ns)