International Journal of Applied Sciences and Smart Technologies, Volume 2, Issue 2, pages 91–110, p-ISSN 2655-8564, e-ISSN 2685-9432

RAID Level 6 and Level 6+ Reliability

Thomas Schwarz
Department of Computer Science, Klingler College of Arts and Sciences, Marquette University, Milwaukee, Wisconsin, U.S.A.
Corresponding Author: thomas.schwarz@marquette.edu
(Received 19-08-2020; Revised 09-11-2020; Accepted 09-11-2020)

Abstract

Storage systems are built of fallible components but have to provide high degrees of reliability. Besides mirroring and triplicating data, redundant storage of information using erasure-correcting codes is the only way to have data survive device failures. We provide here exact formulae for the data-loss probability of a disk array composed of several RAID Level 6 stripes. This two-failure-tolerant organization is not only used in practice but can also provide a reference point for the assessment of other data organizations.

Keywords: RAID level 6, robustness, reliability

1 Introduction

Storage systems are built of a large number of fallible components. Failures of some of these components might be independent or might be correlated. Often, the causes of failures are obscure. For example, Jim Gray's team at Microsoft Research once observed that disk reliability in an experimental database for astronomy increased remarkably after switching to slightly more expensive enclosures [1]. In this case, the vibrations of a failing disk drive had caused its neighbors to fail with a much higher probability. Failures can happen at all levels of the storage hierarchy. Central components such as air conditioning or power supply can fail, and data centers can be flooded. Individual racks can also suffer failure of networking and electricity supply. Individual disks can fail in toto or suffer latent sector failures. These incidents happen at surprisingly high rates even in large, state-of-the-art data centers [2, 3, 4]. To protect against failures, data has to be stored redundantly. With the recognition of the importance of latent sector failures, intra-disk redundancy and other schemes were introduced [5], but it was also recognized that protection against a single failure is not sufficient. However, protection against single disk failures dates back much further. Originally termed Redundant Arrays of Inexpensive Disks, Redundant Arrays of Independent Disks (RAID) were proposed to allow parallel I/O but were quickly recognized to solve an even more important problem, namely the tendency of magnetic hard drives to fail, often without warning. The original RAID architecture added a parity disk to an ensemble of data disks. This parity disk stores in its sectors the bit-wise exclusive-or of parallel sectors in the data disks. When a data disk fails or becomes otherwise inaccessible, its contents can be recovered by reading parallel sectors in all surviving disks and taking the bit-wise exclusive-or of the sector contents. This arrangement is called a RAID Level 4. The final architecture, the RAID Level 5, rotates the role of parity disk among the disks and thus does away with the differences in load.
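As an aside, the single-parity recovery just described amounts to a few lines of code. The following Python sketch is purely illustrative (the block contents and sizes are made up, not taken from the paper): it computes a parity block as the bit-wise exclusive-or of the data blocks and reconstructs a lost block from the survivors.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bit-wise XOR of equally sized byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical stripe of four data blocks (sector-sized in a real array).
data = [bytes([i] * 8) for i in (3, 5, 7, 11)]
parity = xor_blocks(data)                    # contents of the parity disk

# Simulate the loss of data block 2 and recover it from the survivors.
survivors = data[:2] + data[3:] + [parity]
recovered = xor_blocks(survivors)
assert recovered == data[2]
```

Recovering a block thus requires reading the corresponding block on every surviving disk of the stripe, which is exactly the recovery load discussed below.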
A dedicated parity disk is written with each update to any of the data disks, but not read by any look-up, so that its load equals that of a data disk only if the read and write loads of the ensemble fulfill a certain linear equation. Nevertheless, some important players in the storage industry use dedicated parity disks. Not only the presence of latent sector errors (discovered only upon trying to access a disk sector) but also the coupling of complete disk failures motivated the introduction of higher degrees of failure tolerance. The RAID Level 6 adds one more parity disk to the ensemble. The contents of the new parity disk have to be calculated using a more involved code, usually defined using Galois field operations, though many other possibilities exist [e.g. 6, 7]. Since its inception, the architecture of failure-tolerating disk arrays has undergone constant innovation, and more erasure-correcting codes and disk array layouts have been proposed [8, 9]. To evaluate their effectiveness, a disk array consisting of several RAID Level 6 stripes is used as a point of comparison. It seems, however, that no single calculation of RAID Level 6 reliability has been published. This short contribution therefore aims at rectifying this lacuna. It is in part motivated by reviewing papers that make mistakes in the assessment of disk array reliability. We are right now experiencing a revolution in the storage industry where flash replaces disk, and maybe soon PCM or another non-volatile memory architecture replaces both. Our analysis will remain valid independent of the chosen underlying technology.

Figure 1. A disk array made up of five stripes organized as a Level 6 RAID. Each stripe consists of six data disks, a P-parity disk, and a Q-parity disk.

2 Failure Tolerance of an Ensemble of RAID Level 6 With or Without Distributed Parity

We first look at the classic Level 6 RAID, depicted in Figure 1. The RAID consists of $m$ stripes ($m = 5$ in the figure), each with $n$ data disks and two parity disks. In Figure 1, we have two types of parity disks: a P-parity, calculated as the bit-wise exclusive-or of the data disks in the stripe, and a Q-parity, calculated using Galois field operations. If we distribute parity, we no longer have dedicated parity disks, but each disk will have P- and Q-parity blocks in equal proportion. We will also look at two variants. The first variant uses declustering [10], where all disks are divided into disklets of equal size. To $mn$ of the disklets in a parallel position we assign the role of data disklet, to $m$ we assign the role of P-parity disklet, and to the remaining $m$ we assign the role of Q-parity disklet, in such a manner that the organization of Figure 1 holds at the disklet level. Before declustering, a single or double failure in a stripe makes us read almost all or all of the disks in the reliability stripe; this recovery load is thus limited to a single stripe. In the declustered version, the recovery load is evenly distributed over all surviving disks. On the negative side, if three disks have failed, then it is likely (this will be made more precise below) that three of the failed disklets are located in the same stripe in one of the many disklet-level configurations.
Thus, the probability of withstanding a given number of failures is smaller, but on the other hand, recovery is faster. The second variant adds a second Q-parity to each stripe. Since there are too many contenders for the title of RAID Level 7, I just call the resulting organization a Level 6+ RAID. Now, the disk array architecture can withstand up to and including three failed disks in a single stripe without losing data.

A. Reliability of the Non-Declustered Level 6 RAID

We first calculate the robustness, that is, the probability with which an ensemble such as that shown in Figure 1, with $m$ stripes of $n+2$ disks each, can withstand $f$ individual disk failures. Clearly, in order not to have lost any data, each of the stripes can have at most two failures; we call such a selection of $f$ failed disks a benign failure pattern. Let $f_1$ be the number of stripes with one failed disk and $f_2$ be the number of stripes with two failed disks in a benign failure pattern. Clearly, $f_1 + 2 f_2 = f$ or, trivially equivalent, $f_1 = f - 2 f_2$. We now count the number of benign failure patterns with $f$ failed disks. We set $f_1 = f - 2 f_2$. We first select the $f_1$ stripes with a single failure out of the $m$ stripes. Then we select the $f_2$ stripes with two failures out of the remaining $m - f_1$ stripes. Finally, we select one of the $n+2$ disks in each of the $f_1$ stripes and two of the $n+2$ disks in each of the $f_2$ stripes. This gives us for the number of benign patterns

$$B(f) = \sum_{f_2 = 0}^{\lfloor f/2 \rfloor} \binom{m}{f - 2 f_2} \binom{m - (f - 2 f_2)}{f_2} (n+2)^{f - 2 f_2} \binom{n+2}{2}^{f_2}.$$

The probability of data loss is given by

$$P_{\mathrm{loss}}(f) = 1 - \frac{B(f)}{\binom{m(n+2)}{f}},$$

that is, as the complementary probability of the quotient of the number of benign failure patterns of size $f$ over the number of all failure patterns of size $f$. Despite its seeming complexity, mathematical software like Mathematica, or software developed using the infinite integer precision afforded by Python, is able to calculate $P_{\mathrm{loss}}(f)$ exactly, though not instantaneously. For small $f$, we can obtain formulae that are easier to evaluate. For $f \le 2$, the probability of data loss is zero since the disk array can always tolerate two failures. For $f = 3$, it is easier to count the malign failure patterns. Three failed disks can only cause data loss if they are all located in the same stripe. We model the failures as occurring serially in time. The first failure then determines a stripe. The next two failures need to fall into the same stripe. There are $\binom{n+1}{2}$ possibilities to form this malign pattern out of $\binom{m(n+2)-1}{2}$ possibilities to select two additional disks to fail. Thus, data loss happens with probability

$$P_{\mathrm{loss}}(3) = \frac{\binom{n+1}{2}}{\binom{m(n+2)-1}{2}}.$$

For $f = 4$, a malign failure pattern can have either four failed disks in the same stripe or three failed disks in the same stripe and another failed disk in a different stripe. This gives

$$m \binom{n+2}{4} + m \binom{n+2}{3} (m-1)(n+2)$$

malign patterns, and the corresponding data-loss probability evaluates to this number divided by $\binom{m(n+2)}{4}$.

B. Reliability of the Non-Declustered Level 6+ RAID

We extend our methodology to the three-failure tolerance of each single stripe. The number of benign failure patterns is obtained by first selecting non-negative integers $f_1$, $f_2$, and $f_3$ such that $f_1 + 2 f_2 + 3 f_3 = f$, representing the numbers of stripes with one, two, and three failures, respectively. Then we select $f_1$ stripes out of the $m$ stripes, then $f_2$ stripes out of the remaining $m - f_1$, then $f_3$ out of the remaining $m - f_1 - f_2$ stripes (or $\binom{m}{f_1}\binom{m-f_1}{f_2}\binom{m-f_1-f_2}{f_3}$ choices so far), and then, for each of the stripes with one failure, one disk out of $n+3$, for each of the stripes with two failures, two out of $n+3$, and finally, for each of the stripes with three failures, three out of $n+3$ disks, giving us

$$B^{+}(f) = \sum_{f_1 + 2 f_2 + 3 f_3 = f} \binom{m}{f_1} \binom{m-f_1}{f_2} \binom{m-f_1-f_2}{f_3} (n+3)^{f_1} \binom{n+3}{2}^{f_2} \binom{n+3}{3}^{f_3}.$$
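As mentioned above, these counts can be evaluated exactly with arbitrary-precision integer arithmetic. The following Python sketch is my own illustration, not code from the paper (the function names and the example parameters are made up); it computes the data-loss probability of the non-declustered Level 6 and Level 6+ arrays for a given number $f$ of failed disks.

```python
from math import comb
from fractions import Fraction

def benign_patterns_raid6(m, n, f):
    """Number of patterns of f failed disks with at most two failures per stripe."""
    total = 0
    for f2 in range(f // 2 + 1):
        f1 = f - 2 * f2
        if f1 + f2 > m:
            continue
        total += comb(m, f1) * comb(m - f1, f2) * (n + 2) ** f1 * comb(n + 2, 2) ** f2
    return total

def benign_patterns_raid6plus(m, n, f):
    """Same count for Level 6+, i.e. at most three failures per stripe of n+3 disks."""
    total = 0
    for f3 in range(f // 3 + 1):
        for f2 in range((f - 3 * f3) // 2 + 1):
            f1 = f - 2 * f2 - 3 * f3
            if f1 + f2 + f3 > m:
                continue
            total += (comb(m, f1) * comb(m - f1, f2) * comb(m - f1 - f2, f3)
                      * (n + 3) ** f1 * comb(n + 3, 2) ** f2 * comb(n + 3, 3) ** f3)
    return total

def p_loss(m, n, f, plus=False):
    """Exact data-loss probability with f failed disks, as a Fraction."""
    disks_per_stripe = n + 3 if plus else n + 2
    benign = benign_patterns_raid6plus(m, n, f) if plus else benign_patterns_raid6(m, n, f)
    return 1 - Fraction(benign, comb(m * disks_per_stripe, f))

# Example with assumed parameters: 5 stripes of 6 data disks, as in Figure 1.
print(float(p_loss(5, 6, 3)))               # Level 6, three failed disks
print(float(p_loss(5, 6, 4, plus=True)))    # Level 6+, four failed disks
```

For the Level 6 example, the result agrees with the closed form above, $\binom{n+1}{2}/\binom{m(n+2)-1}{2}$.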
Figure 2. Data-loss probability of a RAID Level 6 given $f$ failures. The array has $m$ stripes with $n$ data disks and two parity disks per stripe.

Figure 3. Data-loss probability of a RAID Level 6+ given $f$ failures. The array has $m$ stripes with $n$ data disks and three parity disks per stripe.

We present an example for RAID Level 6 in Figure 2 and for RAID Level 6+ in Figure 3.

C. Reliability of Declustered RAID

There are many ways to decluster [10] a disk array. Sometimes, pre-defined layouts are used, assuming that all disks are equal. In modern data centers, it is more likely that Binary Large OBjects (BLOBs) or streams are stored by grabbing disklets on different disks, adding two or three parity disklets, and then repeating the process until the object or the stream has been stored. The first method has the advantage that we can calculate the data-loss probability exactly, and we can then use this result to approximate the second. We therefore assume here a pre-defined layout. In more detail, we assume that all disks in the disk array have equal size. Since reliability striping at the individual sector level (4 KB) would require much metadata and gain little in dividing the recovery workload among all disks in the array, we assume that each disk is split into $D$ disklets. In order to distribute the workload, the number $D$ of disklets should be at least in the hundreds, but it does not need to be much more.

Figure 4. Average deviation from fair entanglement in dependence on the number $D$ of disklets.

Indeed, a disklet on a healthy disk in a disk array with one failed disk (the most common scenario) is potentially involved in recovery only if it is in the same stripe as the corresponding disklet on the failed disk. Therefore, the disklet is potentially involved with probability $1/m$, where, as we recall, $m$ is the number of stripes. In a RAID Level 6 or 6+, we only need to read $n$ of the $n+1$ or $n+2$ remaining healthy disklets of the stripe, respectively, which gives the file system some choice that can be used for further load balancing. Even without taking advantage of this, the number $t$ of disklets on a healthy disk that are entangled with the failed disk is governed by the binomial distribution:

$$P(t = k) = \binom{D}{k} \left(\frac{1}{m}\right)^{k} \left(1 - \frac{1}{m}\right)^{D-k}.$$

The proportion of entangled disklets on a disk has expected value $1/m$, but the average disk has a proportion at a distance on the order of

$$\frac{\sqrt{\tfrac{1}{m}}\,\sqrt{1 - \tfrac{1}{m}}}{\sqrt{D}}$$

from this value, which quickly approaches zero, as we can see in Figure 4. There, we depict the average deviation of a disk from optimal entanglement. By multiplying this number by three, we get bounds on the difference from optimal entanglement within which almost all disks fall. Because the numbers in Figure 4 are already quite small for $D$ in the mid-hundreds, we can assume this magnitude for $D$. Operationally, the larger $D$, the more metadata needs to be kept, but also the longer it takes for a stream to fill up a disklet. More importantly in our context, the capability of a declustered RAID Level 6 or RAID Level 6+ to survive one additional disk failure beyond its guaranteed tolerance (of two and three failures, respectively) depends on $D$. If a configuration without declustering tolerates $f$ failures with probability $p_f = 1 - P_{\mathrm{loss}}(f)$, then its declustered equivalent with $D$ disklets per disk survives with probability $p_f^{D}$.
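A minimal sketch of this relationship, reusing the p_loss function from the previous listing (again my own illustration; the parameter values are assumptions chosen only for the example):

```python
def p_loss_declustered(m, n, f, D, plus=False):
    """Approximate data-loss probability of a declustered array with D disklets
    per disk, assuming the D disklet-level layouts behave independently."""
    survive_one_layout = 1 - p_loss(m, n, f, plus=plus)   # p_f from the text
    return 1 - float(survive_one_layout) ** D             # 1 - p_f^D

# Assumed example: 5 stripes, 6 data disks per stripe, 400 disklets per disk.
print(p_loss_declustered(5, 6, 3, 400))               # Level 6, three failed disks
print(p_loss_declustered(5, 6, 4, 400, plus=True))    # Level 6+, four failed disks
```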
Because the data-loss probabilities with $f = 3$ in the case of RAID Level 6 and $f = 4$ in the case of RAID Level 6+ are so low, the data-loss probability of the declustered RAID is higher than what one (or at least this author) would intuitively guess.

Figure 5. Data-loss probability in the presence of three failed disks in a declustered RAID Level 6 with $m$ stripes and $n$ data disks per stripe, in dependence on $D$, the number of disklets per disk.

Figure 6. Data-loss probability in the presence of four failed disks in a declustered RAID Level 6+ with $m$ stripes and $n$ data disks per stripe, in dependence on $D$, the number of disklets per disk.

Figure 7. Data-loss probability in the presence of five failed disks in a declustered RAID Level 6+ with $m$ stripes and $n$ data disks per stripe, in dependence on $D$, the number of disklets per disk.

As we can see from Figures 5–7, the data-loss probabilities in the declustered layout are substantial, but they remain below one for moderate values of $D$. This shows that a reliability analysis of declustered disk arrays underestimates their reliability if it assumes that the array can never tolerate more failures than its guaranteed tolerance, i.e. two in the case of Level 6 and three in the case of Level 6+.

3 Markov Modeling

Reliability should not be assessed exclusively through data-loss probabilities in dependence on the number of failed disks. The true measures of reliability are the Mean Time to Data Loss (MTTDL), the survival rate of data over the economic life-span of a disk array, or the annual loss rate. We obtain these numbers using a Markov or a semi-Markov model or even a Petri net. In this section, we give the formula for calculating transition probabilities and then apply it to calculate the MTTDL for declustered arrays using the usual simplifying assumptions.

Figure 8. Generic Markov model.

A. A Generic Markov Model

We should capture the current condition of a disk array in as few states as possible in order to simplify calculations. Often, it is possible to characterize the state of a disk array just by the current number of unavailable disks. The result is a state diagram such as that presented in Figure 8. The normal start state is State 0. There is also an absorbing state, State F, that characterizes a disk array that has lost data. The other states in Figure 8 represent situations where the array has suffered loss of access to one or more disks, but where all data stored in it can still be recovered. State $i$ is a state where $i$ disks have failed. We have a failure transition from State $i$ to State $i+1$ corresponding to the failure of an additional disk, if that failure does not lead to loss of data, and another transition from State $i$ to State F in case the failure has induced data loss. If the data on a failed disk has been recovered and saved on a spare or replacement disk, then we take a repair transition from State $i$ to State $i-1$. This model cannot capture all scenarios. For example, a disk array might have a small number of spare disks, but in case of failures, it might take a technician weeks to replace the failed disks with new spares. The rates at which transitions are taken in the model therefore reflect reality only more or less.
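Before turning to transition rates, it is worth noting that the state diagram of Figure 8 can also be evaluated by direct simulation, which remains possible when the distributional assumptions discussed next are dropped. The sketch below is entirely my own illustration (all parameter values are assumptions, not taken from the paper); it estimates the MTTDL of a fully declustered Level 6 array by Monte Carlo, using per-state loss probabilities of the kind computed in Section 2 and made precise in the next subsection.

```python
import random

def simulate_ttdl(N, lam, rho, pi, rng=random.Random(42)):
    """Draw one time to data loss for the state diagram of Figure 8.
    N disks, per-disk failure rate lam, per-disk repair rate rho, and pi[i]
    the probability that an additional failure in State i causes data loss."""
    t, failed = 0.0, 0
    while True:
        fail_rate = (N - failed) * lam
        repair_rate = failed * rho          # assumes repairs proceed in parallel
        t += rng.expovariate(fail_rate + repair_rate)
        if rng.random() * (fail_rate + repair_rate) < repair_rate:
            failed -= 1                     # a repair completes
        elif rng.random() < pi[failed]:     # the new failure causes data loss
            return t
        else:
            failed += 1                     # benign additional failure

# Assumed, deliberately pessimistic parameters so the simulation runs quickly:
# 90 disks, disk MTTF of 10,000 hours, MTTR of 8 hours, fully declustered Level 6
# (any third concurrent failure loses data).
pi = [0.0, 0.0, 1.0]
samples = [simulate_ttdl(90, 1e-4, 1 / 8, pi) for _ in range(1000)]
print(sum(samples) / len(samples))          # Monte Carlo estimate of the MTTDL
```

Replacing expovariate by a Weibull or empirical sampler is a one-line change, which is exactly the flexibility that closed-form Markov analysis lacks.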
In the case of failures, a common first-order approximation is the assumption of exponentially distributed failure times. In this case, the probability that a functioning disk fails in a given small time interval is constant. (Thus, the hazard rate, in the lingo of reliability theory, is constant.) It is, however, known that disk failure times are modeled with more accuracy using a Weibull distribution, or even better, using real disk lifespans [11]. Much more problematic is the modeling of repair times, as exponential repair times are hardly realistic. An exponential repair time assigns positive probabilities to physically impossible short repairs and to very lengthy repairs. If technicians with normal working hours are involved, times to repair will also depend on the time at which the failure occurred. If both failures and repairs are exponentially distributed, then the state diagram of Figure 8 represents a true Markov model. For the researcher, this model is very attractive because it sometimes allows closed-form solutions for the mean time to data loss and similar measures, and, failing that, simpler methods for numerical solutions.

B. Calculating State Transition Probabilities

The rate with which failure transitions are taken depends on the number of disks that can fail and on the probability that an additional failure will lead to data loss. In the simple case of a Markov model, these are the only contributors. We call $\pi_i$ the probability that an additional disk failure in State $i$ leads to data loss. (This is not necessarily the $(i+1)$st disk failure, since the system might already have experienced successful repairs.) Thus,

$$\pi_i = P(\text{data loss after the additional failure} \mid \text{no data loss with } i \text{ failed disks}).$$

If $q_i = P_{\mathrm{loss}}(i)$ denotes the probability that the system has suffered data loss in State $i$, i.e. with $i$ failed disks, as calculated in Section 2, then we have

$$1 - q_{i+1} = (1 - q_i)(1 - \pi_i).$$

Thus,

$$\pi_i = \frac{q_{i+1} - q_i}{1 - q_i}.$$

If we assume exponential repair times that are independent of each other and exponential failure times, this gives us a Markov model with:

1) Non-failure states, where State $i$ represents a disk array with $i$ failed disks.
2) A failure state, State F, which is absorbing.
3) The start state, State 0.
4) A repair transition from State $i$ to State $i-1$, taken with rate $i\rho$ for a fixed "repair rate" $\rho$.
5) A failure transition not representing data loss from State $i$ to State $i+1$, taken with rate $(N-i)\lambda(1-\pi_i)$, where $N$ is the number of disks and $\lambda$ the failure rate of an individual disk.
6) A failure transition representing data loss from State $i$ to State F, taken with rate $(N-i)\lambda\,\pi_i$.

C. MTTDL Calculation for the Markov Model

If we collect the current probabilities of being in State $i$ at time $t$ into a column vector $\vec{p}(t)$ and the transition rates into a transition matrix $M$, then we have the fundamental Chapman-Kolmogorov equation

$$\frac{d}{dt}\,\vec{p}(t) = M\,\vec{p}(t).$$

The probability that the system has not suffered data loss at time $t$ is given by the sum of all coefficients of $\vec{p}(t)$, namely $\mathbb{1}^{T}\vec{p}(t)$, with $\mathbb{1}^{T}$ a transposed column vector containing one-coefficients. The MTTDL of the system is then given as

$$\mathrm{MTTDL} = \int_{0}^{\infty} t \left(-\frac{d}{dt}\,\mathbb{1}^{T}\vec{p}(t)\right) dt = \int_{0}^{\infty} \mathbb{1}^{T}\vec{p}(t)\, dt = \mathbb{1}^{T} \int_{0}^{\infty} \vec{p}(t)\, dt.$$

By applying the Laplace transform to the fundamental differential equation, we obtain

$$s\,\hat{p}(s) - \vec{p}(0) = M\,\hat{p}(s),$$

and after setting the new indeterminate $s$ to zero, the definition of the Laplace transform gives us

$$-\vec{p}(0) = M \int_{0}^{\infty} \vec{p}(t)\, dt.$$

We then multiply with $\mathbb{1}^{T} M^{-1}$ in order to obtain

$$\mathrm{MTTDL} = \mathbb{1}^{T} \int_{0}^{\infty} \vec{p}(t)\, dt = -\,\mathbb{1}^{T} M^{-1}\, \vec{p}(0).$$

The value of $\vec{p}(0)$ is of course $(1, 0, \dots, 0)^{T}$, since the system starts out in State 0.
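The matrix formula lends itself directly to numerical evaluation. The sketch below is my own illustration (the function name, the parallel-repair assumption, and the parameter values are assumptions); it builds the transition matrix $M$ for the model of Figure 8 and evaluates $\mathrm{MTTDL} = -\mathbb{1}^{T} M^{-1} \vec{p}(0)$.

```python
import numpy as np

def mttdl(N, lam, rho, pi):
    """MTTDL = -1^T M^{-1} p(0) for the generic model of Figure 8.
    pi[i] is the probability that an additional failure in State i loses data;
    repairs are assumed to proceed in parallel (rate i*rho from State i)."""
    assert pi[-1] == 1.0        # the last modeled state must lose data on any failure
    k = len(pi)                 # number of non-failure states 0 .. k-1
    M = np.zeros((k, k))
    for i in range(k):
        fail = (N - i) * lam
        M[i, i] = -(fail + i * rho)              # total rate of leaving State i
        if i + 1 < k:
            M[i + 1, i] = fail * (1 - pi[i])     # benign failure transition
        if i > 0:
            M[i - 1, i] = i * rho                # repair transition
    p0 = np.zeros(k)
    p0[0] = 1.0                                   # system starts in State 0
    return -np.ones(k) @ np.linalg.solve(M, p0)

# Assumed example: 90 disks, disk MTTF of 100,000 hours, MTTR of 8 hours,
# fully declustered Level 6 (any third concurrent failure loses data).
print(mttdl(90, 1 / 100_000, 1 / 8, pi=[0.0, 0.0, 1.0]))
```

For the fully declustered cases, pi is simply a vector of zeros ending in a one; for other layouts, pi can be filled in from the exact probabilities of Section 2.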
It should be noted that the calculation of the inverse is not without numerical challenges, as the magnitude of the repair transition rates is much larger than the magnitude of the failure transition rates, so that the matrix $M$ is ill-conditioned. However, this is a problem only if one does not use good software for numerical inversion, and in many cases just using double precision suffices for reasonable accuracy.

D. Example: Declustered RAID Levels

In the case of a highly declustered RAID Level 5 (one parity per stripe), RAID Level 6 (two parities per stripe), or RAID Level 6+ (three parities per stripe), the Markov models are fairly simple, because any more than one, two, or three concurrent failures, respectively, has to lead to data loss. In the case of RAID Level 5, the fundamental differential equation for the state vector made up of the probabilities of the system being in State 0 or State 1,

$$\vec{p}(t) = \begin{pmatrix} p_0(t) \\ p_1(t) \end{pmatrix},$$

is simply

$$\begin{aligned} \dot{p}_0(t) &= -N\lambda\, p_0(t) + \rho\, p_1(t), \\ \dot{p}_1(t) &= N\lambda\, p_0(t) - (N-1)\lambda\, p_1(t) - \rho\, p_1(t). \end{aligned}$$

The first addend in the first equation corresponds to the failure transition, taken with rate $N\lambda$, from State 0 to State 1, and the second summand to the repair transition, taken with rate $\rho$, from State 1 to State 0. The first summand in the second equation corresponds to the failure transition from State 0 to State 1 seen above, the second summand to the failure transition, taken with rate $(N-1)\lambda$, from State 1 to the failure state, and the last addend to the repair transition from State 1 to State 0 again. The transition matrix is therefore

$$M = \begin{pmatrix} -N\lambda & \rho \\ N\lambda & -(N-1)\lambda - \rho \end{pmatrix},$$

so that the MTTDL is simply

$$\mathrm{MTTDL} = -\,(1 \;\; 1)\, M^{-1} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \frac{(2N-1)\lambda + \rho}{N(N-1)\lambda^{2}}.$$

In the case of a RAID Level 6, our transition matrix needs to capture transitions between three non-failure states and therefore has shape $3 \times 3$; we obtain the MTTDL again as $-\,\mathbb{1}^{T} M^{-1} \vec{p}(0)$, which can be brought into closed form. In the case of the RAID Level 6+ with a very high degree of declustering, we can assume that the storage system will suffer data loss whenever four or more disks are currently not functioning. In this case, the Markov model becomes particularly easy, and the MTTDL is again given by $-\,\mathbb{1}^{T} M^{-1} \vec{p}(0)$, where $\lambda$ is the individual disk failure rate, that is, the inverse of the expected Mean Time To Failure (MTTF) of a disk, $\rho$ is the repair rate, that is, the inverse of the expected Mean Time To Repair (MTTR), and $N$ is the number of disks; this expression can again be evaluated in closed form.

Figure 9. Mean Time to Data Loss in hours for a fully declustered Level 5 (top), Level 6 (middle), and Level 6+ (bottom) RAID in dependence on the MTTF of each of the individual disks. Both scales are logarithmic.

To illustrate this, Figure 9 gives the MTTDL in hours for an ensemble of 90 disks, for disk MTTFs ranging between 1,000 hours and 1 million hours. The MTTRs are 8 hours, 36 hours, and 100 hours.

E. Discussion

MTTDL figures are among the least useful figures of merit for the reliability of a disk array, but they have the singular advantage that they can often be derived exactly. Other figures of merit, such as the probability of surviving the economic lifespan of a disk array without data loss, are more important.
The fantastic figures (an MTTDL of 100 million years with normal disks!) reflect in part the lack of modeling of catastrophic events and essential component failures, and in part the implicit assumption that disks fail independently. Numerically, e.g. by using the Euler method to solve the fundamental differential equation of the Markov model, we can also derive numbers for the probability that the data in an array survives for 5 years. Another serious limitation of the current analysis is its concentration on disk failures only. However, as long as the merits of various disk array layouts are being compared, this is more of a feature than a bug. In other words, even if one converts the MTTDL numbers into a failure rate per year, the numbers are not realistic, but they still serve to compare the relative reliability merits of disk arrays. It is feasible to expand the Markov model to include latent sector errors and disk scrubbing, but this goes far beyond the goals of this paper. It is also possible to use Markov models to deal with infant disk mortality [12]. All these possibilities soon run into the difficulty of making good modeling assumptions. Furthermore, if one is interested in actual failure rates and not in the relative merits of disk array layouts, one might have to use actual failure statistics or at least change the Markov model to a semi-Markov model in order to model the more generally applicable Weibull failure distribution [11]. Finally, while the large-scale storage systems of the foreseeable future might remain disk based, the superior performance of flash and soon PCM memories will see the expansion of this type of very fast storage. What we have learned about device failure behavior from disks might or might not transfer to calculating the reliability of these devices.

4 Conclusion

We defined two classical extensions of the RAID Level 5 and derived formulae for the capability of these disk arrays to survive a given number of disk failures. We then showed how to use these numbers to derive one particular figure of merit for the resilience of a disk array organized in this manner. No model reflects reality, but some are useful, as the saying goes. We have then shown by example how to use these calculations to determine the mean time to data loss of various declustered RAID levels. RAID Level 6 and Level 6+ can serve as reference points for comparing the resilience achieved by one of the many more sophisticated codes that trade off a slight increase in parity storage overhead for faster average recoveries.

References

[1] J. Gray, personal communication.
[2] L. Bairavasundaram, G. Goodson, S. Pasupathy, and J. Schindler, “An analysis of latent sector errors in disk drives.” SIGMETRICS Performance Evaluation Review, 35 (1), ACM, 289–300, 2007.
[3] B. Schroeder, S. Damouras, and P. Gill, “Understanding latent sector errors and how to protect against them.” ACM Transactions on Storage (TOS), 6 (3), 1–23, 2010.
[4] B. Schroeder and G. A. Gibson, “Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?” ACM Transactions on Storage (TOS), 3 (3), 1–16, 2007.
[5] A. Dholakia, E. Eleftheriou, X. Y. Hu, I. Iliadis, J. Menon, and K. K. Rao, “A new intra-disk redundancy scheme for high-reliability RAID storage systems in the presence of unrecoverable errors.” ACM Transactions on Storage (TOS), 4 (1), 1–42, 2008.
[6] M. Blaum, J. Brady, J. Bruck, and J. Menon, “EVENODD: An efficient scheme for tolerating double disk failures in RAID architectures.” IEEE Transactions on Computers, 44 (2), 192–202, 1995.
[7] J. F. Pâris, A. Amer, and T. Schwarz, “Low-redundancy two-dimensional RAID arrays.” International Conference on Computing, Networking and Communications (ICNC), IEEE, 507–511, 2012.
[8] P. Chen, E. Lee, G. Gibson, R. Katz, and D. Patterson, “RAID: High-performance, reliable secondary storage.” ACM Computing Surveys (CSUR), 26 (2), 145–185, 1994.
[9] T. Schwarz and W. Burkhard, “Reliability and performance of RAIDs.” IEEE Transactions on Magnetics, 31 (2), 1161–1166, 1995.
[10] M. Holland and G. A. Gibson, “Parity declustering for continuous operation in redundant disk arrays.” ACM SIGPLAN Notices, 27 (9) (Proceedings of ASPLOS V), 23–35, 1992.
[11] J. Elerath and M. Pecht, “A highly accurate method for assessing reliability of redundant arrays of inexpensive disks (RAID).” IEEE Transactions on Computers, 58 (3), 289–299, 2009.
[12] Q. Xin, T. Schwarz, and E. Miller, “Disk infant mortality in large storage systems.” In 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 125–134, 2005.