Acta Polytechnica Vol. 50 No. 2/2010

FPU-Supported Running Error Analysis

T. Zahradnický, R. Lórencz

Abstract

A-posteriori forward rounding error analyses tend to give sharper error estimates than a-priori ones, as they use actual data quantities. One such a-posteriori analysis, running error analysis, uses expressions consisting of two parts: one generates the error, and the other propagates input errors to the output. This paper suggests replacing the error generating term with an FPU-extracted rounding error estimate, which produces a sharper error bound.

Keywords: analysis of algorithms, a-posteriori error estimates, running error analysis, floating point, numerical stability.

1 Introduction

Rounding error analyses are used to discover the stability of numerical algorithms and provide information on whether the computed result is valid. They are typically performed in the forward or backward direction, and either a-priori or a-posteriori. Backward error analysis [6] treats all operations as if they were exact but performed on perturbed data, and asks for which input data we have actually solved our problem. Backward error analysis is preferred, as every algorithm which is backward stable is automatically numerically stable [5], while this does not hold for forward error analyses. Forward error analyses, on the other hand, treat operations as sources of error and bound how far the computed solution lies from the exact one. Since error analyses performed a-priori can be difficult when the algorithm being analyzed is complex, a-posteriori analyses may provide a more feasible alternative (MSC 2000 classification: 65G20, 65G50).

The objective of an a-priori analysis is to find the worst case bound for an algorithm, if one exists, while a-posteriori analyses calculate the error bound concurrently with the evaluation of the result and work with actual data quantities, providing a tighter error bound.

This paper presents a replacement for the error generating term in error bounds calculated with forward a-posteriori error estimates (also known as running error analysis), obtaining sharper error bounds by exploiting the behavior of the computer's FPU. We expect the calculation to be portable (conforming strictly to [3]) and assume an Intel x86 compatible FPU which supports double extended precision (80-bit) floating point registers, which are essential for this method.

2 Running error analysis and error bounds

Running error analysis [6, 7] is a type of a-posteriori forward error analysis. The algorithm being analyzed is extended to calculate a partial error bound alongside the normal calculation. As the algorithm proceeds, the bounds are accumulated into a total error bound estimate. From here on, a binary double precision floating point arithmetic with a guard digit and the round to nearest rounding mode is assumed.

Definition 1 Let F_t ⊂ Q be a binary floating point set with a precision of t, where Q denotes the set of rational numbers.

Definition 2 Let y = ±m · 2^{e−t} ∈ F_t be a binary floating point number, where m stands for the mantissa, e for the exponent, and t for the precision. Let ex(y) = e.

Lemma 3 Let x̂ = fl(x) be the floating point representation of an exact number x, obtained by rounding x to the nearest element of F_t. The rounding process can be described as

    \hat{x} = \mathrm{fl}(x) = x(1 + \delta), \qquad |\delta| \le u_t,        (1)

where u_t = 2^{−t} is the unit roundoff in precision t. To conserve space, and for clarity, u without a subscript means u_53.
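For instance, in double precision (t = 53) the decimal constant 0.1 has no finite binary expansion and rounds to

    \mathrm{fl}(0.1) = 0.1000000000000000055511151231\ldots = 0.1\,(1 + \delta), \qquad \delta \approx 5.55 \times 10^{-17} \approx u/2,

so the relative error actually committed is about half of the worst-case bound u = 2^{−53} ≈ 1.11 × 10^{−16}.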
Assumption 4 (Standard model) We assume that the operations ♦ ∈ {+, −, ·, /} and the square root follow the Standard model [5] and are evaluated with a relative error no greater than u:

    \hat{s}_\diamond = \mathrm{fl}(\hat{a} \diamond \hat{b}) = (\hat{a} \diamond \hat{b})/(1 + \delta_\diamond), \qquad |\delta_\diamond| \le u,
    \hat{s}_{\surd} = \mathrm{fl}(\sqrt{\hat{a}}) = \sqrt{\hat{a}}/(1 + \delta_{\surd}), \qquad |\delta_{\surd}| \le u,        (2)

where ŝ denotes the computed result of an operation. The Standard model will also be used in the form ŝ_♦ = fl(â ♦ b̂) = (â ♦ b̂)(1 + δ_♦), where useful.

Table 1 summarizes the error bounds for the basic operations and the square root, with products of errors neglected. We also assume that the following holds: ŝ = s/(1 + δ_s) = s + σ, â = a/(1 + δ_a) = a + α, and b̂ = b/(1 + δ_b) = b + β, where α, β, and σ stand for absolute (static) errors.

Table 1: Error bound estimates for basic operations

    Operation                  Error bound
    Addition and subtraction   |σ_±| ≤ u|â ± b̂| + α + β
    Multiplication             |σ_·| ≤ u|âb̂| + α|b̂| + β|â|
    Division                   |σ_/| ≤ u|â/b̂| + (α|b̂| + β|â|)/b̂²
    Square root                |σ_√| ≤ u|√â| + α/(2|√â|)

The right-hand side of each error bound in Tab. 1 contains an error generating term u|·|, while the rest of each expression merely propagates the input error(s) to the output. When we substitute u = 2^{−53} into u|ŝ|, multiplying by u reduces the exponent by 53,

    u|\hat{s}| = 1.\underbrace{xx \ldots xx}_{52\ \text{bits}} \cdot 2^{\mathrm{ex}(\hat{s}) - 2t},        (3)

leaving the mantissa unchanged if u|ŝ| is normal. The unit roundoff accounts for just the leading 1 in (3), while the x digits are generally nonzero. Using u|ŝ| as a rounding error estimate has two problems:

1. The calculated error bound u|ŝ| can be up to two times larger than necessary, since u Σ_{i=0}^{t−1} ŝ_i 2^{−i} ≲ 2u, where ŝ_i refers to the i-th bit of ŝ.

2. The error generating term always generates an error, even when no error was physically committed.

The expression u|ŝ| therefore tends to give a higher error bound estimate than we need. We can, however, revise this expression to obtain a tighter error bound estimate, and we can do so by exploiting the FPU.

3 Analysis

Many computers use a processor with the Intel x86 architecture [4], which features an FPU with 8 double extended precision floating point registers. Once a number is loaded into the FPU, regardless of its precision, it is automatically converted into the double extended precision format. Further calculations are performed in this format but rounded to the precision specified in the floating point control word register (FCW). The default FCW settings specify double extended precision, the round to nearest rounding mode, and mask out all floating point exceptions. If necessary, the FCW can be read with an fstcw instruction, modified, and reloaded with an fldcw instruction. Once a result is stored back into a memory location, it is rounded (or extended, if the FCW specified single precision) to the precision determined by the type of store instruction.

3.1 Obtaining a tighter bound from the FPU

When rounding to the nearest element of F_t, the relative error is no worse than u_t, and we can extract the static error from the FPU by subtracting the value computed in double extended precision from its rounded value. This idea resembles compensated summation [5], which, in contrast to our approach, is implemented entirely in software. Subtracting the two values provides an 11-bit estimate of the real rounding error, which is no greater than u.
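Before formalizing this, the idea can be illustrated in portable C++ without assembly. On x86 with GCC, the long double type maps to the 80-bit double extended format, so rounding an extended intermediate to double and subtracting recovers the rounding error committed by that final rounding. The following is a minimal sketch under that assumption (the function name is ours), not the implementation used later in the paper:

    #include <cmath>
    #include <cstdio>

    // Sketch: add two doubles and return the double-rounded sum together
    // with an estimate of the error committed by the 64 -> 53 bit rounding.
    // Assumes long double is the x87 80-bit double extended format.
    static double add_with_error(double a, double b, double &err)
    {
        long double S = (long double)a + (long double)b; // sum with 64-bit mantissa
        volatile double s = (double)S;                   // round 64 -> 53 bits
        err = std::fabs((double)((long double)s - S));   // |s - S| <= u|s|
        return s;
    }

    int main()
    {
        double err;
        double s = add_with_error(1.0, 1e-17, err);      // 1e-17 fits in extended,
        std::printf("s = %.17g, error estimate = %g\n",  // but is lost in double
                    s, err);
        return 0;
    }

Here the extended sum holds 1 + 10^{−17} exactly enough to see that the subsequent rounding to double discards it, so the extracted error estimate is about 10^{−17}, whereas the blanket bound u|ŝ| would report roughly 1.1 × 10^{−16}.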
The entire idea can be stated as follows:

Theorem 5 Let Ŝ be the result of an operation performed in double extended precision and let ŝ be the result in double precision, obtained by rounding Ŝ to 53 bits with the round to nearest mode defined by the IEEE 754 Standard. The absolute rounding error of an operation in floating point can instead be calculated as Δ = |ŝ − Ŝ|, where |Δ| ≤ u|ŝ|.

PROOF. We can extract the intermediate result with a 64-bit mantissa (Ŝ) from the FPU before it gets rounded to 53 bits (ŝ), and calculate Δ = |ŝ − Ŝ|. The quantity Ŝ is rounded either down to f_1 ∈ F_53 or up to f_2 ∈ F_53, where f_1 and f_2 are adjacent elements of F_53 with |f_2 − f_1|/|f_1| = 2u, and

    \hat{S} \in F_{64}, \quad \hat{S} \in [f_1, f_2], \quad \hat{s} = \mathrm{round}_{64 \to 53}(\hat{S}) \Rightarrow \hat{s} \in \{f_1, f_2\},

where

    \hat{s} = \mathrm{round}_{64 \to 53}(\hat{S}) =
    \begin{cases}
      f_2                 & \text{if } |\hat{S} - f_1|/|f_1| > u \\
      f_1 \text{ or } f_2 & \text{if } |\hat{S} - f_1|/|f_1| = u \\
      f_1                 & \text{if } |\hat{S} - f_1|/|f_1| < u
    \end{cases}        (4)

and the tie case rounds to even, as specified in [3].

Fig. 1: Rounding a double extended precision number to double precision. A value Ŝ between adjacent doubles f_1 and f_2 (which are 2u apart relative to f_1) either rounds down, rounds up, or, on a tie, rounds to even.

In all cases, the relative error can be computed in the same way, |ŝ − Ŝ|/|ŝ| ≤ u, and is always bounded by u. □

Note 1 Theorem 5 is proved for positive numbers; an analogous argument holds for negative ones.

Corollary 6 The expression Δ = |ŝ − Ŝ| ≤ u|ŝ| provides an 11-bit estimate of the static rounding error, and since |Δ| ≤ u|ŝ|, it can be safely substituted for all occurrences of u|ŝ| in Tab. 1, providing a tighter error bound.

3.2 Implementation highlights

Our approach is implemented in C++ as a class fdouble with GCC AT&T inline assembly FPU statements [9]. The fdouble class overloads most operators and some standard functions (currently only the logarithm, power and exponential functions, the absolute value, the remainder, and the square root are available), which are beyond the scope of this paper, providing a handy replacement for the double data type. The transition from the double data type to fdouble is performed simply by changing the name of the data type; the rest is handled by the class. To illustrate how the class works, the source of the plus operator fdouble::operator+ and its assembly portion op_plus, which performs the FPU-supported running error analysis with the terms above, follows:

    fdouble fdouble::operator+(fdouble const &b) const
    {
        fdouble r;
        op_plus(r.d, r.e, this->d, this->e, b.d, b.e);
        return r;
    }

The fdouble::operator+ method calls the op_plus function, which contains the assembly inline and is common to all other fdouble::operator+ overloads:

    inline void op_plus(double& result, double& result_err,
                        register const double a, register const double a_err,
                        register const double b, register const double b_err)
    {
        __asm__ __volatile__(
            "fldl %1\n\t"          // 1. load b
            "fldl %2\n\t"          // 2. load a
            "faddp\n\t"            // 3. calc a+b
            "fstl %3\n\t"          // 4. store rounded result
            "fldl %3\n\t"          // 5. load rounded result
            "fsubp\n\t"            // 6. calc difference
            "fabs\n\t"             // 7. calc its absolute value
            "fldl %4\n\t"          // 8. load a_err
            "faddp\n\t"            // 9. calc diff + a_err
            "fldl %5\n\t"          // 10. load b_err
            "faddp"                // 11. calc diff + a_err + b_err
            : "=t"(result_err)     // 12. put result in result_err
            : "m"(b),              // 13. %1 = b
              "m"(a),              // 14. %2 = a
              "m"(result),         // 15. %3 = result
              "m"(a_err),          // 16. %4 = a_err
              "m"(b_err)           // 17. %5 = b_err
        );
    }
The op_plus function first loads both double precision values a and b (lines 1 and 2) onto the floating point stack, where the FPU extends them to double extended precision. The sum is then calculated (line 3), popping a from the FPU stack and replacing b with the sum. Since the result is still in double extended precision, and the Standard requires the result (Ŝ) to be rounded after each operation, it has to be stored back into memory to obtain the rounded result (line 4); the rounded result (ŝ) is then pushed back onto the floating point stack (line 5). The difference Ŝ − ŝ is calculated (line 6) and its absolute value is taken (line 7). The errors are propagated according to Tab. 1 (lines 8 through 11), and the requested absolute error is left at the top of the FPU stack (line 12). Lines 13 through 17 specify the input operand constraints for the __asm__ directive. Other operators and functions are implemented in a similar way.

Traditional running error analysis is implemented likewise and provides the class radouble. The source code for radouble::operator+ follows:

    radouble radouble::operator+(radouble const &b) const
    {
        radouble r;
        r.d = d + b.d;
        r.e = fabs(r.d) * ROUNDOFF_UNIT + e + b.e;
        return r;
    }

Here no assembly language inlines are necessary, but special compiler flags must be provided in order to obtain the same results as with the fdouble class. These GCC flags are -ffloat-store, which ensures that the result of each floating point operation is stored back into memory and rounded to the defined precision (as required by IEEE 754), and -mfpmath=387, which ensures that the x87 floating point unit is used. Without the second flag, the optimizer could use SSE instructions and produce different results. Readers interested in obtaining the complete source code for all the classes should refer to [9]; we ask that they cite this paper.

3.3 Complexity and implementation notes

Traditional running error analysis needs to calculate u|ŝ|, which takes four x86 floating point instructions: an fld instruction to load u onto the floating point stack, fmul to calculate uŝ, fabs to obtain |uŝ| = u|ŝ|, and fstl to store the result back into memory, rounding it to double precision.

Our approach substitutes |ŝ − Ŝ| for u|ŝ|, and a typical evaluation requires an fldl instruction to load ŝ onto the floating point stack, fsub to calculate ŝ − Ŝ, fabs to obtain the absolute value |ŝ − Ŝ|, and fstl to store the result into memory while rounding it to double precision. (This approach works only with the FPU, not SSE2/SSE3, as SSE does not support double extended precision and we would be unable to calculate |Ŝ − ŝ|.) According to the instruction latency tables [1], the fsub instruction latency is 3 cycles, while the fmul latency is 5. Table 2 compares the two approaches from the latency point of view, showing a small speed advantage:

Table 2: Comparing the FPU instructions necessary to evaluate the absolute error

    Traditional approach          Our approach
    Instruction    Latency        Instruction    Latency
    fldl           3              fldl           3
    fmul           5              fsub           3
    fabs           1              fabs           1
    fstl           3              fstl           3
    Total          12             Total          10

Our approach thus not only provides a better error bound estimate, but also runs at a higher speed, and we should expect a speedup of roughly 20 per cent.

3.4 Case study

The case study compares the presented approach with traditional running error analysis, and also with a forward error bound determined a-priori, on the following mathematical identity:

    y_\infty = \log 2 = \sum_{k=1}^{\infty} \frac{(-1)^{k-1}}{k}.        (5)

C++ and Mathematica [8] are used to verify the results.
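For illustration, the whole case study computation reduces to a short loop. The following is a minimal self-contained sketch of traditional running error analysis applied to (5) in the forward summation order, using a simplified stand-in for the radouble class shown above; the struct and helper are illustrative, not the library's actual interface:

    #include <cmath>
    #include <cstdio>

    // Simplified stand-in for the radouble class of [9].
    struct radouble {
        double d;   // computed value
        double e;   // accumulated absolute error bound
    };

    static const double ROUNDOFF_UNIT = 0x1p-53;   // u = 2^-53

    // Addition per Tab. 1: |sigma| <= u|a + b| + alpha + beta.
    static radouble add(radouble a, radouble b)
    {
        radouble r;
        r.d = a.d + b.d;
        r.e = std::fabs(r.d) * ROUNDOFF_UNIT + a.e + b.e;
        return r;
    }

    int main()
    {
        const long N = 1L << 20;
        radouble sum = {0.0, 0.0};
        for (long k = 1; k <= N; ++k) {
            double t = (k % 2 == 1) ? 1.0 / k : -1.0 / k;   // (-1)^(k-1) / k
            // Rounding error of the division 1/k is u|t| per Tab. 1,
            // since both operands are exact.
            radouble term = {t, std::fabs(t) * ROUNDOFF_UNIT};
            sum = add(sum, term);
        }
        std::printf("sum = %.17g  bound = %g  true error = %g\n",
                    sum.d, sum.e, std::fabs(sum.d - std::log(2.0)));
        return 0;
    }

Switching to the FPU-supported variant means replacing the u|r.d| term in add() with the extracted |ŝ − Ŝ|, exactly as op_plus does.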
A rounding error analysis of (5) for a double precision floating point arithmetic and N addends is given under the following assumptions:

1. The first two addends of the sum (1 and 1/2) are exact, as we use a binary floating point arithmetic.

2. The factor (−1)^{k−1} is evaluated as 1 when k is odd and −1 otherwise, rather than with the pow function from libm, so it is always exact.

3. All operations follow the Standard model.

The computed ŷ_N is expressed as

    \hat{y}_N = \Bigl[ \cdots \Bigl[ \Bigl(1 - \tfrac{1}{2}\Bigr)(1+\delta_1) + \tfrac{1}{3}(1+\delta_2) \Bigr](1+\delta_3) - \cdots \Bigr](1+\delta_{2N-3})        (6)

              = 1(1+\theta_N) - \tfrac{1}{2}(1+\theta_N) + \tfrac{1}{3}(1+\theta_N) - \ldots + \tfrac{(-1)^{N-1}}{N}(1+\theta_2)        (7)

              \le \Bigl( 1 - \tfrac{1}{2} + \tfrac{1}{3} - \tfrac{1}{4} + \cdots + \tfrac{(-1)^{N-1}}{N} \Bigr)(1+\theta_N) = (1+\theta_N) \sum_{k=1}^{N} \frac{(-1)^{k-1}}{k},        (8)

where |δ_i| ≤ u, ∏_{i=1}^{n} (1 + δ_i)^{ρ_i} = 1 + θ_n with ρ_i = ±1, and |θ_n| ≤ nu/(1 − nu) = γ_n, provided nu < 1 [5]. The backward error is immediately visible from equation (7), in which we can read the sum as an exact sum of data entries perturbed by relative amounts certainly bounded by γ_N. The forward a-priori error bound is then

    |\hat{y}_N - y_N| \le \gamma_N \sum_{k=1}^{N} \frac{(-1)^{k-1}}{k},        (9)

and represents the worst case error estimate, which can be far from the true error, as the following section demonstrates.

3.4.1 Results

Results are provided for two summation orders. The first case performs the summation in decreasing order of magnitude, where a poor, insufficiently accurate result quickly becomes visible.

Fig. 2: Calculated value of ŷ_N and its error bounds (forward sum direction). Curves: result error, forward error bound, running error, and our approach; the x-axis shows log_2 N and the y-axis the base 10 logarithm of the error.

Figure 2 depicts this scenario, where N stands for the number of iterations. The first curve presents the absolute error of the result, log_10(|ŷ_N − log 2|), which gets smaller with each further accumulation of the sum. The rounding error bounds grow against this error from bottom to top and are represented by the remaining three curves. From top to bottom: the first presents the worst case, i.e., the a-priori forward error bound obtained from the right-hand side of (9); the second is the a-posteriori running error bound, slightly better than the a-priori one; and the last presents our approach, which accounts only for rounding errors actually committed, which is why the first two iterations show no error at all.

We can also observe that with the a-priori bound we would have to terminate the accumulation somewhere after 2^26 iterations, with the running error bound after 2^27 iterations, and with our approach after approximately 2^28 iterations. Calculating more iterations beyond this bound makes no sense in this case, because each added term is smaller than the error of the result.
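These termination points are consistent with the magnitude of γ_N. As a rough check: at N = 2^26,

    \gamma_N = \frac{Nu}{1 - Nu} \approx Nu = 2^{26} \cdot 2^{-53} = 2^{-27} \approx 7.5 \times 10^{-9},

which is of the same order as the terms 1/N ≈ 1.5 × 10^{−8} still being added at that point, so further accumulation stops being meaningful in this region under the a-priori bound; the sharper a-posteriori bounds grow more slowly and permit a few times more iterations.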
The second case performs the sum in the reverse order, i.e. going from the numbers of smallest magnitude towards greater numbers (see Fig. 3).

Fig. 3: Calculated value of ŷ_N and its error bounds (reverse sum direction); curves and axes as in Fig. 2.

Note that this scenario demonstrates the claim that an error bound obtained a-priori (i.e. the forward error bound) is often too pessimistic and represents the worst case; using it would lead to a premature termination of the sum evaluation.

One more thing is demonstrated: we should accumulate numbers in increasing order of magnitude. If we do not, the numbers of small magnitude contribute less and less to a proportionally huge sum value, and sooner or later further additions no longer change the value of the sum. Due to this fact, the harmonic series Σ_{k=1}^{∞} 1/k, which diverges in exact arithmetic, converges in a finite arithmetic and has a finite sum depending on the type of arithmetic.

Table 3 presents selected portions of the evaluation time for both a-posteriori error analyses, together with the original computation. In addition, the table shows the run time of the objdouble class, which wraps the double data type and is used, after subtracting the double column, to determine the amount of C++ object overhead.

The results were obtained with the POSIX getrusage API, which provides, among other information, the amount of processor time used, in seconds and microseconds, since the start of the process. For measurement purposes, these values are internally converted into milliseconds, and a difference of two millisecond values performs a measurement without interference from other processes and the operating system. Each value was measured 50 times and the results were statistically evaluated [2].

When we compare the run time of traditional running error analysis to a simple evaluation in double precision, we find that object-oriented running error analysis with the radouble class is approximately 2.81 times slower. This slowdown includes the C++ overhead, consisting of object construction and operator calls. The overhead was measured with the objdouble class, and the results show that evaluation with the objdouble class took 2 times longer than with the double data type. Our approach is only 2.35 times slower than the original evaluation with the double data type, and it is 1.19 times faster than the traditional approach. This speedup is what we expected from Tab. 2.

Table 3: Run times in milliseconds for reverse sum evaluations of selected N operations with four data types: the double data type with no error analysis, the radouble data type (object-oriented traditional running error analysis), the fdouble data type (FPU-supported running error analysis), and the objdouble class, which wraps the double data type and is used to measure the C++ overhead. All values are rounded to whole milliseconds.

    log2 N    double    radouble    fdouble    objdouble
    16             1           3          3            2
    20             8          50         42           36
    24           285         803        672          573
    28         4 568      12 849     10 777        9 178
    32        73 120     205 726    172 538      146 847
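The measurement procedure described above can be sketched as follows; this is a minimal illustration of getrusage-based CPU time measurement under the stated methodology, not the authors' actual harness, and the workload shown is illustrative:

    #include <sys/resource.h>
    #include <cstdio>

    // Returns the user CPU time consumed by this process so far,
    // in milliseconds, as reported by the POSIX getrusage API.
    static long user_time_ms(void)
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_utime.tv_sec * 1000L + ru.ru_utime.tv_usec / 1000L;
    }

    int main()
    {
        long t0 = user_time_ms();

        // ... workload under test, e.g. the reverse sum of (5) ...
        volatile double sum = 0.0;
        for (long k = 1L << 24; k >= 1; --k)                 // smallest terms first
            sum = sum + ((k % 2 == 1) ? 1.0 / k : -1.0 / k);

        long t1 = user_time_ms();
        std::printf("elapsed: %ld ms\n", t1 - t0);           // difference of readings
        return 0;
    }

Using CPU time rather than wall-clock time is what makes the measurement insensitive to other processes and the operating system.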
4 Conclusions

The traditional running error analysis approach uses the term u|ŝ| to obtain a rounding error estimate and, as we have shown, this estimate can be up to two times larger than the actual rounding error. Moreover, this term always generates an error, regardless of whether an error was physically committed.

By exploiting the behavior of the floating point unit of the Intel x86 platform, we are able to obtain an 11-bit error estimate by subtracting the rounded result ŝ from its not yet rounded equivalent Ŝ; divided by |ŝ|, this estimate is always less than or equal to the unit roundoff u. Our approach is very similar to the traditional approach; it also needs four FPU instructions, but it replaces a multiplication by a subtraction, which saves 2 CPU cycles per evaluation and obtains an almost 20 per cent speedup over traditional running error analysis.

We encourage the use of running error analysis in all iterative tasks where catastrophic cancellation can occur, such as the evaluation of a numerical derivative. As we have seen in the case study, an error bound determined a-priori can be far from the actual error, and, especially with the provided classes, running error analysis is a very quick and feasible replacement. It costs, however, 2.81 times the plain evaluation time; when replaced by FPU-supported running error analysis, the cost drops to 2.35 times while providing a yet tighter bound. With a tighter bound, we are able to calculate more iterations and be sure that the result is still valid.

Acknowledgement

This research has been partially supported by the Ministry of Education, Youth, and Sport of the Czech Republic under research program MSM 6840770014, and by the Czech Science Foundation as project No. 201/06/1039.

References

[1] Fog, A.: Instruction Tables: Lists of Instruction Latencies, Throughputs and Micro-operation Breakdowns for Intel and AMD CPUs. Copenhagen University College of Engineering, available online as a part of Software Optimization Resources, http://www.agner.org/optimize/, 2008.

[2] Bevington, P., Robinson, D. K.: Data Reduction and Error Analysis for the Physical Sciences. McGraw-Hill Science/Engineering/Math, 2002.

[3] IEEE Computer Society Standards Committee, Working group of the Microprocessor Standards Subcommittee, and American National Standards Institute: IEEE Standard for Binary Floating-point Arithmetic, ANSI/IEEE Std 754-1985. IEEE Computer Society Press, 1985.

[4] Intel Corporation: Intel 64 and IA-32 Architectures Software Developer's Manual, vol. 3A: System Programming Guide, Part 1, 11/2007.

[5] Higham, N. J.: Accuracy and Stability of Numerical Algorithms, 2nd edition. Society for Industrial and Applied Mathematics, 2002.

[6] Wilkinson, J. H.: The State of the Art in Error Analysis. NAG Newsletter, vol. 2/85, 1985.

[7] Wilkinson, J. H.: Error Analysis Revisited. IMA Bulletin, vol. 22, no. 11/12, pp. 192–200, 1986.

[8] Wolfram Research, Inc.: Mathematica 7, http://www.wolfram.com/, 2008.

[9] Zahradnický, T.: Fdouble Class, a Double Data Type Replacement C++ Class with the FPU-Supported Running Error Analysis. Available online at http://service.felk.cvut.cz/anc/zahradt/fdouble.tar.gz, 2007.

Ing. Tomáš Zahradnický was born on 9th March 1979 in Prague, Czech Republic. In 2003 he graduated (MSc) from the Department of Computer Science and Engineering of the Faculty of Electrical Engineering at the Czech Technical University in Prague. Since 2004 he has been a postgraduate student at the same department, where he became an assistant professor in 2007. Since 2009 he has been an assistant professor at the Department of Computer Systems at the Faculty of Information Technologies at the Czech Technical University in Prague. His scientific research focuses on system optimization and on parameter extraction.

Doc. Ing. Róbert Lórencz, CSc. was born on 10th August 1957 in Prešov, Slovak Republic. In 1981 he graduated (MSc) from the Faculty of Electrical Engineering at the Czech Technical University in Prague. He received his Ph.D. degree in 1990 from the Slovak Academy of Sciences.
In 1998 he became an assistant professor at the Department of Computer Science and Engineering at the Faculty of Electrical Engineering at the Czech Technical University in Prague. In 2005 he defended his habilitation thesis and became an associate professor at the same department. Since 2009 he has worked as an associate professor at the Department of Computer Systems at the Faculty of Information Technologies at the Czech Technical University in Prague. His scientific research focuses on arithmetics for numerical algorithms, residual arithmetic, error-free computation and cryptography.

Ing. Tomáš Zahradnický
Doc. Ing. Róbert Lórencz, CSc.
E-mail: zahradt|lorencz@fit.cvut.cz
Department of Computer Systems
Faculty of Information Technologies
Czech Technical University in Prague
Kolejní 550/2, 160 00 Prague, Czech Republic