| Section | Page |
|---|---|
| Preface | 14 |
| List of Figures | 16 |
| List of Tables | 19 |
| I Introduction, Basic Definitions, and Standards | 22 |
| 1 Introduction | 23 |
| 1.1 Some History | 23 |
| 1.2 Desirable Properties | 26 |
| 1.3 Some Strange Behaviors | 27 |
| 1.3.1 Some famous bugs | 27 |
| 1.3.2 Difficult problems | 28 |
| 2 Definitions and Basic Notions | 33 |
| 2.1 Floating-Point Numbers | 33 |
| 2.2 Rounding | 40 |
| 2.2.1 Rounding modes | 40 |
| 2.2.2 Useful properties | 42 |
| 2.2.3 Relative error due to rounding | 43 |
| 2.3 Exceptions | 45 |
| 2.4 Lost or Preserved Properties of the Arithmetic on the Real Numbers | 47 |
| 2.5 Note on the Choice of the Radix | 49 |
| 2.5.1 Representation errors | 49 |
| 2.5.2 A case for radix 10 | 50 |
| 2.6 Tools for Manipulating Floating-Point Errors | 52 |
| 2.6.1 The ulp function | 52 |
| 2.6.2 Errors in ulps and relative errors | 57 |
| 2.6.3 An example: iterated products | 57 |
| 2.6.4 Unit roundoff | 59 |
| 2.7 Note on Radix Conversion | 60 |
| 2.7.1 Conditions on the formats | 60 |
| 2.7.2 Conversion algorithms | 63 |
| 2.8 The Fused Multiply-Add (FMA) Instruction | 71 |
| 2.9 Interval Arithmetic | 71 |
| 2.9.1 Intervals with floating-point bounds | 72 |
| 2.9.2 Optimized rounding | 72 |
| 3 Floating-Point Formats and Environment | 74 |
| 3.1 The IEEE 754-1985 Standard | 75 |
| 3.1.1 Formats specified by IEEE 754-1985 | 75 |
| 3.1.2 Little-endian, big-endian | 79 |
| 3.1.3 Rounding modes specified by IEEE 754-1985 | 80 |
| 3.1.4 Operations specified by IEEE 754-1985 | 81 |
| 3.1.5 Exceptions specified by IEEE 754-1985 | 85 |
| 3.1.6 Special values | 88 |
| 3.2 The IEEE 854-1987 Standard | 89 |
| 3.2.1 Constraints internal to a format | 89 |
| 3.2.2 Various formats and the constraints between them | 90 |
| 3.2.3 Conversions between floating-point numbers and decimal strings | 91 |
| 3.2.4 Rounding | 92 |
| 3.2.5 Operations | 92 |
| 3.2.6 Comparisons | 93 |
| 3.2.7 Exceptions | 93 |
| 3.3 The Need for a Revision | 93 |
| 3.3.1 A typical problem: "double rounding" | 94 |
| 3.3.2 Various ambiguities | 96 |
| 3.4 The New IEEE 754-2008 Standard | 98 |
| 3.4.1 Formats specified by the revised standard | 99 |
| 3.4.2 Binary interchange format encodings | 100 |
| 3.4.3 Decimal interchange format encodings | 101 |
| 3.4.4 Larger formats | 111 |
| 3.4.5 Extended and extendable precisions | 111 |
| 3.4.6 Attributes | 112 |
| 3.4.7 Operations specified by the standard | 116 |
| 3.4.8 Comparisons | 118 |
| 3.4.9 Conversions | 118 |
| 3.4.10 Default exception handling | 119 |
| 3.4.11 Recommended transcendental functions | 122 |
| 3.5 Floating-Point Hardware in Current Processors | 123 |
| 3.5.1 The common hardware denominator | 123 |
| 3.5.2 Fused multiply-add | 123 |
| 3.5.3 Extended precision | 123 |
| 3.5.4 Rounding and precision control | 124 |
| 3.5.5 SIMD instructions | 125 |
| 3.5.6 Floating-point on x86 processors: SSE2 versus x87 | 125 |
| 3.5.7 Decimal arithmetic | 126 |
| 3.6 Floating-Point Hardware in Recent Graphics Processing Units | 127 |
| 3.7 Relations with Programming Languages | 128 |
| 3.7.1 The Language Independent Arithmetic (LIA) standard | 128 |
| 3.7.2 Programming languages | 129 |
| 3.8 Checking the Environment | 129 |
| 3.8.1 MACHAR | 130 |
| 3.8.2 Paranoia | 130 |
| 3.8.3 UCBTest | 134 |
| 3.8.4 TestFloat | 135 |
| 3.8.5 IeeeCC754 | 135 |
| 3.8.6 Miscellaneous | 135 |
| II Cleverly Using Floating-Point Arithmetic | 136 |
| 4 Basic Properties and Algorithms | 137 |
| 4.1 Testing the Computational Environment | 137 |
| 4.1.1 Computing the radix | 137 |
| 4.1.2 Computing the precision | 139 |
| 4.2 Exact Operations | 140 |
| 4.2.1 Exact addition | 140 |
| 4.2.2 Exact multiplications and divisions | 142 |
| 4.3 Accurate Computations of Sums of Two Numbers | 143 |
| 4.3.1 The Fast2Sum algorithm | 144 |
| 4.3.2 The 2Sum algorithm | 147 |
| 4.3.3 If we do not use rounding to nearest | 149 |
| 4.4 Computation of Products | 150 |
| 4.4.1 Veltkamp splitting | 150 |
| 4.4.2 Dekker's multiplication algorithm | 153 |
| 4.5 Complex Numbers | 157 |
| 4.5.1 Various error bounds | 158 |
| 4.5.2 Error bound for complex multiplication | 159 |
| 4.5.3 Complex division | 162 |
| 4.5.4 Complex square root | 167 |
| 5 The Fused Multiply-Add Instruction | 169 |
| 5.1 The 2MultFMA Algorithm | 170 |
| 5.2 Computation of Residuals of Division and Square Root | 171 |
| 5.3 Newton–Raphson-Based Division with an FMA | 173 |
| 5.3.1 Variants of the Newton–Raphson iteration | 173 |
| 5.3.2 Using the Newton–Raphson iteration for correctly rounded division | 178 |
| 5.4 Newton–Raphson-Based Square Root with an FMA | 185 |
| 5.4.1 The basic iterations | 185 |
| 5.4.2 Using the Newton–Raphson iteration for correctly rounded square roots | 186 |
| 5.5 Multiplication by an Arbitrary-Precision Constant | 189 |
| 5.5.1 Checking for a given constant C if Algorithm 5.2 will always work | 190 |
| 5.6 Evaluation of the Error of an FMA | 193 |
| 5.7 Evaluation of Integer Powers | 195 |
| 6 Enhanced Floating-Point Sums, Dot Products, and Polynomial Values | 198 |
| 6.1 Preliminaries | 199 |
| 6.1.1 Floating-point arithmetic models | 200 |
| 6.1.2 Notation for error analysis and classical error estimates | 201 |
| 6.1.3 Properties for deriving validated running error bounds | 204 |
| 6.2 Computing Validated Running Error Bounds | 205 |
| 6.3 Computing Sums More Accurately | 207 |
| 6.3.1 Reordering the operands, and a bit more | 207 |
| 6.3.2 Compensated sums | 209 |
| 6.3.3 Implementing a "long accumulator" | 216 |
| 6.3.4 On the sum of three floating-point numbers | 216 |
| 6.4 Compensated Dot Products | 218 |
| 6.5 Compensated Polynomial Evaluation | 220 |
| 7 Languages and Compilers | 222 |
| 7.1 A Play with Many Actors | 222 |
| 7.1.1 Floating-point evaluation in programming languages | 223 |
| 7.1.2 Processors, compilers, and operating systems | 225 |
| 7.1.3 In the hands of the programmer | 226 |