(INTERNAL) IEEE 754 single-precision representation
Anonym
Topic:
Doug wrote this for the UK.
Discussion:
I would recommend that your customer study the IEEE 754 Floating-point Standard. There are many Web sites with information about the standard. One such site is http://www.loria.fr/serveurs/CCH/documentation/IEEE754/ .
IEEE 754 single-precision floating-point numbers are 32 bits long. One bit represents the sign of the mantissa, eight bits represent the exponent, and 24 bits represent the mantissa. If the reader is paying attention, these numbers add up to 33 bits. Where does the extra bit come from? The answer is that there is a "hidden" bit. The hidden bit is "free" because all IEEE floating-point numbers are stored in normalized form. This means that the highest-order place of the mantissa, the one which is the coefficient of the negative-one power of the base, is always 1. Normalization of a floating-point number involves shifting the mantissa to the left, with a corresponding decrement of the exponent, until the highest-order place is 1. Since the hidden bit is always 1, it isn't necessary to "waste" one of the 32 bits to store it. Instead, only 23 bits need to be stored -- the coefficients of 2^(-2) down to 2^(-24).
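A quick way to see these three fields is to reinterpret a number's 32-bit pattern and mask the pieces apart. A minimal sketch in Python (the function name `fields` is just for illustration):

```python
import struct

def fields(x):
    """Split a number's single-precision bit pattern into its three fields."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]  # 32-bit pattern
    sign     = bits >> 31            # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8-bit biased exponent field
    stored   = bits & 0x7FFFFF       # 23 stored mantissa bits (hidden bit omitted)
    return sign, exponent, stored

print(fields(0.75))   # (0, 126, 4194304): the stored bits are 100...0, i.e. 2^(-2)
```

Note that only 23 mantissa bits come back; the hidden bit never appears in the stored pattern.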
The exponent part of an IEEE 754 floating-point number is a biased exponent, which allows both positive and negative powers of two to be represented. In the convention used here -- a mantissa of the form 0.1fff... -- the bias is 126 decimal. (The IEEE 754 standard itself writes the mantissa as 1.fff... with a bias of 127; the two forms describe the same bit patterns and the same values.) Therefore, 126 represents the zero power of two, 127 represents 2^(1), 128 represents 2^(2), etc. Likewise, 125 represents 2^(-1), 124 represents 2^(-2), etc.
The number being represented is the product of the mantissa and the power of two given by the unbiased exponent.
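That relationship can be sketched in Python, using the fraction convention of this note (bias 126, mantissa 0.1fff...); this sketch ignores zero, denormals, infinities, and NaN:

```python
import struct

def decode(x):
    """Rebuild a number from its single-precision fields:
    value = sign * 2**(e - 126) * 0.1fff...  (fraction convention, bias 126).
    Ignores zero, denormals, infinities, and NaN."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = -1.0 if bits >> 31 else 1.0
    e = (bits >> 23) & 0xFF
    mantissa = (0x800000 | (bits & 0x7FFFFF)) / 2.0**24  # hidden bit restored
    return sign * 2.0**(e - 126) * mantissa

print(decode(0.75))   # 0.75
print(decode(51.3))   # the nearest single-precision value to 51.3
```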
An example:
Consider the number 3/4, which is decimal 0.75. The IEEE 754 single-precision representation of this number is, in hexadecimal (Base 16) notation, 3f400000, where each hexadecimal place represents four binary places. Expanding this to binary, it becomes:
00111111010000000000000000000000
Now, let's spread this out, so the sign bit, the exponent, and the mantissa are easily viewed:
s exponent mantissa
0 01111110 10000000000000000000000
The sign bit is zero, as expected, since 0.75 is positive.
Converting the exponent field to decimal, we get 2 + 4 + 8 + 16 + 32 + 64, which adds up to 126. Subtracting the bias, we get 126 - 126 = 0. Therefore, the power of two is 2^(0), which is 1.
Converting the mantissa to decimal, remembering the hidden bit, we get 2^(-1) + 2^(-2), which is 1/2 + 1/4, which equals 3/4.
Finally, multiplying the mantissa by this power of two, we get 3/4 * 1, which is 3/4, or 0.75.
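The whole example can be checked by handing the hexadecimal pattern straight to Python and letting it reinterpret the four bytes as a single-precision float:

```python
import struct

# Reinterpret the hex pattern from the example as a single-precision float.
value = struct.unpack('>f', bytes.fromhex('3f400000'))[0]
print(value)   # 0.75
```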
Regarding your customer's comment about the number 51.3, a mantissa of 513, and the mantissa divided by 10:
Since computer hardware does not operate in Base 10 arithmetic, this analysis is not correct. The machine must represent 51.3 with a binary exponent and a binary mantissa; it does not store the number as 0.513 times 10^(2).
Many numbers that have exact representations in Base 10 do not have exact representations in Binary (Base 2). A classic example is 1/5, which has the exact Base 10 representation of 0.2. The previous example of 3/4 has exact representations in both bases, 0.75 in Base 10 and 0.11 in Binary.
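Python's `fractions` module makes the difference visible, since `Fraction` recovers the exact binary value a float actually stores:

```python
from fractions import Fraction

print(Fraction(0.75))   # 3/4 -- exactly representable in binary
print(Fraction(0.2))    # a nearby binary fraction, not 1/5
```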
Using your customer's example of 51.3, the IEEE 754 representation, in hexadecimal, is
424d3333
which can be expanded into binary
01000010010011010011001100110011
Separating the three pieces (sign bit, exponent, and mantissa)
s exponent mantissa
0 10000100 10011010011001100110011
The sign bit is zero (positive mantissa).
Converting the exponent field to decimal, we get 4 + 128 = 132. Subtracting the bias, we get 132 - 126 = 6. Therefore, the power of two is 2^6 = 64.
Now, the mantissa is, including the hidden bit, 2^(-1) + 2^(-2) + 2^(-5) + 2^(-6) + 2^(-8) + 2^(-11) + 2^(-12) + 2^(-15) + 2^(-16) + 2^(-19) + 2^(-20) + 2^(-23) + 2^(-24).
If you add up all of these negative powers of two, multiply the result by 64 (the power of two from the exponent), and print the product rounded to one decimal place, you will get 51.3. Try it!
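That arithmetic is easy to try in Python (`powers` is just an illustrative name for the exponents listed above):

```python
# Exponents of the negative powers of two listed above
# (the hidden bit, 2^(-1), plus the set bits of the stored mantissa).
powers = [1, 2, 5, 6, 8, 11, 12, 15, 16, 19, 20, 23, 24]
mantissa = sum(2.0 ** -p for p in powers)
value = 64 * mantissa    # scale by 2^6 from the exponent field

print(round(value, 1))   # 51.3
```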
Now, a comment about significant places and rounding:
The IEEE 754 single-precision format (32 bits), with a 24-bit mantissa, translates to about seven significant places of precision in Base 10. If you ask IDL to print more than seven significant places, the display cannot be rounded properly, because the lower-order places needed to perform the rounding are missing.
The following IDL statements illustrate this:
IDL> print, 51.3, format='(f16.5)'
51.30000
IDL> print, 51.3, format='(f16.6)'
51.299999
The first statement prints 51.3, correctly rounded, with seven significant places.
In the second statement, asking IDL to display one more significant place causes the rounding process to fail, since the required lower-order places are not available.
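The same behavior can be reproduced outside IDL. A sketch in Python, forcing 51.3 through single precision with `struct`:

```python
import struct

# Round-trip 51.3 through a 32-bit float to get the single-precision value.
single = struct.unpack('>f', struct.pack('>f', 51.3))[0]

print(f'{single:16.5f}')   # seven significant places round cleanly
print(f'{single:16.6f}')   # one more place exposes the missing low-order bits
```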
If we ask IDL to print more than seven significant places using a double-precision floating-point representation, there are enough lower-order significant places for IDL to round the display properly:
IDL> print, 51.3d0, format='(d16.5)'
51.30000
IDL> print, 51.3d0, format='(d16.6)'
51.300000
IDL> print, 51.3d0, format='(d16.7)'
51.3000000
IDL> print, 51.3d0, format='(d16.8)'
51.30000000
The last statement calls for ten significant places. With double precision, about 16 significant (Base 10) places are available.
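Python's floats are double precision, so a quick check mirrors the IDL statements above:

```python
# 51.3 as a double carries about 16 significant decimal digits,
# so asking for six, seven, or even eight places still rounds properly.
for places in (5, 6, 7, 8):
    print(f'{51.3:16.{places}f}')
```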