[Courses] [C Doubt]

Wed Oct 9 13:12:35 EST 2002

Most C implementations use the IEEE floating point representation for
doubles (and all the rest of their floating point types), although the
C standard does not require it.  For more than enough details, search
the web for "IEEE 754".

Here is a simplified explanation: The IEEE floating point
representation is very similar to scientific notation.  One bit is
used for the sign, some bits are used for the exponent, and some bits
are used for the mantissa.  IEEE 754 uses 2 as the base, so floating
point numbers are represented like:

mantissa * 2^(exponent)

Since we're using binary, we know that the first bit of a normalized
mantissa will be 1 (since it's the only non-zero digit in binary), so
we don't have to store it.

On my computer, a double is 64 bits[1], so the exponent is 11 bits and
the fraction (mantissa without the first bit) is 52 bits (1 sign + 11
exponent + 52 fraction = 64).

Quick calculations:
2375726401805877617098752 ~= 2 * 10^24

log2(2 * 10^24)  = log2(2 * (10^3)^8) 
                ~= log2(2 * 1024^8)
                ~= log2(2 * (2^10)^8)
                ~= log2(2 * 2^80)
                ~= 81

Which means the number requires on the order of 81 bits to be
represented precisely as an integer.  A 64 bit type can only hold 2^64
distinct values, so an arbitrary integer of this size cannot be
represented precisely in 64 bits, floating point or not.

Now, to estimate the approximate error:

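The mantissa in a 64 bit float has 53 bits of precision (52 stored
fraction bits plus the implied leading 1).  Therefore 81 - 53 = 28
low-order bits are lost, and neighbouring representable values at this
magnitude are 2^28 apart.  A single rounding to the nearest value is
off by at most half of that, but as a rough estimate the error is on
the order of 2^28.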

2^28  = 2^8 * 2^20
      = 256 * 1024^2
     ~= 256 * (10^3)^2
     ~= 256 * 10^6
     ~= 2.56 * 10^8

which is about the same order of magnitude as your difference of about
6.8 * 10^7.

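[1] according to this code:
-----
#include <stdio.h>

int main(void)
{
  /* sizeof counts bytes (multiply by 8 for bits on most machines) and
     yields a size_t, so it is printed with %zu rather than %d */
  fprintf(stderr, "%zu\n", sizeof(double));
  return 0;
}
-----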
-- 
laurel at sdf.lonestar.org
http://dreadnought.gorgorg.org