That paper explains some useful optimisation details, but since the finite floats are all (infinities aside) integer multiples of a known tiny fraction (the smallest positive subnormal), we can certainly sum them exactly.
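If you want to convince yourself of that, here's a quick check in Python (as_integer_ratio gives the exact rational value of a double, in lowest terms with a power-of-two denominator):

    x = 0.1
    n, d = x.as_integer_ratio()      # exact: x == n / d, d a power of two
    # d divides 2**1074, so x is an integer multiple of 2**-1074,
    # the smallest positive subnormal double:
    print((n * 2**1074) % d == 0)    # True for every finite double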
reply
Not if the ratio between the largest and smallest floats is very large: on the order of 2^(2^n), where n is the number of bits in the exponent.
reply
I think you either haven't thought about this or you did your math wrong.

You need (2^e)+m+1 bits, where e is the number of exponent bits and m the number of fraction bits. That can be more than fits in the cheap machine integer type you just have lying around, but it's not that many in real terms.

Let's try a tiny one first: the "half-precision" or f16 type, with 5 bits of exponent, 10 bits of fraction, and 1 sign bit. We need 2^5 + 10 + 1 = 43 bits, which will actually fit in the 64-bit signed integer type on a modern CPU.

Now let's try f64, the big daddy: 11 exponent bits, 52 fraction bits, 1 sign bit, so 2^11 + 52 + 1 = 2101 bits total. As I said, it doesn't fit in our machine integer types, but it's much smaller than a kilobyte of RAM.
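To make that concrete, here's a rough Python sketch of the idea (Python's arbitrary-precision ints stand in for the ~2101-bit accumulator; finite inputs assumed, and the function name is mine, not from the paper):

    def exact_sum(xs):
        # Fixed-point accumulator in units of 2**-1074, the smallest
        # positive subnormal double; every finite double is an integer
        # multiple of that, so each addition below is exact.
        acc = 0
        for x in xs:
            n, d = x.as_integer_ratio()  # exact: x == n / d, d a power of two
            acc += (n << 1074) // d      # d divides 2**1074, so no rounding
        return acc / (1 << 1074)         # one correctly-rounded division at the end

    print(exact_sum([1e300, 1.0, -1e300]))  # 1.0; naive left-to-right sum gives 0.0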

reply