Harder than it looks: rounding float to nearest integer, part 1
Pascal Cuoq - 2nd May 2013This post is the first in a series on the difficult task of rounding a floating-point number to an integer. Laugh not! The easiest-looking questions can hide unforeseen difficulties, and the most widely accepted solutions can be wrong.
Problem
Consider the task of rounding a float
to the nearest integer. The answer is expected as a float
, same as the input. In other words, we are looking at the work done by standard C99 function nearbyintf()
when the rounding mode is the default round-to-nearest.
For the sake of simplicity, in this series of posts, we assume that the argument is positive and we allow the function to round any which way if the float argument is exactly in-between two integers. In other words, we are looking at the ACSL specification below.
/*@ requires 0 ≤ f ≤ FLT_MAX ; ensures -0.5 ≤ \esult - f ≤ 0.5 ; ensures \exists integer n; \esult == n; */ float myround(float f);
In the second ensures
clause, integer
is an ACSL type (think of it as a super-long long long
). The formula \exists integer n; \esult == n
simply means that \esult
, the float
returned by function myround()
, is a mathematical integer.
Via truncation
A first idea is to convert the argument f
to unsigned int
, and then again to float
, since that's the expected type for the result:
float myround(float f) { return (float) (unsigned int) f; }
Obvious overflow issue
One does not need Frama-C's value analysis to spot the very first issue, an overflow for large float
arguments, but it's there, so we might as well use it:
$ frama-c -pp-annot -val r.c -lib-entry -main myround ... r.c:9:[kernel] warning: overflow in conversion of f ([-0. .. 3.40282346639e+38]) from floating-point to integer. assert -1 < f < 4294967296;
This overflow can be fixed by testing for large arguments. Large floats are all integers, so the function can simply return f
in this case.
float myround(float f) { if (f >= UINT_MAX) return f; return (float) (unsigned int) f; }
Obvious rounding issue
The cast from float
to unsigned int
does not round to the nearest integer. It “truncates”, that is, it rounds towards zero. And if you already know this, you probably know too the universally used trick to obtain the nearest integer instead of the immediately smaller one, adding 0.5:
float myround(float f) { if (f >= UINT_MAX) return f; return (float) (unsigned int) (f + 0.5f); }
This universally used trick is wrong.
An issue when the ULP of the argument is exactly one
The Unit in the Last Place, or ULP for short, of a floating-point number is its distance to the floats immediately above and immediately below it. For large enough floats, this distance is one. In that range, floats behave as integers.
There is an ambiguity in the above definition for powers of two: the distances to the float immediately above and the float immediately below are not the same. If you know of a usual convention for which one is called the ULP of a power of two, please leave a note in the comments.
int main() { f = 8388609.0f; printf(\%f -> %f" f myround(f)); }
With a strict IEEE 754 compiler the simple test above produces the result below:
8388609.000000 -> 8388610.000000
The value passed as argument is obviously representable as a float since that's the type of f
. However the mathematical sum f + 0.5
does not have to be representable as a float. In the worst case for us when the argument is in a range of floats separated by exactly one the floating-point sum f + 0.5
falls exactly in-between the two representable floats f
and f + 1
. Half the time it is rounded to the latter although f
was already an integer and was the correct answer for function myround()
. This causes bugs as the one displayed above.
The range of floating-point numbers spaced every 1.0 goes from 2^23 to 2^24. Half these 2^23 values that is nearly 4 millions of them exhibit the problem.
Since the 0.5 trick is nearly universally accepted as the solution to implement rounding to nearest from truncation this bug is likely to be found in lots of places. Nicolas Cellier identified it in Squeak. He offered a solution too: switch the FPU to round-downward for the time of the addition f + 0.5
. But let us not fix the problem just yet there is another interesting input for the function as it currently stands.
An issue when the argument is exactly the predecessor of 0.5f
int main() { f = 0.49999997f; printf("%.9f -> %.9f" f myround(f)); }
This test produces the output 0.499999970 -> 1.000000000
although the input 0.49999997
is clearly closer to 0
than to 1
.
Again the issue is that the floating-point addition is not exact. The argument 0.49999997f
is the last float
of the interval [0.25 … 0.5)
. The mathematical result of f + 0.5
falls exactly midway between the last float of the interval [0.5 … 1.0)
and 1.0
. The rule that ties must be rounded to the even choice means that 1.0
is chosen.
A function that works
The overflow issue and the first non-obvious issue (when ulp(f)=1) can be fixed by the same test: as soon as the ULP of the argument is one the argument is an integer and can be returned as-is.
The second non-obvious issue with input 0.49999997f
can be fixed similarly.
float myround(float f) { if (f >= 0x1.0p23) return f; if (f <= 0.5) return 0; return (float) (unsigned int) (f + 0.5f); }
A better function that works
Changing the FPU rounding mode the suggestion in the Squeak bug report is slightly unpalatable for such a simple function but it suggests to add the predecessor of 0.5f
instead of 0.5f
to avoid the sum rounding up when it shouldn't:
float myround(float f) { if (f >= 0x1.0p23) return f; return (float) (unsigned int) (f + 0.49999997f); }
It turns out that this works too. It solves the problem with the input 0.49999997f
without making the function fail its specification for other inputs.
To be continued
The next post will approach the same question from a different angle. It should not be without its difficulties either.