Rounding float to nearest integer, part 3

Pascal Cuoq - 4th May 2013

Two earlier posts showed two different approaches in order to round a float to the nearest integer. The first was to truncate to integer after having added the right quantity (either 0.5 if the programmer is willing to take care of a few dangerous inputs beforehand or the predecessor of 0.5 so as to have fewer dangerous inputs to watch for).

The second approach was to mess with the representation of the float input trying to recognize where the bits for the fractional part were deciding whether they represented less or more than one half and either zeroing them (in the first case) or sending the float up to the nearest integer (in the second case) which was simple for complicated reasons.

Variations on the first method

Several persons have suggested smart variations on the first theme included here for the sake of completeness. The first suggestion is as follows (remembering that the input f is assumed to be positive and ignoring overflow issues for simplicity):

float myround(float f) 
{ 
  float candidate = (float) (unsigned int) f; 
  if (f - candidate <= 0.5) return candidate; 
  return candidate + 1.0f; 
} 

Other suggestions were to use modff() that separates a floating-point number into its integral and fractional components or fmodf(f 1.0f) that computes the remainder of f in the division by 1.

These three solutions work better than adding 0.5 for a reason that is simple if one only looks at it superficially: floating-point numbers are denser around zero. Adding 0.5 takes us away from zero whereas operations f - candidate modff(f iptr) and fmodf(f 1.0) take us closer to zero in a range where the answer can be exactly represented so it is. (Note: this is a super-superficial explanation.)

A third method

General idea: the power of two giveth and the power of two taketh away

The third and generally most efficient method for rounding f to the nearest integer is to take advantage of this marvelous rounding machine that is IEEE 754 arithmetic. But for this to work the exact right machine is needed that is a C compiler that implements strict IEEE 754 arithmetic and rounds each operation to the precision of the type. If you are using GCC consider using options -msse2 -mfpmath=sse.

We already noticed that single-precision floats between 2^23 and 2^24 are all the integers in this range. If we add some quantity to f so that the result ends up in this range wouldn't it follow that the result obtained will be rounded to the integer? And it would be rounded in round-to-nearest. Exactly what we are looking for:

  f                 f + 8388608.0f 
_____________________________ 
0.0f                8388608.0f 
0.1f                8388608.0f 
0.5f                8388608.0f 
0.9f                8388609.0f 
1.0f                8388609.0f 
1.1f                8388609.0f 
1.5f                8388610.0f 
1.9f                8388610.0f 
2.0f                8388610.0f 
2.1f                8388610.0f 

The rounding part goes well but now we are stuck with large numbers far from the input and from the expected output. Let us try to get back close to zero by subtracting 8388608.0f again:

  f                 f + 8388608.0f             f + 8388608.0f - 8388608.0f 
____________________________________________________________________ 
0.0f                8388608.0f                              0.0f 
0.1f                8388608.0f                              0.0f 
0.5f                8388608.0f                              0.0f 
0.9f                8388609.0f                              1.0f 
1.0f                8388609.0f                              1.0f 
1.1f                8388609.0f                              1.0f 
1.5f                8388610.0f                              2.0f 
1.9f                8388610.0f                              2.0f 
2.0f                8388610.0f                              2.0f 
2.1f                8388610.0f                              2.0f 

It works! The subtraction is exact for the same kind of reason that was informally sketched for f - candidate. Adding 8388608.0f causes the result to be rounded to the unit and then subtracting it is exact producing a float that is exactly the original rounded to the nearest integer.

For these inputs anyway. For very large inputs the situation is different.

Very large inputs: absorption

  f                 f + 8388608.0f             f + 8388608.0f - 8388608.0f 
____________________________________________________________________ 
1e28f                  1e28f                               1e28f 
1e29f                  1e29f                               1e29f 
1e30f                  1e30f                               1e30f 
1e31f                  1e31f                               1e31f 

When f is large enough adding 8388608.0f to it does nothing and then subtracting 8388608.0f from it does nothing again. This is good news because we are dealing with very large single-precision floats that are already integers and can be returned directly as the result of our function myround().

In fact since we entirely avoided converting to a range-challenged integer type and since adding 8388608.0f to FLT_MAX does not make it overflow (we have been assuming the FPU was in round-to-nearest mode all this time remember?) we could even caress the dream of a straightforward myround() with a single execution path. Small floats rounded to the nearest integer and taken back near zero where they belong large floats returned unchanged by the addition and the subtraction of a comparatively small quantity (with respect to them).

Dreams crushed

Unfortunately although adding and subtracting 2^23 almost always does what we expect (it does for inputs up to 2^23 and above 2^47) there is a range of values for which it does not work. An example:

  f                 f + 8388608.0f             f + 8388608.0f - 8388608.0f 
____________________________________________________________________ 
8388609.0f          16777216.0f                        8388608.0f 

In order for function myround() to work correctly for all inputs it still needs a conditional. The simplest is to put aside inputs larger than 2^23 that are all integers and to use the addition-subtraction trick for the others:

float myround(float f) 
{ 
  if (f >= 0x1.0p23) 
    return f; 
  return f + 0x1.0p23f - 0x1.0p23f; 
} 

The function above in round-to-nearest mode satisfies the contract we initially set out to fulfill. Interestingly if the rounding mode is other than round-to-nearest then it still rounds to a nearby integer but according to the FPU rounding mode. This is a consequence of the fact that the only inexact operation is the addition. The subtraction being exact is not affected by the rounding mode.

For instance if the FPU is set to round downwards and the argument f is 0.9f then f + 8388608.0f produces 8388608.0f and f + 8388608.0f - 8388608.0f produces zero.

Conclusion

This post concludes the “rounding float to nearest integer” series. The method highlighted in this third post is actually the method generally used for function rintf() because the floating-point addition has the effect of setting the “inexact” FPU flag when it is inexact which is exactly when the function returns an output other than its input which is when rintf() is specified as setting the “inexact” flag.

Function nearbyintf() is specified as not touching the FPU flags and would typically be implemented with the method from the second post.

Pascal Cuoq
4th May 2013