← Harder than it looks: rounding float to nearest integer, part 1 Rounding float to nearest integer, part 3 →

Rounding float to nearest integer, part 2

Pascal Cuoq - 3rd May 2013

The previous post offered to round a positive float to the nearest integer represented as a float through a conversion back and forth to 32-bit unsigned int. There was also the promise of at least another method. Thanks to reader feedback there will be two. What was intended to be the second post in the series is hereby relegated to third post.

Rounding through bit-twiddling

Several readers seemed disappointed that the implementation proposed in the last post was not accessing the bits of float f directly. This is possible of course:

  assert (sizeof(unsigned int) == sizeof(float));
  unsigned int u;
  memcpy(&u  &f  sizeof(float));

In the previous post I forgot to say that we were assuming 32-bit unsigned ints. From now on we are in addition assuming that floats and unsigned ints have the same endianness so that it is convenient to work on the bit representation of one by using the other.

Let us special-case the inputs that can mapped to zero or one immediately. We are going to need it. We could do the comparisons to 0.5 and 1.5 on u because positive floats increase with their unsigned integer representation but there is no reason to: it is more readable to work on f:

  if (f <= 0.5) return 0.;
  if (f <= 1.5) return 1.;

Now to business. The actual exponent of f is:

  int exp = ((u>>23) & 255) - 127;

The explicit bits of f's significand are u & 0x7fffff but there is not need to take them out: we will manipulate them directly inside u. Actually at one point we will cheat and manipulate a bit of the exponent at the same time but it will all be for the best.

A hypothetical significand for the number 1 aligned with the existing significand for f would be 1U << (23 - exp). But this is hypothetical because 23 - exp can be negative. If this happens it means that f is in a range where all floating-point numbers are integers.

  if (23 - exp < 0) return f;
  unsigned int one = 1U << (23 - exp);

You may have noticed that since we special-cased the inputs below 1.5 variable one may be up to 1 << 23 and almost but not quite align with the explicit bits of f's significand. Let us make a note of this for later. For now we are interested in the bits that represent the fractional part of f and these are always:

  unsigned int mask = one - 1;
  unsigned int frac = u & mask;

If these bits represent less than one half the function must round down. If this is the case we can zero all the bits that represent the fractional part of f to obtain the integer immediately below f.

  if (frac <= one / 2)
  {
    u &= ~mask;
    float r;
    memcpy(&r  &u  sizeof(float));
    return r;
  }

And we are left with the difficult exercise of finding the integer immediately above f. If this computation stays in the same binade this means finding the multiple of one immediately above u.

“binade” is not a word according to my dictionary. It should be one. It designates a range of floating-point numbers such as [0.25 … 0.5) or [0.5 … 1.0). I needed it in the last post but I made do without it. I shouldn't have. Having words to designate things is the most important wossname towards clear thinking.

And if the computation does not stay in the same binade such as 3.75 rounding up to 4.0? Well in this case it seems we again only need to find the multiple of one immediately above u which is in this case the power of two immediately above f and more to the point the number the function must return.

  u = (u + mask) & ~mask;
  float r;
  memcpy(&r  &u  sizeof(float));
  return r;

To summarize a function for rounding a float to the nearest integer by bit-twiddling is as follows. I am not sure what is so interesting about that. I like the function in the previous post or the function in the next post better.

float myround(float f)
{
  assert (sizeof(unsigned int) == sizeof(float));
  unsigned int u;
  memcpy(&u  &f  sizeof(float));
  if (f <= 0.5) return 0.;
  if (f <= 1.5) return 1.;
  int exp = ((u>>23) & 255) - 127;
  if (23 - exp < 0) return f;
  unsigned int one = 1U << (23 - exp);
  unsigned int mask = one - 1;
  unsigned int frac = u & mask;
  if (frac <= one / 2)
    u &= ~mask;
  else
    u = (u + mask) & ~mask;
  float r;
  memcpy(&r  &u  sizeof(float));
  return r;
}

To be continued again

The only salient point in the method in this post is how we pretend not to notice when significand arithmetic overflows over the exponent for inputs between 1.5 and 2.0 3.5 and 4.0 and so on. The method in next post will be so much more fun than this.