← Portable modulo signed arithmetic operations Reading indeterminate contents might as well be undefined →

Correct rounding or mathematically-correct rounding?

Pascal Cuoq - 3rd Mar 2013

CRlibm is a high-quality library of floating-point elementary functions. We used it as reference a long time ago in this blog while looking at lesser elementary function implementations and the even lesser properties we could verify about them.

A bold choice

The CRlibm documentation contains this snippet:

[…] it may happen that the requirement of correct rounding conflicts with a basic mathematical property of the function such as its domain and range. A typical example is the arctangent of a very large number which rounded up will be a number larger than π/2 (fortunately ◦(π/2) < π/2). The policy that will be implemented in crlibm will be

• to give priority to the mathematical property in round to nearest mode (so as not to hurt the innocent user who may expect such a property to be respected) and

• to give priority to correct rounding in the directed rounding modes in order to provide trustful bounds to interval arithmetic.

The choice for directed rounding modes is obviously right. I am concerned about the choice made for round-to-nearest. The documentation states the dilemma very well. One can imagine slightly out of range values causing out-of-bound indexes during table look-ups and worse things.

I seldom reason about floating-point programs. I work on static analysis and am only concerned about floating-point inasmuch as it is a requirement for writing a static analyzer correct for programs that include floating-point computations.

However when I do reason about floating-point programs I am more often compounding approximations starting from the base assumption that a correctly rounded function returns a result within 1/2ulp of the mathematical result than I am assuming that atan(x) ≤ π/2. The choice the CRlibm implementors made means that suddenly the reasoning I often make is wrong. The value of atan(x) in the program may not be 1/2ulp from the real arctangent of the same x. It can be more when x is very large and mathematical-correctness overrode correct rounding.

Truck drivers fall asleep at the wheel when they face long dull stretches of straight empty roads. Similarly it is good to have another special case to consider when reasoning about floating-point computations. With only infinites and denormals to worry about it can get you know a bit too easy.

Oh well it's only π/2

In this section I rhetorically assume that it is only π/2 for which there is a problem. The CRlibm documentation reminds us that in the case of double precision we were lucky. Or perhaps it isn't luck and the IEEE 754 committee took the desirableness of the property (double)π/2 < π/2 into account when it chose the number of bits in the significand of the double-precision format.

How lucky (or careful) have we been exactly? Let us test it with the program below — assuming my compilation platform works as intended.

#include <stdio.h> 
#define PI(S) 3.1415926535897932384626433832795028841971693993751##S 
float f = PI(f); 
double d = PI(); 
long double ld = PI(L); 
int main(){ 
  printf("   3.14159265358979323846264338327950288419716939937510"); 
  printf("f  %.50f"  f); 
  printf("d  %.50f"  d); 
  printf("ld %.50Lf" ld); 
}

The result of compiling and executing the program is for me:

   3.14159265358979323846264338327950288419716939937510 
f  3.14159274101257324218750000000000000000000000000000 
d  3.14159265358979311599796346854418516159057617187500 
ld 3.14159265358979323851280895940618620443274267017841

As you can see the nearest single-precision float to π is above π as is the nearest 80-bit long double. The same goes for π/2 because the floating-point representations for π and π/2 only differ in the exponent. Consequently the issue raised by the CRlibm implementors will come up for both functions atanf() and atanl() when it is time to get them done. We were not very lucky after all (or careful when defining the IEEE 754 standard).

A subjective notion

But what exactly is the informal “mathematical correctness” notion that this post is predicated upon? Yes the “innocent user” may expect mathematical properties to be respected as much as possible but there are plenty of mathematical properties! Let us enumerate some more:

If x ≤ 1 in a program then exp(x) should always be lower than the mathematical constant e.

So far so good. The above is a good rule for an exponential implementation to respect. We are making progress.

Here is another property:

If x ≥ 1 in a program then exp(x) should always be greater than the mathematical constant e.

We are decidedly unlucky today because at most one of these is going to be true of any floating-point function exp(). The programmatic value exp(1) must be either above or below the mathematical constant e (it is never equal to it because the mathematical constant e does not have a finite representation in binary).

Why does it matter anyway?

Let us revisit the argument:

to give priority to the mathematical property in round to nearest mode (so as not to hurt the innocent user who may expect such a property to be respected)

I alluded to a possible problem with a programmer computing an array index from atanf(x) under the assumption that it is always lower than π/2. But how exactly would an innocent user even notice that atanf(1e30) is not lower than π/2? The value π/2 cannot exist in eir program any more than e. The user might innocently write an assertion like:

assert(atanf(x)<=(3.1415926535897932f/2.0f));

This assertion will never trigger! The function atanf() will indeed return at most the single-precision float 3.1415926535897932f/2.0f. It does not matter that this number is actually slightly larger than π/2. For all intents and purposes in the twisted world of single-precision floating-point this number is π/2.

Conclusion

There are other scenarios in which the innocent user might genuinely have an unpleasant surprise. The result of a computation may be converted to decimal for humans to read and the user may be surprised to see a value outside the range ey expected. But this user would have the wrong expectations just as if ey expected 10.0 * atan(x) to always be less than 5π. Plenty of these users and developers can be found. But my opinion for what it is worth is that by making special cases you are not helping these users only feeding their delusions.

The correct way to set expectations regarding the results of a floating-point program is numerical analysis. Numerical analysis is hard. Special cases such as the authors of CRlibm threaten to implement only seem to make it harder.