About the rounding error in these Patriot missiles
Pascal Cuoq - 18th Nov 2012An old rant: misusing Ariane 5 in motivation slides
I was lucky to be an intern and then a PhD student at INRIA, while it was still called “INRIA” (it is now called “Inria”). This was about when researchers at INRIA and elsewhere were taken to task to understand the unfortunate software failure of the Ariane 5 maiden flight. So I heard the story from people I respect and who knew at least a little bit about the subject.
Ever since, it has been irking me when this example is taken for the purpose of motivating some formal technique or other. Unless you attend this sort of CS conference, you might not believe the non-solutions that get motivated as getting us all closer to a world in which Ariane 5 rockets do not explode.
What irks me is this: even if the technique being motivated were comparably expensive to traditional testing, requiring comparable or a little less time for comparable or a little more confidence, that would not guarantee it would have been used to validate the component. The problem with Ariane 5 was not a failure of traditional tests. The problem was that, because of constraints of time and money, traditional tests were not applied to the component that eventually failed.
If your method is not cheaper and faster than not doing the tests, do not imply that it would have saved Ariane 5. It might have been omitted too.
The report is online. Key quote: “no test was performed to verify that the SRI would behave correctly when being subjected to the count-down and flight time sequence and the trajectory of Ariane 5.”
A new rant: the Patriot missile rounding error
If you attend that sort of conference you have also heard about this other spectacular software failure the Patriot missile rounding error. This bug too is used to justify new techniques.
This one did not irk me until today. I did not happen to be in the right place to get the inside scoop by osmosis. Or at the wrong time.
I vaguely knew that it had something to do with a clock inside the Patriot missile measuring tenths of seconds and the constant 0.1 not being representable in binary. When researchers tell the story on a motivation slide it sounds like a stupid mistake.
There is nothing reprehensible in calculating with constants that are not represented finitely in binary. Other computations inside the missile surely involved the number π which is not finitely representable in binary either. The designer of the system simply must understand what ey is doing. It can get a little tricky especially when the software is evolved. Let me tell you about the single-precision
cosf()
function from the Glibc library in another rant.
Similarly with the Ariane 5 case a rather good-looking summary of the issue is available. Assuming that this summary is correct and it certainly looks more plausible than the rash explanations you get at motivation-slide time the next time I hear a researcher use the Patriot missile example to motivate eir talk I will ask the questions that follow.
- When are you adapting your system to ad-hoc 24-bit fixed-point computations (not whether it is theoretically possible but when you are doing it)?
- When are you adapting your system to ad-hoc non-IEEE 754 48-bit floating-point computations?
- Will your system then detect the drift between the same computation having gone one path and the other?
If your system is only practical in an alternate universe in which the Patriot missile software is cleanly implemented with good old IEEE 754 double-precision values: sorry but in that universe the Patriot missile does not exhibit the problem you are claiming to solve.
Thanks to Martin Jambon for proof-reading this post.