Benchmarking static analyzers

Pascal Cuoq - 29th Apr 2012

A meta-article on benchmarking

Some colleagues and I are about to submit this article on the subject of benchmarking static analyzers. Rather than another benchmark, it is a list of general recommendations and pitfalls to keep in mind when benchmarking static analyzers. The article is illustrated with C program constructs because that is what we are familiar with, but these examples are only intended to show how complicated static analysis can be in general. Had we focused on another language, the underlying ideas would have remained the same: different users and toolmakers have different views of what static analysis is about. And even if they agreed on what static analysis is about, there would still be no perfect oracle to tell what the right answer to a particular testcase is.

Several valid points were raised in discussions while the article was being written, but unfortunately the limit for the category in which we wish to submit is 4 pages, so I have to address them here. Consider this post a sort of addendum to the article.

Why should the C standard be the ultimate authority?

The short, facetious answer to this question is that the C standard is the democracy of C semantics: it is the worst form of reference document we have, except for all the others.

Which version of the C standard?

This is not a big problem in practice. Everyone seems to focus on what compilers use, for obvious reasons, and what compilers use is C90. In Frama-C, we refer to C99 when it is convenient. For instance, C99 defines division in more detail than C90 does. C90 does not specify whether division rounds towards minus infinity (Euclidean division, with a positive remainder) or towards zero (all processors implement division this way, because it means that the signs of the results can be computed independently of their magnitudes, saving transistors). C99 specifies that division rounds towards zero. More information is good, so Frama-C follows C99 here. But when C90 and C99 are mutually exclusive and compilers all implement C90, we follow the compilers. In this post about subtle typing issues, we side with gcc in predicting 2147483648 for the second example, not with gcc -std=c99, which compiles the program as displaying -2147483648. (See the following post for the explanations behind these differences.)
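
To make the difference in division semantics concrete, here is a minimal example of my own (not from the article). A C99 implementation must print -3 and -1; under C90, a conforming implementation could also have printed -4 and 1.

    #include <stdio.h>

    int main(void)
    {
      /* C99 requires integer division to round toward zero:
         -7 / 2 is -3 and the remainder -7 % 2 is -1.
         C90 also allowed rounding toward minus infinity,
         in which case the results would be -4 and 1. */
      printf("-7 / 2 = %d, -7 %% 2 = %d\n", -7 / 2, -7 % 2);
      return 0;
    }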

What if most implementations agree with each other but disagree with the standard?

There are a couple of more interesting examples that we intended for the article to illustrate this very dilemma. One is about pointer comparisons. Technically, the comparison &a < &b is undefined behavior. The same reasoning that, in the article, justifies that analyzers do not have to worry about what happens in the program after an illegal pointer access could be applied to this expression.

The point here is not just that most analyzers do not care about this issue and simply treat the comparison as returning 0 or 1. The point is that there are good reasons for an analyzer to let this kind of comparison slip: it happens naturally when a user implementation of memmove() is passed &a as source and &b as destination. An analyzer that lets this construct slip may be an analyzer that received more testing, and for which usability studies revealed that users were too confused when &a < &b was treated as if it halted execution. Since there are arguments both ways, it is difficult for organizers to decide arbitrarily what an analyzer should do when confronted with this expression. And since the construct does not seem too fundamental, the simple escape that we recommend is to omit it from the benchmark: it should not occur in any testcase, neither as the goal of the testcase (the undefined behavior to be detected) nor as an incidental construct that would bias testcases intended for another criterion.
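
To make the memmove() scenario concrete, here is a minimal sketch of my own (not from the article) of a user implementation that chooses the copy direction by comparing its two pointers; passing it &a as source and &b as destination makes it evaluate exactly the kind of comparison discussed above.

    #include <stddef.h>

    void *my_memmove(void *dst, const void *src, size_t n)
    {
      unsigned char *d = dst;
      const unsigned char *s = src;
      /* When dst and src point into two unrelated objects, this
         relational comparison is the undefined construct discussed
         above, yet the function behaves as intended on typical
         platforms. */
      if (s < d) {
        while (n--)
          d[n] = s[n];          /* overlapping, dst above src: copy backwards */
      } else {
        while (n--)
          *d++ = *s++;          /* copy forwards */
      }
      return dst;
    }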

For the sake of exhaustiveness, this can be contrasted with the condition &a == &b + 1. You may expect this to be undefined too, but for some reason it is only unspecified. The implementation may return 0 or 1, and it is not even required to return the same value every time. Derek Jones would point out here that an implementation could rely on just-in-time compilation, and that the value of &a == &b + 1 could as a result change from one evaluation to the next, for the same expression referring to the same variables a and b. Anyway, it is not undefined, so if the standard were the absolute reference, static analyzers would not be allowed to consider that execution gets stuck on evaluating &a == &b + 1.
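
In code form, the contrast looks like this (a trivial sketch of my own):

    int a, b;

    int contrast(void)
    {
      /* Undefined behavior: relational comparison of pointers to two
         unrelated objects. An analyzer may have reasons to treat
         execution as stuck here, or to continue with {0; 1}. */
      int lt = (&a < &b);

      /* Only unspecified: this may evaluate to 0 or to 1, and it is
         not required to evaluate to the same value every time, so an
         analyzer following the letter of the standard must let
         execution continue with a result in {0; 1}. */
      int eq = (&a == &b + 1);

      return lt + eq;
    }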

In Frama-C's value analysis, we do not see much reason to treat equality and inequality differently, but we recognize that both stopping execution and continuing with the results {0; 1} can have their uses, so we offer an option to choose between these two behaviors. This series of posts already covers this kind of nuance.

Frama-C's value analysis detects, warns about, and considers execution stuck on uses of uninitialized variables. But by design, it avoids warning (and considering execution stuck) on the statement y = x; or *dst = *src;. This is a compromise made necessary by the need to analyze user implementations of memcpy(), which may apply exactly that second statement to uninitialized data when copying a struct or union.
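
Concretely, here is a minimal sketch of my own (not from the article) of a user memcpy() whose inner statement is exactly the *dst = *src; mentioned above; when such a function is used to copy a struct or union, some of the bytes it reads (padding, members never assigned) may be uninitialized.

    #include <stddef.h>

    void *my_memcpy(void *dest, const void *source, size_t n)
    {
      unsigned char *dst = dest;
      const unsigned char *src = source;
      while (n--) {
        /* This reads one byte of *source at a time; the byte may be an
           uninitialized padding byte of a copied struct, and an
           analyzer that treated every such read as blocking would get
           stuck in this perfectly legitimate loop. */
        *dst = *src;
        dst++;
        src++;
      }
      return dest;
    }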

One long-standing bug in Frama-C's front-end is that it erases a possible bug during parsing and normalization. According to the letter of the standard, addressee().a is dangerous. However, this misfeature of the C99 standard has quite a history (in the circle of people who care about this sort of thing), as illustrated by the numerous comments on the linked CERT page [removed dead link]. A static analyzer could omit this bug (and consider that execution continues) as a conscious design decision (for instance, if it specializes in a compiler that documents that this construct is safe). Another static analyzer could treat it as undefined and consider that execution gets stuck. Again, the construct is not very interesting in itself, so it should not be too restrictive to omit it both from the explicit goals of the benchmark and from the incidental features that might perturb other tests.
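
For readers who want to see the shape of the construct, here is a hypothetical sketch. I am assuming that addressee() returns a structure by value and that the member a is an array, the situation in which the C99 rules on accessing the result of a function call are the most contentious.

    struct message { char a[64]; int len; };

    struct message addressee(void);   /* hypothetical: returns a structure by value */

    char first_byte(void)
    {
      /* The structure returned by addressee() is not an lvalue; reading
         through its array member, as below, is the kind of access whose
         legality the C99 standard leaves debatable, as discussed above. */
      return addressee().a[0];
    }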

Finally, there are areas where the standard under-specifies things. One egregious example is floating-point. David Monniaux's report is still the reference I recommend for these floating-point issues.

All but the last of these examples are cases of “Don't do that, then”: the issue can simply be avoided by not using any of the ambiguous constructs for which it is unclear whether an analyzer should be allowed to consider them blocking. Also, all the issues in these first examples disappear when toolmakers are allowed to participate: when their analyzer scores badly on a test because it is better than the others at detecting incidental issues, they will be quick to point out the problem.

For the last example, floating-point, I do not have a solution to offer. Some real-life software issues come specifically from the liberties that the standard allows and from the fact that compilers take advantage of them, so it is a worthy goal for a static analyzer to try to model these liberties, and a worthy goal for a benchmark to measure how well the static analyzer is doing. But floating-point in C is a mess, and since one of the issues is precisely that compilation is under-specified, it does not suffice to use one (or even several) compilations as the reference.
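
As a sketch of the kind of compilation-dependent behavior involved (my own example, not the article's), consider the program below. On a compiler that keeps intermediate results in x87 extended precision, as older 32-bit x86 compilers commonly did, the message can be printed: y was rounded to a 64-bit double when stored, while the recomputed quotient may still be held in an 80-bit register. With strict IEEE 754 double semantics (for instance SSE2 code generation), nothing is printed. The outcome depends on how the program is compiled, not only on its source.

    #include <stdio.h>

    int main(void)
    {
      volatile double a = 1.0, b = 3.0;   /* volatile: discourage constant folding */
      double y = a / b;
      if (y != a / b)
        printf("the two evaluations of a / b differ\n");
      return 0;
    }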

What about implementation-defined behavior?

This is really a non-issue. All real C programs invoke implementation-defined behavior. As an example, the line int x = 38000; invokes implementation-defined behavior: the standard only guarantees that an int can represent values up to 32767, so whether 38000 fits in an int is the implementation's choice. Therefore, all static analyzers that aim to be useful for real programs assume some implementation choices, and they all tend to assume the compilation choices of the dominant platform, so implementation-defined behavior causes very little trouble in practice.
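
To make the implementation's choice observable, here is a small sketch of my own (not from the article); an analyzer has to commit to such a choice, for instance 32-bit ints, before it can say anything precise about the initialization.

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
      /* The standard only promises INT_MAX >= 32767. With 16-bit ints,
         the constant 38000 has type long and the initialization converts
         an out-of-range value to int, with an implementation-defined
         result; with the common 32-bit ints, x is simply 38000. */
      int x = 38000;
      printf("INT_MAX = %d, x = %d\n", INT_MAX, x);
      return 0;
    }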

Acknowledgments

Radu Grigore provided many ideas that did not fit in the article and is directly responsible for the existence of this blog post.
