Static analysis benchmarks

Pascal Cuoq - 10th Jan 2012


The first benchmark Frama-C's value analysis was ever evaluated on was the Verisec suite, described in "A buffer overflow benchmark for software model checkers". My colleague Géraud Canet was doing the work thence the facetious-colleagues tag of this post. In retrospect Verisec was a solid piece of work. Being naive when it comes to statistics I didn't recognize it at the time but when I later examined other benchmarks I developed an appreciation for this one. It does a lot of things right.

  • It contains good examples as well as bad. A static analyzer is for separating good programs from bad ones. It should be obvious that you cannot evaluate how well a static analyzer does using only bad programs.
  • The article's title says it's for "Software Model Checkers" but it does a good job of being accessible to a wide range of tools with varying abilities. Some good/bad program pairs are available in "no pointer only arrays" variants and in "small arrays only" variants providing more comparison points between different techniques. This may help discover that some technique detects a wrong behavior with small arrays but does not scale to larger ones whereas another one does not detect them at all.
  • The entries in the benchmark are moderate-sized snippets of code with unknown inputs. If by contrast in most or all of the examples the inputs were fixed and assuming that these examples did not escape in an infinite loop then a detection strategy based on the concrete execution of the program step by step and detection of wrong behaviors on the fly would sail through the benchmark. The benchmark would accurately capture the fact that for such cases an analyzer can easily be made correct and complete. What would be missing would be that users of static analysis tools turn to them to detect wrong behaviors that can occur with some inputs among many and not just to detect that there are some wrong behaviors when known (dangerous) inputs are used.

Géraud was using the Verisec benchmark right too. He spent enough time on each case to uncover some interesting facts. One uncovered interesting fact was a definite bug in one of Verisec's "good" cases where the targeted bug was supposed to be fixed. The explanation was this: there were two bugs in the real-life snippet that had been used as "bad" case and only one had hitherto been noticed. The known bug had been fixed in the corresponding "good" case in the benchmark and in real life. The unknown bug had not as the song of La Palisse might have said.

When a static analysis benchmark is done right as the Verisec benchmark is it tries to measure the ability of tools to solve a hard problem: predicting whether something happens at run-time in a non-trivial program with unknown inputs. Furthermore the reason there is a need for static analyzers and for a benchmark is that humans are not good either at recognizing issues in code. So if the benchmark is representative of the problems a static analyzer should solve there exists no perfect oracle that can help decide what the correct answers are on each snippet and it can be expected to contain a few bugs classified as "good" code.

Is safety static analysis incredibly immature?

In a recent post I compared Frama-C's value analysis to PolySpace and Astrée. It was a single-point comparison in each case each time using an example chosen by the people who sell the analyzer an example used on the analyzer's website to demonstrate the analyzer's precision. In that post's comments Sylvain regretted "that the field of static error-detection for embedded C (as opposed to security-oriented static analysis Cf. the [NIST SAMATE] test suite) is incredibly immature from a customer point of view".

His conclusion is at the opposite of the idea I was trying to express there. I was trying to say that embedded safety static analysis was mature and that all that remained to do to convince was to stop making unrealistic claims and derisive comparisons. Talk about scoring own goals. This blog post is my penitence for not having been clear the first time.

The field of software security is orders of magnitude more complex and disorganized than that of embedded safety. It deals with programs that are larger and use more difficult constructs (e.g. dynamic allocation). The properties that need to be verified are numerous and widely dispersed:

  • There are the undefined behaviors that cause something to happen that was not intended by the programmer. These can breach any of a system's confidentiality integrity or availability. They are the same undefined behaviors that Polyspace Astrée or Frama-C's value analysis detect give or take that a previous version of Astrée did not detect uninitialized variables or that the Nitrogen version of Frama-C's value analysis (initially intended for embedded code) does not include the modelization of free() I have been working on recently — and therefore does not detect undefined behaviors related to free().
  • There are security issues in code without undefined behaviors such as the hashing denial of service attack everyone is talking about these days.
  • And as a last illustration of the variety of security properties timing information leaks the detection of which I recently discussed belong to security considerations.

As a digression we are not worried about the hashing denial of service attack in Frama-C but the counter-measures in OCaml 3.13 (that allows to seed hash functions) are going to be a bit of a bother for Frama-C developers. He that among you never iterated on a hash-table when displaying the results of a regression test let him first cast a stone.

The point is that software security people need a suite like NIST's to describe by example what their field is about (more than 45 000 C/C++ cases to date). The field of embedded safety does not have a similar suite because it does not need one. AbsInt the company that sells Astrée did not wait for me to (gently) make fun of their website before they implemented uninitialized variable detection: they did it of their own volition because they knew Astrée should do it and they had already released a new version with that feature at the time my colleague Florent Kirchner tried their example and noticed the uninitialized variable.

A non-benchmark

Also I think it's a slippery slope to call what NIST has a "test suite" because before long someone will call it a "benchmark" and NIST's suite is no benchmark. Many of the examples have definite inputs making them unrepresentative — specifically making them too easy to handle with techniques that do not scale to the billions of possible inputs that must be handled routinely in real life. There does not seem to be nearly enough "good" examples in the suite: the few I looked at were all "bad" examples and based on this sample an analyzer that always answered "there is an issue in the target code" would get a very good mark indeed if this was a benchmark.

Worse in some cases the "bad" line to find is extremely benign and there are more serious issues in the case. Consider entry 1737 one of the few I looked at:

void test( char * str ) 
  char * buf; 
  char * newbuf; 
  char * oldbufaddress; 
  buf = malloc( MAXSIZE ); 
  if ( !buf ) 
  strncpy( buf  str  MAXSIZE ); 
  buf[MAXSIZE - 1] = '\0'; 
  printf( "original buffer content is : %s"  buf ); 
  oldbufaddress = buf; 
  buf = realloc( buf  1024 ); /*BAD */ 
  printf( "realloced buffer content is : %s"  buf ); 
  printf( "original buffer address content is : %s"  oldbufaddress ); 
  free( buf ); 
int main( int argc  char * * argv ) 
  char * userstr; 
  if ( argc > 1 ) 
    userstr = argv[1]; 
    test( userstr ); 

With the proper modelization for printf() free() and realloc() such as I am currently working on Frama-C's value analysis would warn about both printf() calls. The warning for the first call is because realloc() can fail and return NULL (passing NULL to printf() invokes undefined behavior). In this case there famously also exists a memory leak problem as the initial block is not freed but a memory leak is not an undefined behavior. The warning for the second call is because oldbufaddress becomes indeterminate if realloc() moves the block.

The above is not what the author of the test case wants to hear. What the author of the test case wants to hear clarified by the CWE-244 reference is that the realloc() call is dangerous because if realloc() moves the block it may leave the confidential information that was contained in the block lying on the heap. This is fine if the case is intended as a starting point for discussion (although it would be better without an undefined behavior unrelated to the specific issue being tested).

One problem is that there is no indication in the program that the block contains anything secret. A static analyzer that warns for this program should warn for any use of realloc() since this is how realloc() is intended to be used (it is intended to preserve the contents of the block being resized). An optimal analyzer for this issue is grep realloc *.c.

It does not hurt to zero the secret data before you free() a block and avoid realloc() in a "belt and suspenders" approach to security but if you use Frama-C's value analysis properly to verify that the program does not have undefined behaviors you do not have to do this and the code will still be safe from heap inspection vulnerabilities. Perhaps using the value analysis "properly" is inapplicable in your case; but you should at least be aware of the design choices here. Case 1737 by itself just seems to pick one specific circumvention method for a general issue and then confuses the resolution of the general issue with the application the specific method. The map is not the territory.

Were this case used as a benchmark it would very much remind me of primary school tests written by teachers without any deep understanding of the matter they were teaching and where you had to guess what was in these teachers' minds at each of their unclear questions. Here for instance the value analysis warns that the program may use a dangling pointer in the second printf() — intended to symbolize the information leak. If the program instead called malloc() and tried to use the contents of the new block without having initialized it then the value analysis would warn about that. Generally speaking any way that I know of the contents of the initial block can be read from the heap is flagged at some point as an undefined behavior but this is not what case 1737 tests because case 1737 tests "analyzer warns about realloc()" and the correct answer is to warn about realloc().


Coming back to the question of what field is immature from a customer point of view if your interest is in software security NIST's suite certainly can be used as a basis to understand what kind of static analyzer you may be after. You can try candidate static analyzers on cases that you have selected as most representative of your own problems. But it won't save you from reading each candidate's documentation understanding how each of them work and carefully examining their results. Any mechanical interpretation of the results will lead you to pick grep realloc without even understanding case 1737 and its limitations.

The issue for comparing embedded safety static analyzers is that the design space is too large and the design space for security static analysis certainly isn't any smaller. And if someone ever boasts that their security-oriented software analyzer covers 95% of the NIST SAMATE "benchmark" they are probably selling snake oil. Just saying.

Pascal Cuoq
10th Jan 2012