Finding unexpected bugs in the DARPA CGC corpus

André Maroneze - 26th Feb 2019

Recently, we’ve been running Frama-C/Eva on the DARPA challenges from the SATE VI Classic track tool evaluation, organized by NIST. The C programs come from a Linux-compatible port of the DARPA Cyber Grand Challenge. The port was made by Trail of Bits and used as one of the evaluation sets of SATE VI.

These challenges contain each one or more bugs which lead to exploitable vulnerabilities. One of the goals of the Cyber Grand Challenge was to benchmark tools which should be able to automatically identify and patch these vulnerabilities.

However, given the amount of programs (over 240) and the number of lines of code (over 5 million LoC; this includes large tables of initializers and multiple copies of standard library functions, but it is nevertheless an impressive amount of code), the results were not entirely unexpected: Frama-C/Eva managed to find some cases of undefined behavior which were not listed as intentional bugs by the authors. In this post, we report some of these bugs, not all of which are currently detected by sanitizers (such as GCC’s and Clang’s) or Valgrind.

Normalizing the CGC challenges

The challenge set contains code written by several different companies (Kaprica Security, Cromulence, Narf Industries, and others). Each challenge is mostly independent of external libraries: functions such as strlen and malloc are defined in the lib directory of each challenge. One of the consequences of this fact is that the functions have slight variations in their signatures: some follow the C standard, while others replace void* with char*, or omit the const qualifier. Input/output functions, such as receive_* and transmit_*, contain a large amount of variants, which differ in number and order of arguments.

All of these facts make it more difficult to apply a uniform and generic wrapper for all challenges, to automate their testing with Frama-C/Eva. For maximal precision in the analysis, Eva uses its own built-ins for functions in string.h and stdlib.h. However, since each challenge includes its own version of the C standard library, with varying function signatures, an automatic replacement of all these functions with a simple set of stubs is not easy; in some cases, the files will not parse, for instance when a function has the same name as another but differs in the number of arguments.

This normalization phase involved some manual work, by tring to parse each challenge and using generic stubs/wrappers when possible, fixing issues on a case-by-case basis.

Some challenges use non-standard C language extensions, such as Clang’s block extension, structs containing other structs which themselves contain flexible array members, or arithmetic operations with void pointers. In some cases, it is easy to patch them (e.g. by adding non-void* casts), but in others, such as the block extension, the transformation is not so straightforward. A few tests were not analyzed due to these syntactic deviations.

Finding the low-hanging fruit

By using Frama-C’s analysis scripts, creating a generic Makefile, and some stubs for libc functions, we were able to start analyzing the CGC challenges in search of definitive problems (typically, red alarms in the GUI). Some of them are the “typical” causes of undefined behavior that go unnoticed in most cases: logical shifts of negative numbers; and access of uninitialized variables. The former is usually deterministic for a given architecture (although non-portable, and not future-proof: a new version of a compiler may change the behavior), but the latter can lead to nondeterministic problems.

Other cases of (unintended) undefined behavior found in the challenges include off-by-one errors (incorrect loop conditions, or memory allocation for a string which does not include the terminating NUL byte), conditions in the wrong order (e.g. dereferencing an array element before checking if the index is in range) and double free. A summary of findings is available at the Github open source case studies wiki. It will be updated when there are new findings.

Almost all of the cases can be easily fixed once the issue is detected. The Frama-C GUI allows navigating between callers, filtering call stacks, and providing context information to identify the origin of the issue. Once it is fixed, re-running the analysis confirms the issue is resolved, and Eva is able to advance further in the program execution. In one case, fixing an issue revealed a second one later in the analysis.

Sanitizers help, but not always

Sanitizers can be used to help identify undefined behavior, but they are not always sufficient. Even when execution is not dependent on input data (e.g. during initialization), and therefore the undefined behavior is always triggered, several conditions may affect whether sanitizers will detect them.

For instance, in Square Rabbit (a program which simulates a Blackjack-style card game), the user gives inputs such as “H”(it) or “S”(tay), and the amount money to bet, and the program simulates a random deck of cards, a dealer, and tracks the amount of money the player has.

The injected vulnerability in this test concerns a counter related to the number of split hands, which overflows and leads to a crash. However, it turns out that the readline function used in this program has an undocumented issue: whenever the input buffer is full (i.e., the user sent 512 characters or more in a single line), there is an access to the byte just after the last one in the array:

static int readline(int fd, char *buf, size_t size)
  size_t i;
  size_t rx;

  for (i = 0; i < size; i++) {
    if (receive(fd, buf, 1, &rx) != 0 || rx == 0)
    if (*buf == '\n')

  if (i == size && *buf != '\n') // <<< accessing buf[512]
    return -1;
  else if (*buf != '\n')
    return 1;
    *buf = '\0';

  return 0;

When unrolling the loop in Frama-C/Eva (with a loop unroll annotation, for instance), and using option -eva-split-return, to ensure separation of states from the previously called functions, Eva is capable of inferring that, when the input has a line with 512 characters or more, the only way to enter the conditional after i == size is with a pointer to one-past the last element in the array. Thus, reading the byte at this position constitutes an undefined behavior, although unlikely to be exploited. Note that, if that character (which might be the first byte of the next variable declared in the stack, after the array) is a newline, then not only is this one-past element read, but it is also written to, replaced with \0.

As an exercise, we would like to reproduce this bug in an actual execution, using tools such as Valgrind and code sanitizers to instrument the code and obtain an assertion error when the memory is unduly accessed.

Runtime confirmation of undefined behavior using a sanitizer

Let us compile the code using Clang and AddressSanitizer:

clang -fsanitize=address -o square_rabbit src/*.c lib/*. -I. -Ilib

You may have to adapt your command line to ensure all required files are included, with the proper directives.

Then, let us provide some input that triggers the undefined behavior, containing a line larger than 511 characters. The following Python one-liner produces an input that starts a new game and then provides a line with just enough characters to trigger the buffer overflow:

python -c "print('39\n1'); print('A' * 512); print('1000\nS\nQUIT\n2\n')" | \

The above input will:

  • provide a random seed (39);
  • start a new game (option 1);
  • provide some useless input that overflows the buffer (512 A’s);
  • play a game (1000/Stay/QUIT);
  • finally, quit the program (option 2).

If square_rabbit had been compiled without a sanitizer, it would just run normally, and the user would be unaware of the undefined behavior. But Address Sanitizer will detect the off-by-one error and stop the execution:

==20070==ERROR: AddressSanitizer: stack-buffer-overflow on address (...)

Address 0x7ffd436f5ca0 is located in stack of thread T0 at offset 544 in frame
    #0 0x511e7f in play_squarerabbit src/game_sultan.c:260

  This frame has 1 object(s):
    [32, 544) 'input' (line 265) <== Memory access at offset 544 overflows this variable

Valgrind can also detect some kinds of undefined behavior

Another tool that can help detect such issues is Valgrind. Running square_rabbit with our input (without instrumenting it with AddressSanitizer) will report the following:

==21267== Conditional jump or move depends on uninitialised value(s)
==21267==    at 0x401BBB: readline (game_sultan.c:73)
==21267==    by 0x401704: play_squarerabbit (game_sultan.c:255)
==21267==    by 0x401EAA: main (game_sultan.c:376)

The message is less precise than the one reported by AddressSanitizer, but it contains the essential information. For such a small piece of code, the overhead imposed by Valgrind is barely noticeable (0.2s instead of 0.05s).

In the heap, detection is not so easy

In both cases, we were somewhat “lucky” that the error was detected by the tools. Some minor changes, however, can make the error go unnoticed. For instance, if we move the declaration of the local buffer variable (char input[INPUT_SIZE]) to a global declaration, moving it from the stack to the heap, both tools fail to detect the off-by-one overflow.

That is, unless we include another modification: if we move input to become a global variable, and we add an initializer to it (even an initializer semantically equivalent to no initializer at all, such as = {0}), then AddressSanitizer is able to detect the overflow – but not Valgrind. Your mileage may vary according to your compiler version, optimization flags, etc.

Overall, relying on sanitizers and/or Valgrind to identify undefined behavior is not the safest bet: first, they can only detect issues in the current execution, which might be input-dependent; second, even when undefined behavior is triggered, the memory layout of impacted variables may prevent these tools from detecting and reporting the issue.


The DARPA challenges provide a very interesting set of non-trivial C programs for code analyzers. Their inclusion in the SATE 6 Classic track is a good choice for evaluating bug finders. As expected for a code base of such size, there are a few cases of unintentional bugs that evaded testing. Using Frama-C/Eva to analyze these programs allowed the identification of some such bugs. We hope our online report of these findings should help the SATE team to address them in future versions of the tool evaluation. We also expect to include some of the challenge programs in our open source case studies suite.

André Maroneze
26th Feb 2019