Csmith testing

Pascal Cuoq - 16th Jan 2012

This page contains various remarks that did not make it into our article about Csmith-testing Frama-C. It is an annex of sorts except that it is written in a less formal style that will be familiar to readers of the Frama-C blog. It is not intended to make sense by itself: if you have not read the article start there and then come back here. Sorry for the disorganized presentation.

Why standard differential testing does not look directly applicable

Here is standard differential testing as described by McKeeman in 1998. It works well when testing compilers and when you can generate defined programs which two different compilers are supposed to transform into binaries that compute the same results.

 
                  Generator 
                      ↓ 
                  C program 
                    ↙  ↘ 
           Compiler1     Compiler2 
               ↓             ↓ 
            Binary1       Binary2 
               ↓             ↓ 
            Result1       Result2

Result1 and Result2 are compared and if they are different then at least one of Compiler1 or Compiler2 must have a bug. This is how Csmith was originally intended to be used. The problem we faced was to take advantage of Csmith to test a static analysis framework (instead of a compiler).

It does not look like it would be fruitful just to replace Compiler1 and Compiler2 in this scheme with Analyzer1 and Analyzer2 because two different analyzers will make different approximations as part of their nominal behavior. Then they would both need to project their results (expressed in different abstract domains and I am assuming both analyzers rely on abstract interpretation here) onto a common abstract domain for comparison. This would cause more approximations. All in all you could hope to detect that something is wrong if the approximated projected values predicted by Analyzer1 had an empty intersection with the approximated projected values predicted by Analyzer2 on the same defined program. However even if a bug manifests itself in one of the analyzer it risks being hidden by all these approximations and remaining unnoticed.

The “value analysis as C interpreter” oracle

So a second static analyzer is not the best oracle to test a static analyzer. Instead we used oracles based on a reference compiler even if in two instances this meant modifying the value analysis solely for the purpose of making it testable. One oracle we used (paragraph Value analysis as C interpreter in the article) was to modify Frama-C's value analysis so that it would behave as a plain interpreter analyzing the Csmith-generated program with maximum precision to the point of predicting the actual value of the checksum computed by the program.

 
                  Generator 
                      ↓ 
                  C program 
                    ↙  ↘ 
           Compiler1      
               ↓         Value analysis 
            Binary1          ↓ 
               ↓           
            Result1       Result2

Here is one representative statement from a Csmith-generated program:

 
    (*l_1209) ^= func_2(func_7(g_13  g_13.f0  g_13.f0   
   func_14(func_20(l_25  ((*l_29)--)  func_32(func_34((((((*l_963) = 
   func_37((safe_unary_minus_func_int32_t_s(( 
   safe_rshift_func_uint16_t_u_s((safe_add_func_uint8_t_u_u(func_47( 
   l_25)  (safe_mul_func_uint16_t_u_u(((safe_add_func_uint64_t_u_u((( 
   *l_102) = 0xC3DD168C87B4EC18LL)  (safe_mul_func_int16_t_s_s(((( 
   g_117 = (safe_sub_func_int16_t_s_s(((safe_lshift_func_int16_t_s_u(( 
   g_116 = (g_115 = ((*l_112) ^= (((g_76.f1.f1   g_110[2]) != &g_111)   
   0xF5F5L))))  l_25)) || g_76.f1.f1)  g_76.f1.f2))) < l_118) >= (-8L))  
   g_65)))) < l_25)  g_13.f0))))  g_76.f1.f3))))  g_119  l_26[1]  &g_65)) 
   == g_964) == 0x24AE7014CC3713C7LL) | l_25)  g_966))  g_415)  
   l_988  g_967  l_118  l_25)  l_25)  l_118  (*g_676)  l_988);

This statement comes from this example Csmith-generated program. Here are the results of the commands that could be executed when following this approach:

 
$ ~/ppc/bin/toplevel.opt  -cpp-command "gcc -C -E -Iruntime -Dprintf=Frama_C_show_each" \ 
 -val-signed-overflow-alarms -stop-at-first-alarm -no-val-show-progress -precise-unions \ 
 -machdep x86_64 -obviously-terminates -big-ints-hex 0 -val t.c 
... 
[value] Called Frama_C_show_each({{ &"checksum = %X\ 
" }} {0x2E412D3}) 
... 
$ gcc -Iruntime t.c && ./a.out 
... 
checksum = 2E412D3

The predicted values are the same meaning that there is no bug to observe here. One kind of “interesting” result in the Chinese curse sense of the word would be if the value analysis predicted a value different from execution say 0x3F52E1A9. This could indicate a soundness bug.

Since we have decided for the purpose of testing that every construct that may appear in Csmith programs should be handled precisely (in a “singleton in singleton out” fashion) another interesting result that would require further investigation would be an imprecise result such as [0 .. 0xFFFFFFFF] for the computed checksum. Although there is nothing incorrect in this result we would be annoyed that it might hide an incorrect operation by absorbing it. One real font-end bug was found while investigating such a result. This bug only caused unjustified and disproportionate loss of precision in the value analysis (not a soundness issue) but it was a typing bug that could have had more severe consequences in another plug-in if it had continued unnoticed.

In fact the Value analysis as C interpreter method found quantities of bugs because it was first to be applied to test Frama-C. Being used first it found most of Frama-C's front-end bugs. Front-end bugs probably could have been noticed using any other oracle. Still it was good to find them using this method that makes them easiest to understand. This is an example of the general idea that section Bug triage in the article tries to convey regarding the bottom-up testing of functionalities.

Fixing front-end bugs was satisfying because they potentially affect all plug-ins. Fixing them makes Frama-C a better framework for any use including for writing new more advanced plug-ins.

This method's main issue is that it uses the value analysis differently from the way it is commonly used. A bug that only occurred when the value analysis is making approximations would not be noticed with the above technique because the value analysis misused as a plain interpreter does not make approximations at all while tested with this oracle.

The more oracles the merrier

This sentence was part of one reviewer's summary:

The idea is to define complementary oracles that cover most of the functionalities of the tool.

This is exactly it. This sentence or one like it should have been part of the submitted article. As one slight nuance I would only say that we tested as much as possible of the functionality of the tool — as much as we knew how to test given the random program generator we had in fact. As an example of untested functionality Csmith does not generate ACSL annotations as part of its programs so Frama-C's handling of ACSL annotations has received very little testing during this campaign.

Another value analysis-based oracle

Early in the lifecycle of the value analysis a bug was privately reported that was characterized by unsound results when +0.0 was the upper bound of the floating-point values a variable could have. This bug was fixed eons ago but had it still been there it could not have been found using the value analysis as an interpreter because in this mode the value analysis never manipulates value sets that are not singletons. In order to get a shot at bugs that occur only when the value analysis manipulates approximate values we used a different oracle described in the article under Printing internal invariants in executable C form.

When used as an analyzer that is with the default options on the same example as previously the value analysis produces results such as:

 
$ ~/ppc/bin/toplevel.opt  -cpp-command "gcc -C -E -Iruntime -Dprintf=Frama_C_show_each" \ 
 -val-signed-overflow-alarms -no-val-show-progress -machdep x86_64 -val t.c 
... 
[value] Called Frama_C_show_each({{ &"checksum = %X\ 
" }} [0..4294967295]) 
... 
[value] Values at end of function main: 
... 
          g_125 ∈ {61; 72; 73} 
          g_228.f0 ∈ {-1; 7} 
...

The analyzer does not have any chance to predict the value of the checksum itself computed by the Csmith-generated program from the contents all global variables. On the other hand the values of some global variables are predicted more or less accurately depending on what the program does with them. Strategically inserting a call to the primitive Frama_C_dump_assert_each() we can get the analyzer to print this information as a snippet of C code:

 
--- t.c	2012-01-10 13:29:26.000000000 +0100 
+++ u.c	2012-01-10 13:54:31.000000000 +0100 
@@ -1600 6 +1600 7 @@ 
         } 
     } 
     platform_main_end(crc32_context ^ 0xFFFFFFFFUL  print_hash_value); 
+    Frama_C_dump_assert_each(); 
     return 0; 
 }

Analyzing the Csmith-generated program thus modified prints:

 
u.c:1603:[value] Frama_C_dump_assert_each called: 
        (*(unsigned int*)&i == 1 
... 
         && 61 <= *(unsigned char*)&g_125 && *(unsigned char*)&g_125 <= 73 
         && -1 <= *(char*)&g_228 && *(char*)&g_228 <= 7 
...

The entire memory state as predicted by the value analysis is printed as a C expression. Being pretty-printed directly from the untyped memory representation it is not very readable but it is intended for the compiler and the compiler does not mind. If the analyzer is right the expression evaluated at the end of the program should be true. A sed script inserts the expression into the program for instance:

 
    //    Frama_C_dump_assert_each(); 
    int seems_ok = <generated expression> ; 
    printf("seems ok:%d\ 
"  seems_ok);

The instrumented program is then compiled and run:

 
$ gcc -Iruntime v.c && ./a.out  
... 
checksum = 2E412D3 
seems ok:1

This time we are only concerned by the run-time value of variable seems_ok. If this variable is not printed as 1 for a defined input program it indicates a bug in the analyzer or in the compiler. This technique can be summarized with the diagram below.

 
                  Generator 
                      ↓ 
                  C program 
                    ↙  ↘ 
                        Value analysis 
                   ↓         ↓ 
                        Executable invariant 
                    ↘  ↙   
                  Instrumented program 
                      ↓ 
                  Compiler 
                      ↓ 
                   Binary 
                      ↓ 
                  Diagnostic

The primitive Frama_C_dump_assert_each() too was implemented specifically for the purpose of Csmith testing. A refined version may find uses in the qualification of the value analysis when used for certification credit.

Earlier I was saying that the ACSL-manipulating component of Frama-C received little testing in this campaign. Still a couple of bugs found were related to the ACSL annotations. These were for alarms emitted by the value analysis as ACSL annotations in the mode described here (working as a static analyzer) and subsequently re-interpreted by the value analysis. All Frama-C developers do not agree that the value analysis re-interpreting alarms that come from itself is a good idea (it is not supposed to learn anything new from them! In theory anyway). Nonetheless this helped identify some bugs so the idea has its merits if only as a kind of self-check.

Testing the constant propagation plug-in

Frama-C's constant propagation plug-in transforms programs. It replaces any expression that can only take one value during execution (as predicted by the value analysis) by this value. The result of this plug-in is supposed to be a program that produces the same results as the original. This can be tested following the diagram below:

 
                  Generator 
                      ↓ 
                  C program 
                    ↙  ↘ 
           Compiler      Constant propagation plug-in 
                             ↓ 
               ↓          Transformed program 
                             ↓ 
            Binary1       Compiler 
                             ↓ 
               ↓          Binary2 
                             ↓ 
            Result1       Result2

Testing the slicing plug-in

Frama-C's slicing plug-in provides another program transformation. Frama-C's slicer is semantic. When used with the slicing criterion “preserve the call to printf() and the checksum it receives as argument” it is supposed to produce a program that prints the same checksum as the original while removing any useless statement it finds. There are many of these in a randomly generated program despite the checksum being computed over all global variables at the end of the program.

 
                  Generator 
                      ↓ 
                  C program 
                    ↙  ↘ 
           Compiler      Slicing plug-in 
                             ↓ 
               ↓          Transformed program 
                             ↓ 
            Binary1       Compiler 
                             ↓ 
               ↓          Binary2 
                             ↓ 
            Result1       Result2

The slicer may remove from the Csmith-generated program assignment to local variables that do not subsequently impact the final values of global variables like this:

 
void func_123(void) 
{ 
/* int l_437; 
   l_437 = g_232 + g_12; */ 
   g_15 = 3; 
}

The slicer may also remove assignments to global variables that are subsequently over-written without the intermediate value being used in the final checksum like this:

 
void func_1(int p_1) 
{ 
   int *l_12 = &g_15; 
/* g_15 = x; */ 
   *l_12 = 3; 
}

It goes without saying that a sound slicer given the criterion “slice for the value of the checksum” should preserve all computations that influence the value of the checksum and consequently produce a program that computes the same checksum. That is the definition of “slicer” (or the definition of “sound” depending how you want to look at it).

In our running example the randomly generated func_2() looks like this:

 
/*  
 * reads : 
 * writes: 
 */ 
static int32_t  func_2(struct S3  p_3  int32_t  p_4  int64_t  p_5  int16_t  p_6) 
{ /* block id: 665 */ 
    int16_t l_1206 = 0x5E39L; 
    int32_t *l_1207 = (void*)0; 
    int32_t *l_1208[7] = {(void*)0  (void*)0  &g_116  (void*)0  (void*)0  &g_116  (void*)0}; 
    int i; 
    p_3.f2 &= l_1206; 
    return p_3.f3; 
}

The version sliced for the global criterion “compute the same checksum” sticks to the computation of the function's result and omits all the declarations and initializations of local variables. The parameters p_4 p_5 p_6 were not used for anything so these have been removed from the prototype of the function:

 
static int32_t func_2_slice_1(struct S3 p_3) 
{ 
  int32_t __retres; 
  __retres = (int)p_3.f3; 
  return (__retres); 
}

Perhaps the func_2() example seems too simple. One could argue it was obvious that most of func_2() was useless. Here is a more complicated example function func_1() from the same Csmith-generated program. The original is like this:

 
/*  
 * reads : g_13 g_28 g_56 g_60 g_64 g_75 g_76.f1.f4 g_76.f3 g_96 g_76.f2 g_76.f1.f1 g_110 g_113 g_76.f1.f2 g_65 g_76.f1.f3 g_119 g_125 g_116 g_191 g_261 g_117 g_288 g_115 g_76.f1 g_228.f0 g_103 g_287 g_76 g_414 g_415.f4 g_228 g_482 g_192 g_58 g_111 g_415.f0 g_596 g_415 g_642 g_597 g_656 g_675 g_680 g_676 g_575 g_631 g_776 g_805 g_964 g_966 g_777 g_967 g_1075 g_965 g_1095 g_1188 g_1189 g_1210 
 * writes: g_28 g_56 g_58 g_60 g_64 g_75 g_96 g_76.f2 g_103 g_113 g_115 g_116 g_117 g_125 g_191 g_228 g_119 g_111 g_261 g_65 g_287.f0 g_287 g_414 g_76.f0 g_76.f3 g_642 g_656 g_675 g_680 g_482 g_931 g_631 g_1068 g_13 g_1210 
 */ 
static struct S0  func_1(void) 
{ /* block id: 0 */ 
    uint32_t l_25 = 4294967290UL; 
    uint32_t *l_26[6]; 
    uint32_t *l_27[3][3] = {{&g_28  &g_28  &g_28}  {&g_28  &g_28  &g_28}  {&g_28  &g_28  &g_28}}; 
    uint32_t *l_29 = &g_28; 
    uint64_t *l_102 = &g_103; 
    int16_t *l_112 = &g_113[2][0]; 
    int16_t *l_114[5]; 
    uint8_t l_118 = 0x86L; 
    uint32_t **l_963 = &l_27[2][0]; 
    int32_t l_988 = 0xAEDD27D2L; 
    int32_t *l_1209 = &g_1210[3]; 
    struct S0 l_1213[7] = {{9L}  {-1L}  {9L}  {-1L}  {9L}  {-1L}  {9L}}; 
    const struct S0 *l_1215 = (void*)0; 
    const struct S0 **l_1214 = &l_1215; 
    int32_t *l_1216[2][9] = {{&g_116  (void*)0  &g_116  (void*)0  &g_116  (void*)0  &g_116  (void*)0  &g_116}  {&g_116  (void*)0  &g_116  (void*)0  &g_116  (void*)0  &g_116  (void*)0  &g_116}}; 
    int32_t *l_1218 = (void*)0; 
    int i  j; 
    for (i = 0; i < 6; i++) 
        l_26[i] = (void*)0; 
    for (i = 0; i < 5; i++) 
        l_114[i] = &g_115; 
    (*l_1209) ^= func_2(func_7(g_13  g_13.f0  g_13.f0  func_14(func_20(l_25  ((*l_29)--)  func_32(func_34((((((*l_963) = func_37((safe_unary_minus_func_int32_t_s((safe_rshift_func_uint16_t_u_s((safe_add_func_uint8_t_u_u(func_47(l_25)  (safe_mul_func_uint16_t_u_u(((safe_add_func_uint64_t_u_u(((*l_102) = 0xC3DD168C87B4EC18LL)  (safe_mul_func_int16_t_s_s((((g_117 = (safe_sub_func_int16_t_s_s(((safe_lshift_func_int16_t_s_u((g_116 = (g_115 = ((*l_112) ^= (((g_76.f1.f1   g_110[2]) != &g_111)   0xF5F5L))))  l_25)) || g_76.f1.f1)  g_76.f1.f2))) < l_118) >= (-8L))  g_65)))) < l_25)  g_13.f0))))  g_76.f1.f3))))  g_119  l_26[1]  &g_65)) == g_964) == 0x24AE7014CC3713C7LL) | l_25)  g_966))  g_415)  l_988  g_967  l_118  l_25)  l_25)  l_118  (*g_676)  l_988); 
    l_1218 = l_1209; 
    return (**g_776); 
}

A large part of function func_1() is the representative line quoted earlier. The version of func_1() in the sliced program looks a little bit different because a normalization phase occurred before the slicer even started working. This normalization is usually not an issue for human-written programs:

 
static void func_1_slice_1(void) 
{ 
  uint32_t l_25; 
  uint32_t *l_29; 
  uint64_t *l_102; 
  int16_t *l_112; 
  uint8_t l_118; 
  int32_t l_988; 
  int32_t *l_1209; 
  int32_t tmp_17; 
  struct S3 tmp_16; 
  int32_t tmp_10; 
  uint16_t tmp_9; 
  uint8_t tmp_8; 
  uint16_t tmp_6; 
  uint64_t tmp_5; 
  int16_t tmp_3; 
  int16_t tmp_2; 
  int tmp_1; 
  uint64_t tmp_4; 
  uint8_t tmp_7; 
  uint32_t *tmp; 
  l_25 = (unsigned int)4294967290UL; 
  l_29 = & g_28; 
  l_102 = & g_103; 
  l_112 = & g_113[2][0]; 
  l_118 = (unsigned char)0x86L; 
  l_988 = (int)0xAEDD27D2L; 
  l_1209 = & g_1210[3]; 
  { /*undefined sequence*/  
    { /*undefined sequence*/  
      *l_112 = (short)((long)*l_112 ^ 0xF5F5L); 
      g_115 = *l_112; 
      g_116 = (int)g_115; 
    } 
    tmp_1 = 1; 
    tmp_2 = safe_sub_func_int16_t_s_s_slice_1((short)tmp_1 (short)g_76.f1.f2); 
    g_117 = (unsigned int)tmp_2; 
    tmp_3 = safe_mul_func_int16_t_s_s_slice_1((short)((long)(g_117 < (uint32_t)l_118) >= -8L)  
                                              (short)g_65); 
    tmp_4 = (unsigned long)0xC3DD168C87B4EC18LL; 
    *l_102 = tmp_4; 
    tmp_5 = safe_add_func_uint64_t_u_u_slice_1(tmp_4 (unsigned long)tmp_3); 
    tmp_6 = safe_mul_func_uint16_t_u_u_slice_1((unsigned short)(tmp_5 < (uint64_t)l_25)  
                                               (unsigned short)g_13.f0); 
    tmp_7 = func_47_slice_1((int)l_25); 
    tmp_8 = safe_add_func_uint8_t_u_u_slice_1(tmp_7 (unsigned char)tmp_6); 
    tmp_9 = safe_rshift_func_uint16_t_u_s_slice_1((unsigned short)tmp_8  
                                                  (int)g_76.f1.f3); 
    tmp_10 = safe_unary_minus_func_int32_t_s_slice_1((int)tmp_9); 
    tmp = func_37_slice_1((unsigned int)tmp_10 g_119 & g_65); 
    func_34_slice_1((unsigned short)((unsigned int)((long long)(tmp == g_964) == 0x24AE7014CC3713C7LL) | l_25)); 
    func_32_slice_1(); 
    *l_29 -= (uint32_t)1; 
    func_14_slice_1(l_988); 
    tmp_16 = func_7_slice_1(); 
    tmp_17 = func_2_slice_1(tmp_16); 
    *l_1209 ^= tmp_17; 
  } 
  return; 
}

The struct S0 returned by func_1() in the original program was not used for anything. Consequently in the sliced program func_1() has been changed in a function returning void. Some of the computations taking place in local variables are kept because in the complete program they end up influencing the final value of a global variable and therefore the computed checksum. I will let the reader judge for emself how obvious the frontier between useful and removable computations was.

The command to reproduce the slicing computation above should be something like this:

 
frama-c -no-collapse-call-cast -cpp-command "gcc -C -E -D__FRAMAC -Iruntime" -machdep x86_64 -val-signed-overflow-alarms -no-val-show-progress -slice-calls Frama_C_show_each  -slevel 50 -slevel-function crc32_gentab:0 t.c -then-on 'Slicing export' -print -ocode sliced.c

Does random testing waste time?

A valid question with random testing is whether any of the bugs found would ever have occurred in a human-written program. Indeed some of the bugs found seem pretty unlikely. There are several answers to this question.

One answer is that the craftsman takes satisfaction in fixing even unnoticeable defects. Frama-C is still small enough to aim to behave as specified. One may argue in some cases that the specification is the wrong one and some bugs have been postponed for a long time but the developers still aim at eventually fixing all bugs.
The confidentiality culture in embedded systems and the open-source nature of Frama-C often mean that we receive bug reports that aren't all that they should be. We would rather handle perfect automatically generated bug reports from a script than some of the tripe that gets submitted on Frama-C's bug tracking system even if an automatically generated bug report has only one chance in three to have an incidence on human-written programs.
In parallel to the Csmith experiment we were using Frama-C for case studies discovering and fixing bugs the traditional way. As a last argument in this discussion some of the bugs we found using the traditional techniques had a distinct “this could have been found with Csmith” feeling. Specifically bugs found the traditional way in Frama-C's typing of the C switch construct would have been found in a jiffy by Csmith if only Csmith generated switch constructs.

But wait! There's more

The complete set of scripts and patch used is also provided (if you can make sense of the stuff you are a better programmer than I am). If you want to report bugs found with these scripts or with a variation of them please do reduce them yourself before submitting.

The authors thank Eddie Kohler and the anonymous reviewers of the article for their remarks on intermediate versions of the article.