Parsing realistic code bases with Frama-C

André Maroneze - 6th Jul 2018

A recurring subject when using Frama-C, especially intriguing for newcomers, is how to reach the stage where parsing of a realistic code base (as in, one that uses external libraries and compiles to different architectures) succeeds. This post will show a bit of the depth involved in the subject, presenting frequently encountered issues, and discussing some possible solutions, with varying trade-offs between short-term and long-term viability.

Frama-C parsing and external libraries

Most C software is non-portable, at least in the POSIX sense: usage of Linux-, BSD- or glibc-specific libraries is present in the majority of C code bases. This is not a huge issue for compilers, since they only need to know which symbols exist. The library code itself is responsible for its behavior.

Frama-C, however, in order to perform analyses without having access to the source code of external libraries, requires extra information, typically ACSL function contracts. Extremely simple contracts can be generated automatically (as in, “returns any possible value”), but as soon as there is mutable global state, or pointers being passed to/returned from a function, automatic generation tends to fail.

The usual process of writing a specification consists in:

  1. Reading the English documentation about the function (e.g. its man page, preferably from POSIX, if availale);
  2. Writing requires/assigns/ensures clauses;
  3. Testing and refining the specification to improve how plug-ins handle it.

Even when a function is not in POSIX, there is usually a man page (e.g. from the Linux Programmer’s Manual) with an English specification of the function, which may also serve as reference for the ACSL contract.

Due to the sheer amount of existing C libraries, having specifications for all of them is currently infeasible. In order of priority, Frama-C’s libc includes specifications from (1) functions in the C standard; (2) POSIX functions; (3) non-POSIX functions (GNU, BSD, etc).

After parsing each source file separately, Frama-C performs a linking step in which function declarations and type definitions are merged if they appear in several files. For instance, if two files both include stdlib.h, there will be two prototypes for malloc(), both referring to the same function. However, two files defining a static function with the same name will lead to two different functions: one of them will be renamed, e.g. with a trailing _1.

During the linking step, if there are multiple ACSL specifications for a given function, they are merged; in particular, if one version of the function has an ACSL specification but not the other, the merged version will have it.

If users want to benefit from Frama-C’s libc specifications, they must ensure that (1) FRAMAC_SHARE/libc is included in the preprocessor’s include path (this is the default), and (2) an explicit #include directive is present in the code. For instance, a .i source which has been preprocessed while including only standard system headers but not Frama-C’s will lack ACSL specifications.

When possible, prefer passing non-preprocessed sources to Frama-C: it is easier to navigate the code, and changes do not require redoing the preprocessing step.

Mixing Frama-C and system headers

A common mistake when using Frama-C consists in mixing headers from Frama-C’s libc with those coming from the system, e.g. in /usr/include.

Mixing Frama-C and non-Frama-C headers is error-prone: as a rule of thumb, adding -I/usr/include to -cpp-extra-args only makes sense when using -no-frama-c-stdlib to exclude Frama-C’s libc. If unsure, avoid including anything in /usr and try to fix parsing issues as indicated below.

The main reason why mixing headers is discouraged is due to the presence of architecture-specific files in many system headers, namely those in the bits directory. Including a system header may entail recursive inclusion of others, leading to types being redefined in a way that may be incompatible with Frama-C’s libc.

Using configure to help parsing

configure scripts can be used to help parsing complex code bases. As an example, let us consider nginx. After downloading its sources and running ./configure, trying to parse most of its files will result in errors such as:

In file included from src/core/ngx_config.h:26:0,
                 from src/core/nginx.c:8:
src/os/unix/ngx_linux_config.h:33:10: fatal error: sys/vfs.h: No such file or directory
 #include <sys/vfs.h>            /* statfs() */
          ^`````````~

This is because, by default, Frama-C includes its own libc but not the system’s. sys/vfs.h is non-POSIX and does not exist in Frama-C’s libc.

Short-term solution: copy sys/vfs.h from your system’s libc to a directory such as ext_include, then add -I ext_include to -cpp-extra-args. Repeat for each recursively included header, until no more headers remain, or until something breaks. If the latter, remove extraneous definitions in an attempt to remedy the problem.

Long-term solution: see what the real sys/vfs.h file contains, and try to understand if (1) this can be reasonably replicated in Frama-C’s libc, or (2) extract only the necessary parts from it and add them to a stubs header. Try to keep portability in mind, using standard types and avoiding compiler-specific features.

Lazy solution: create an empty sys/vfs.h file in a directory that is in the include search path. For instance, create a ext_include directory, then a sys directory inside it, and put an empty vfs.h inside. Add -I ext_include to -cpp-extra-args and try parsing again. This is lazy in the functional programming sense: why try to parse an entire header without knowing if its definitions will actually be necessary? Wait until parsing complains about a missing type or function, and then add those as needed.

WRONG solution: add -I /usr/include to Frama-C’s -cpp-extra-args. This may work (or appear to work) in a few cases, but it is likely to lead to complex errors later.

In this specific case, sys/vfs.h is simply a synonym for sys/statfs.h, so we can copy its contents and work on the latter. sys/statfs.h, however, includes bits/statfs.h, which is a red herring: files in bits directories are architecture-specific.

Copying include files from a bits directory is rarely a good idea: their definitions are almost always architecture-specific and clash with Frama-C’s portable definitions.

For now, we assume the lazy solution, and try again. This time, we get a missing crypt.h file, which comes from the glibc. Iterating the process, the following files may appear in the list of includes: sys/pctl.h, sys/sendfile.h, sys/epoll.h, linux/capability.h, pcre.h, etc. Most of these are in fact optional and detected when running ./configure. Since ./configure takes into account the underlying system, and not the Frama-C libc, it may detect libraries which we would prefer to have disabled, to avoid having to stub them.

Running ./configure --help, we see several --without-<feature> options, such as --without-pcre, --without-poll_module, etc. Disabling them usually minimizes the amount of external, unspecified code during the analysis.

Most configure scripts allow disabling optional external libraries (e.g. using --without-<feature>); doing so is a good idea to minimize the amount of required stubbing. Alternatively, you can manually edit config.h-style macros such as HAVE_<FEATURE>, setting them to 0.

Importing type definitions

Besides missing includes, the most common case of parsing failure is the absence of type definitions. This is compounded with the fact that error messages about missing types are rarely explicit.

In our nginx example, after setting the missing headers, this is the error message we obtain:

[kernel] src/os/unix/ngx_setaffinity.h:16:
  syntax error:
  Location: line 16, between columns 8 and 18, before or at token: ngx_cpuset_t
  14    #if (NGX_HAVE_SCHED_SETAFFINITY)
  15
  16    typedef cpu_set_t  ngx_cpuset_t;
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  17
  18    #elif (NGX_HAVE_CPUSET_SETAFFINITY)

What the message actually means is that cpu_set_t is unknown. This is because this type is not defined in Frama-C’s sched.h, since it is not in POSIX.

Frama-C’s libc proritizes definitions in the C standard, and then in POSIX. Other definitions are added when considered useful enough, that is, as soon as a few case studies require them. Until then, they can be easily added to a file such as __fc_stubs.h, to be included as follows:

  1. either by explicitly adding it to the headers requiring it: #include "__fc_stubs.h";

  2. or by using GCC’s -include option: -include__fc_stubs.h.

The second option avoids touching the original sources. Also, isolating all such changes in a single file makes it easier to update it when future releases of Frama-C include more definitions, until (hopefully) the file becomes unnecessary.

Finding out which definitions to import

Finding out which definitions to import is not always easy. For instance, in the case of cpu_set_t, first we must find out where it is defined. Using rgrep, ag or similar tools, we just look for the definition in /usr/include. Here, we found it in /usr/include/bits/sched.h:

typedef struct
{
  __cpu_mask __bits[__CPU_SETSIZE / __NCPUBITS];
} cpu_set_t;

Unfortunately, now we have 3 other definitions to look for: __cpu_mask, __CPU_SETSIZE and __NCPUBITS.

Short-term solution: use gcc -E to preprocess the file and get rid of macros. Note that we cannot do it directly with bits/sched.h, but we can with sched.h, which results in:

typedef struct
{
  __cpu_mask __bits[1024 / (8 * sizeof (__cpu_mask))];
} cpu_set_t;

Long-term solution: do not preprocess macros away, since they reveal architectural dependencies and other parameters which may be useful later. In our case, __CPU_SETSIZE is actually used elsewhere in nginx, so it is better to preserve it.

In any case, we still have to find out the type of __cpu_mask.

The layers upon layers of macros and typedefs in the standard library usually require several steps before the actual type definition is found. Using the Frama-C GUI, navigation is often easier.

Short-term solution: use Frama-C and its GUI to quickly get the type: parse the system header file with -frama-c-no-stdlib, then use the Information panel in the GUI to navigate to its definition. Caveat: by default, Frama-C erases unused types, so trying to parse the header itself will result in an empty view in the GUI. Instead, create a very simple test.c file using the target type, as in:

#include <sched.h>

void main() {
    __cpu_mask cm;
}

Now parse it with frama-c-gui -no-frama-c-stdlib test.c, go to the declaration of m, and use the links in the Information panel to retrieve the type information:

Navigating through types

Important: do not forget to use the proper -machdep when doing this! Type information displayed in the GUI is relative to the chosen machdep. This is one of the reasons to prefer the long-term solution.

This procedure also works with structs and unions, where each field can be queried invididually. However, enum type constants are already unfolded in some cases, so it is not always possible to quickly navigate through them using this method.

Long-term solution: recursively search for each definition in the type, until you get to the bottom of it. This allows a better understanding of typing issues, such as architecture dependencies. In the case of __cpu_mask, we have:

  • typedef __CPU_MASK_TYPE __cpu_mask;
  • #define __CPU_MASK_TYPE __SYSCALL_ULONG_TYPE;
  • then either __UQUAD_TYPE (if on a 64-bit, ILP32 machdep) or __ULONGWORD_TYPE;
  • finally, unsigned long int.

Therefore, a reasonable definition for cpu_set_t might be:

typedef unsigned long __cpu_mask;

# define __NCPUBITS (8 * sizeof (__cpu_mask))
# define CPU_SETSIZE 1024

typedef struct {
  __cpu_mask __bits[CPU_SETSIZE / __NCPUBITS];
} cpu_set_t;

This allows parsing of ngx_setaffinity.h to advance past that point.

Using GCC options to help with parsing failures

Until Frama-C is able to successfully parse the files given in the command line, its AST is incomplete and therefore the GUI cannot be used. In this kind of situation, some GCC-specific options may help with parsing.

The first step usually consists in adding -kernel-msg-key pp to Frama-C’s command line. This will output the preprocessing command Frama-C uses when parsing each file, similar to this:

[kernel:pp]
  preprocessing with "gcc -E -C -I.  -I~/.opam/4.02.3/share/frama-c/libc -D__FRAMAC__ -D__FC_MACHDEP_X86_32 -dD -nostdinc -m32 file.c"

It is then easy to copy/paste the command in the terminal to modify it, adding/removing flags. For instance, using GCC’s -M flag, the list of files included during preprocessing is displayed as a Makefile rule. This makes it easier to identify e.g. if files from different C standard libraries were mixed together.

Conclusion

There are several techniques to help Frama-C parse realistic code bases, some consisting of “quick fixes”, others requiring more effort but providing long-term benefits in terms of stability and portability. Further changes are required according to the plug-ins to be used (e.g. function contracts for WP and Eva), but since parsing is the first step required by all plug-ins, preparing a working set is an effort that can be easily factored for other users to benefit from. This is one of the objectives of the open-source-case-studies Github repository: sharing such a configuration so that users only need to type make parse to get a ready-to-use AST, to be loaded and used with any plug-in they desire. Do not hesitate to contribute to it, either requesting or offering such parsing configurations.

André Maroneze
6th Jul 2018