General tips for debugging preprocessing and parsing issues
André Maroneze - 18th Sep 2024Motivated by some recent discussions concerning a hard-to-parse code base, this post presents a few techniques used by Frama-C developers to quickly understand and debug parsing-related errors. We are constantly improving parsing and error messages, so hopefully some of these tips will become unnecessary in the future.
Origin of parsing issues
Many parsing issues in Frama-C come from compiler extensions; GCC being the most commonly-used compiler on Linux, and Linux being the most common operating system among Frama-C users, lots of code that people want to analyze with Frama-C relies on non-portable GCC-specific mechanisms: glibc-based code, compiler intrinsics. Some issues come from additions to C99 (_Alignof
, new C23 types, etc) or non-standard types (e.g. BSD-based code). Finally, a few of them are actually parsing issues deserve to be fixed in Frama-C. Distinguishing between them allows one to more quickly solve them, with or without the help of the Frama-C team.
Note that the Frama-C AST construction comprises several passes:
- Per-file preprocessing (usually done by GCC);
- Per-file logic preprocessing (necessary for some ACSL annotations);
- Actual parsing, typing and per-file AST construction (including lexing, parsing, elaboration, typing); this includes the infamously large Cabs2cil module (which converts from an abstract AST, Cabs, to the C Intermediate Language (CIL));
- “Symbolic linking”, also known as mergecil, which combines the types and definitions from several source files into a single, unified AST.
This is transparent to users most of the time, except when things stop working. In such cases, you may have to take a look under the hood of Frama-C’s kernel.
Case study: embedul.ar
This post was motivated by some difficulties found while parsing embedul.ar. Some examples will be extracted from its repository, in particular commit 5ff7fa7f, if you want to retry them.
Note: It is very likely that the results presented below will no longer produce the same result, due to changes in Frama-C and/or embedul.ar. They are only meant as a general and very pragmatic guide, but the recommendations are applicable in many scenarios.
First attempt: running Frama-C
The embedul.ar project has a nice frama-c
Makefile target. It does require an existing compile_commands.json
. The easiest way to obtain one is to use Bear (Build EAR):
bear -- make
This will compile the code (assuming you have the required libraries) and produce a compile_commands.json
file. You can then run:
make frama-c
And it will (hopefully) parse the code and run Eva.
Some warnings such as this one may appear:
typedef] /usr/include/SDL2/SDL_stdinc.h:550: Warning:
[kernel:parser:unnamed-typedef that does not introduce a struct or enumeration type Ignoring unnamed
Looking at the source code, its origin is unclear:
549 switch (dwords % 4) {
550 case 0: do { *_p++ = _val; SDL_FALLTHROUGH;
551 case 3: *_p++ = _val; SDL_FALLTHROUGH;
552 case 2: *_p++ = _val; SDL_FALLTHROUGH;
553 case 1: *_p++ = _val;
554 } while ( --_n );
555 }
There is nothing seeming to indicate an unnamed typedef.
However, there is a strong clue that the issue might be related to macros: SDL_FALLTHROUGH
, in uppercase letters, is likely a preprocessor macro. And since Frama-C delegates preprocessing to external tools (such as GCC), its error messages cannot be as precise as those from the compiler.
Therefore, we must try other means to obtain more information about what is going on here.
Tip #1: use -kernel-msg-key pp
-kernel-msg-key pp
activates some preprocessing-related messages. It also prevents erasing some temporary files produced during Frama-C parsing. These temporary files can help understand why things go wrong. Note that not all intermediate files are kept.
The message we are looking for here is the full preprocessing command used by Frama-C. With it, we can re-run preprocessing using GCC, then feed the preprocessed file to Frama-C, and hopefully obtain a more precise warning.
In embedul.ar, the command running Frama-C is in makefiles/build.mk
. We just need to add our option somewhere in the command line.
Note: if the Frama-C command line contains one or more -then
(or its variants, such as -then-last
), you need to add the option before them.
This will add a lot of messages, several concerning the JSON Compilation Database. We recommend redirecting stdout
to a file so you can grep it later.
We then scroll through the messages down to the first occurrence of the previous warning. The useful lines are just before it:
...
[kernel] Parsing /embedul.ar/source/arch/native/sdl/boot/board_hosted.c (with preprocessing)
[kernel:pp]
Full preprocessing command: cd /embedul.ar && gcc -E -C -I. -I/__fc_machdep560779.dir (... omitted for brevity ...) '/embedul.ar/source/arch/native/sdl/boot/board_hosted.c' -o '/board_hosted.c78cf06.i'
[kernel:pp:logic]
logic preprocessing with "gcc -E -C -I. -Wno-builtin-macro-redefined -Wno-unknown-warning-option '/tmp/ppannot9679ca.c' -o '/tmp/cppannot61d718.c'"
[kernel:parser:unnamed-typedef] /usr/include/SDL2/SDL_stdinc.h:550: Warning:
Ignoring unnamed typedef that does not introduce a struct or enumeration type
There is a lot to unpack here:
- Frama-C was parsing the
board_hosted.c
file when the warning was emitted; - The kernel then proceeded (with the help of the JSON Compilation Database) to call GCC for preprocessing;
- The kernel then called the preprocessor again, but this time for logic annotations (
[kernel:pp:logic]
); The warning happened after these steps.
We will manually re-run the preprocessing, but save the result in a file we can easily inspect and modify. By copying the (enormous) command line and adding -o pp.ci
at the end, we’ll get a pp.ci
file with expanded macros.
Tip #2: use .ci
for partially preprocessed files
The “correct” extension for a preprocessed file, in C, is .i
. This is a well-known fact and most compilers know that: when you give them a .i
file, they skip preprocessing, ignoring any line starting with a #
.
However, Frama-C has TWO preprocessing steps: one for the code, another for the logic, that is, ACSL annotations.
If you named your file in the previous step as pp.i
, and tried to run Frama-C, you would obtain something like this:
[kernel] Parsing pp.i (no preprocessing)
[kernel:annot-error] FRAMAC_SHARE/libc/__fc_gcc_builtins.h:213: Warning:
unbound logic variable __CHAR_BIT. Ignoring specification of function __builtin_clz
[kernel] User Error: warning annot-error treated as fatal error.
[kernel] Frama-C aborted: invalid user input.
Note the no preprocessing message in the first line.
Then, we have an annot-error
message about __CHAR_BIT
. Where does that come from? If you look at the source code, you will see:
/*@
requires x_nonzero: x != 0;
assigns \result \from indirect:x;
ensures result_is_bit_count: 0 <= \result < __CHAR_BIT * sizeof(x);
*/
int __builtin_clz(unsigned int x);
This is a specification from Frama-C’s standard library (a GCC builtin, in this case) that contains a reference to a preprocessor symbol, __CHAR_BIT
. This logic annotation needs to be preprocessed, but the .i
extension told Frama-C to skip it altogether.
Therefore, Frama-C recognizes the .ci
extension, which means “partially preprocessed source”. If you rename pp.i
as pp.ci
and pass it to Frama-C, it will preprocess its annotations, without any errors. It will display:
[kernel] Parsing pp.ci (external front-end)
(The “external front-end” is a misnomer due to reuse of a kernel feature, but here it simply means “preprocessing logic”.)
Note that you could also just keep the .c
name for the file. Frama-C will redo some preprocessing, and emit some warnings, but overall the results should be similar. For instance, you may have:
<built-in>: warning: "__STDC__" redefined
<built-in>: note: this is the location of the previous definition
Which means that Frama-C already had its definition for __STDC__
, but when preprocessing the .c
file it encountered it again. Still, in this case both approaches allow us to get the warning that we were trying to reproduce:
[kernel:parser:unnamed-typedef] /usr/include/SDL2/SDL_stdinc.h:550: Warning:
Ignoring unnamed typedef that does not introduce a struct or enumeration type
But this time, we have a self-contained, preprocessed file, so we can hack it to get the exact warning.
Tip #3: remove line directives from preprocessed files when debugging
Note that the location mentioned in the previous warning refers to a file in /usr/include
, even though we just parsed a local file without any #include
directives. This is due to the presence of line directives in the source. If we remove them, we will have the actual line in our preprocessed source:
sed '/^# /d' pp.ci > pp_clean.c $
Note the .c
extension here; ideally, we would like to use .ci
to avoid the redefinition warnings as before, but there is a minor caveat: when we do so, the warning messages refer to the temporary file created by Frama-C, not to our source file. So we keep the .c
here and ignore the redefinition warnings. They now reveal the true source of the issue:
[kernel:parser:unnamed-typedef] pp_clean.c:19526: Warning:
Ignoring unnamed typedef that does not introduce a struct or enumeration type
And this time, when we look at the source code, we have the code without any macros:
switch (dwords % 4) {
case 0: do { *_p++ = _val; __attribute__((__fallthrough__));
case 3: *_p++ = _val; __attribute__((__fallthrough__));
case 2: *_p++ = _val; __attribute__((__fallthrough__));
case 1: *_p++ = _val;
while ( --_n );
} }
Now we can confirm it: the warning is completely misplaced. We should report it as a bug, ideally in a minimal, self-contained, reproducible code.
Tip #4: if possible, minimize the code, either by hand, or with C-Reduce
In this case, it is not hard to extract the relevant parts manually; we simply need the definition for type Uint32_t
, and some standard headers. Because we already have preprocessed code, some simple backward greps are enough to amass the relevant definitions:
// repro.c (or repro.i)
typedef unsigned long size_t;
typedef unsigned int uint32_t;
typedef uint32_t Uint32;
static __inline__ void SDL_memset4(void *dst, Uint32 val, size_t dwords) {
__attribute__((always_inline)) size_t _n = (dwords + 3) / 4;
Uint32 *_p = ((Uint32 *)(dst));
Uint32 _val = (val);if (dwords == 0)
return;
switch (dwords % 4) {
case 0: do { *_p++ = _val; __attribute__((__fallthrough__));
case 3: *_p++ = _val; __attribute__((__fallthrough__));
case 2: *_p++ = _val; __attribute__((__fallthrough__));
case 1: *_p++ = _val;
while ( --_n );
}
} }
If we are feeling fancy (or want to practice C-Reduce), we can even try some C-Reduce magic to do the minimization for us (requires the creduce
or cvise
tools installed by your own means):
$ frama-c-script creduce repro.c
(...)
Which kind of error are you reducing for?
1. Fatal crash (error message with stacktrace)
2. Non-fatal error
Please enter 1 or 2: 2
Script copied. Please edit ./script_for_creduce.sh and re-run this script.
Now we open the newly-created ./script_for_creduce.sh
file, find the line INSERT CONDITION(S) RELATED TO FRAMA-C HERE
and, following the examples given a few lines before it, add some lines:
grep -q "Ignoring unnamed typedef" "$fc_out"
test $fc_retcode -eq 0
The first line ensures that the reduced program will always contain the problematic message. The second line ensures that Frama-C will parse the file without errors (otherwise C-Reduce tends to wreak havoc during its minimization).
Now we simply re-run frama-c-script creduce repro.c
, and wait while it does its magic. In about a minute or so, it finishes with something like this:
void f() { __attribute__(()); }
Which confirms that (1) Frama-C is emitting a meaningless warning about an attribute in a statement (somehow relating it to a typedef), and (2) this warning can be safely ignored in the original source code.
Conclusion
Frama-C is not exempt of parsing issues, especially “benign” ones such as this. Non-standard code, especially when coming from system libraries, is full of compiler extensions such as __attribute__
. C23 is finally introducing them with a [[attribute]]
syntax, but there will still be much legacy code for years to come. Understanding a bit more about how Frama-C processes such code can help quickly diagnose and even report issues when they arise. With a few kernel options, and some knowledge about the underlying process, parsing becomes much more predictable and easier to understand. The complexity introduced by a preprocessing step is one of the reasons most modern languages use different approaches nowadays.
Note: the specific issue of handling statement attributes, mentioned in this post, has already been taken care of in the development version of Frama-C; but similar issues can be investigated and reported using the tips mentioned in this post.