Summary of a 2010 research article

Pascal Cuoq - 27th Sep 2011

A number of colleagues and I have been tasked with writing a summary of an article we published in 2010 — or rather, each of us has to do this for an article of their own. I chose the article whose long-version title is "A Mergeable Interval Map" (link; PDF available in English starting at page 10). The article describes a data structure for associating values to intervals. I like this data structure because it is uncomplicated and because its necessity was made apparent by the implementation of previous ideas. The article, on the other hand, may not do as good a job as it should of explaining the constraints that led us to invent a new data structure. This is an opportunity to try to improve that part.

Without further information concerning the intended audience, I decided to assume that the reader is vaguely familiar with the low-level aspects of C (he does not need to be a C programmer). It should also help if he is vaguely familiar with, say, one modern programming language (Perl, Python, PHP, C#, or OCaml).

A new data structure for static analysis

When working on an analyzer for low-level embedded C whose selling point is that it automatically and correctly analyzes relatively large programs relatively precisely, you will inevitably encounter a C program that does something like this:

union U {
  char c[4];
  int *p;
};

The analyzer does not necessarily have to handle type-punning through the above union very precisely (that would, for one thing, require hypotheses on the target platform's endianness and pointer size), but it should handle it correctly (or "soundly", as some prefer to say). If the union is used to manipulate the bytes of a pointer, and the pointer is copied somewhere else as a consequence, the analyzer should be aware that the member c of the union, from which the program reads, overlaps with member p, through which an address was previously written.

union U u1, u2;
int i;

int main(void) {
  u1.p = &i;
  for (i = 0; i < 4; i++)
    u2.c[i] = u1.c[i];
  *u2.p = 7;
  return i;
}

The program above copies the address of variable i byte by byte from u1 to u2, and finally uses the copied address to surreptitiously write 7 into i. An analyzer may lose track a little of what is happening here and still remain correct. It may say that i may contain any integer at the end of the program (and ask the programmer to make sure that he is accessing a valid address with *u2.p). This is what frama-c -val does. Alternatively, a correct analyzer may precisely follow what is happening in this program and conclude that the program always returns 7 (this is what frama-c -val -slevel 10 does). But a correct analyzer should never say that the program returns 4, since this is not what happens in an actual run.

To return to the initial topic: in order to be correct for the above program, the value analysis must understand that members c and p of union u1 overlap. The way it does this is by representing the memory range reserved for a variable (in this example u1, u2, or i) as a map from intervals to values. After the first instruction in main(), the memory for u1 can be represented as a map from interval 0..31 to {{ &i }}, assuming that we use the bit as the basic memory unit and that a pointer occupies 32 bits in memory.

The map is a very useful data structure, natively provided in many modern programming languages, where it may also be called a "dictionary" or "associative array". A map is a collection of bindings between keys and values. Bindings can be added and removed, and the map can be queried for the value associated to a given key. Typical implementations, such as those used in Perl or Python, rely on hash tables, an efficient representation that supports all these operations.

It was not possible to use hash tables for the interval-indexed maps needed by the value analysis; one of the reasons is that the basic set of operations does not include "find the bindings whose key (an interval) intersects the key (interval) I have". Instead, we used a binary search tree (with intervals as keys). Compared to hash tables, binary search trees allow looking in the neighborhood of a key. For instance, when the analyzer looks for the interval 0..7 in the tree that contains a binding for 0..31 (in the first iteration of the for loop in our example), it does not find it, but it finds that the interval 0..7 is contained in 0..31 and that therefore the binding 0..31 : {{ &i }} is pertinent. This is how the value analysis detects that the address of i that was previously written in memory through u1.p is being copied to u2 in the for loop.

One general consideration with binary search trees is that some sort of heuristic is needed to keep them balanced. A balanced tree has the property that at each node the number of bindings on the left-hand side is about the same as the number of bindings on the right-hand side. Looking up keys (intervals) in a binary search tree is fast for all keys only if the tree is balanced. The initial implementation of the value analysis used AVL trees as provided in OCaml's standard library. AVL trees are self-balancing: when the tree changes existing bindings may be moved about as needed to keep lookups efficient.

Implementation and experimentation revealed that there was another desirable property the ideal data structure should have for our use case.

int t[16]; 
... 
if (... complex condition...) 
  t[0] = 3; 
else 
  t[1] = 4; 

In this new example, with AVL trees and their automatic re-balancing operations, the map representing t may end up balanced differently when propagated through the "then" and the "else" branches of the program. Say that the initial map is represented as below:

[Figure: initial balanced tree representing t]

In the memory state after the "then" branch the contents at index 0 are updated according to the assignment in that branch:

[Figure: tree after the "then" branch, with the contents at index 0 updated]

In the memory state after the "else" branch, the contents at index 1 are updated. Because of automatic re-balancing operations happening along the way, the resulting tree may look like this:

[Figure: differently balanced tree after the "else" branch, with the contents at index 1 updated]

Next, two operations that are very common in static analysis take place. The first is checking whether one memory state is included in another, so that only one of them needs to be kept. Here, neither of the states from the "then" and "else" branches contains the other. The following operation is to build the union of the two memory states. The result may be represented by the tree below.

[Figure: tree representing the union of the two memory states]

With AVL trees, it is normal for the tree to be re-balanced automatically when operated on. But both the inclusion and the union operations would be much easier if the trees coming from the "then" branch and the "else" branch were always balanced identically. If they were both balanced in the same way the initial state was, it would be possible to recognize at a glance that the subtree for indices 2, 3, 4 is the same in the states coming from both branches, and to re-use it as a subtree of the result.

Our contribution in the article "A Mergeable Interval Map" is a new kind of binary search tree that uses intervals as keys and that remains balanced through insertions and deletions, but with the additional property that trees to which different operations have been applied remain balanced the same way. We call this property "mergeability", and it makes inclusion and union easier to compute. This means that a static analyzer using this representation is faster and consumes less memory than one using a less adequate representation.
