OCaml's option -compact can optimize for code size and speed
Pascal Cuoq - 11th Oct 2013The OCaml compiler can generate native code for various architectures. One of the few code generation options is named -compact
:
$ ocamlopt -help Usage: ocamlopt <options> <files> Options are: ... -compact Optimize code size rather than speed ...
One of the effects, perhaps the only effect(?), of this option is on the code generated for allocation. In OCaml, a small block typically gets allocated from the minor heap. When all goes well, allocating in the minor heap means simply decrementing the pointer that separates the already-allocated top of the minor heap from the still-available bottom of the minor heap. Eventually the minor heap becomes full; what happens then is complicated. But the fast path is basically a subtraction and a conditional branch (usually not taken) to the complicated stuff.
With option -compact
, an allocation simply calls a routine that does all the above. Without it, the subtraction and conditional jump are inlined at the site where the allocation is requested. OCaml being a typical functional language, there are plenty of these. Inlining a couple of instructions makes the code bigger, but when execution stays on the fast path, which is often, a routine call is avoided completely.
A small example
Consider the one-line OCaml function let f i = Some i
. This function takes a value i
and returns a Some
memory cell that points to i
. I intended i
to be an integer, but I inadvertently wrote code that works for any type of i
. This was possible because in OCaml, all types of values share a uniform representation, as discussed in the introduction of a previous post.
A Some
cell occupies 16 bytes (yes this is huge. Blame 64-bit computing). The fast code generated by ocamlopt -S t.ml
where L102 is the label at which selcom-called code manages the case when the minor heap is full:
... subq $16 %r15 movq _caml_young_limit@GOTPCREL(%rip) %rax cmpq (%rax) %r15 jb L102 ...
The -compact
option causes more compact code to be emitted for the allocation. When our function f
is compiled with ocamlopt -S -compact t.ml
the sequence generated for the allocation is much shorter indeed. There is even a specialized allocation function for allocating 16-byte blocs so no arguments need to be set up:
... call _caml_alloc1 ...
Effects of option -compact in the large
When the development version of Frama-C is compiled with option -compact
the binary takes 14.4MB.
Without option -compact the Frama-C binary weights in at 15.4MB.
One additional megabyte of code is not going to hurt us at this point right? It is not as if the Frama-C binary had to fit exactly in a box of ten floppy disks. As long as the larger code is faster it is worth it…
$ time bin/toplevel.lean_and_mean tests/test/adpcm.c -sparecode > /dev/null user 0m0.843s $ time bin/toplevel.big_and_slow tests/test/adpcm.c -sparecode > /dev/null user 0m0.856s
… but it turns out that the larger code isn't faster. In the case of Frama-C and of the computer I am typing on at least the code generated with option -compact
is smaller and faster.
Conclusion
It is not easy to try to predict performance on modern out-of-order architectures but if I had to hazard an explanation I would say that besides the obvious cache concerns having many scattered predictable conditional branches may be a losing proposition compared to one single conditional branch visited from many places with call
and ret
. Every conditional branch that isn't in the branch prediction buffer is at risk of being mispredicted. In addition each entry taken by the branches that have recently been visited is an entry that isn't available for branches translated from conditionals in the source code.
In contrast the execution of the call
and ret
instructions has been optimized over the years. The target of the ret
instruction is predicted with a Return Stack Buffer that has a conceptually simpler job than the Branch Predictor (although it is clearly possible to write code for which the ret
instructions cause a stall which could re-establish an advantage for the partially inlined version).
The conclusion is that everyone who uses OCaml should try the native compiler's option -compact
and that perhaps the default should be for this option to be on.