Sunday, 8 April 2012

x86 code generation quality

People often claim that it is difficult to improve upon the code generated by a modern compiler using hand-written assembler. This blog post takes a look at the code generated by GCC for a simple function and finds that it implements impressive high-level optimizations but the generated code leaves a lot of room for improvement.

Consider the recursive integer Fibonacci function that may be defined in C like this:

int fib(int n) { return (n<2 ? n : fib(n-1)+fib(n-2)); }

Writing this function in assembler by hand we obtain the following 12 instructions:

fib:
 cmp eax, 2
 jl fin
 push eax
 add eax, -1
 call fib
 push eax
 mov eax, [esp+4]
 add eax, -2
 call fib
 add eax, [esp]
 add esp, 8
fin:
 ret

Benchmarking this solution we find that it takes 3.25s to compute fib(40).

With the -O2 option we find that GCC produces the following 25 assembler instructions:

_fib:
 push edi
 push esi
 push ebx
 sub esp, 16
 mov edi, DWORD [esp+32]
 cmp edi, 1
 jbe L4
 mov ebx, edi
 xor esi, esi
L3:
 lea eax, [ebx-1]
 mov DWORD [esp], eax
 call _fib
 sub ebx, 2
 add esi, eax
 cmp ebx, 1
 ja L3
 and edi, 1
L2:
 lea eax, [esi+edi]
 add esp, 16
 pop ebx
 pop esi
 pop edi
 ret
L4:
 xor esi, esi
 jmp L2

Benchmarking we find that the performance of this solution is indistinguishable from our assembler. So the compiler generated over twice as much code for no performance improvement. Moreover, our assembler solution was comprehensible whereas the output from the compiler includes a wider vocabulary of instructions (and, xor and lea) as well as a dramatic rearrangement. Specifically, the code generated by the compiler contains only a single recursive call rather than two recursive calls.

The transformation made by the compiler cleverly turned one of the two recursive calls into a loop. Disassembling the generated code by hand we can reconstruct the following algorithmically-similar C program:

int loop(int a, int n) { return (n<=1 ? a+n : loop(a+fib(n-1), n-2)); }
int fib(int n) { return loop(0, n); }

Compiling this to assembler by hand we obtain the following 15-instruction program:

fib3:
 push ebx
 push ecx
 mov ebx, 0
 mov ecx, eax
 jmp loop2

loop1:
 lea eax, [ecx-1]
 call fib
 add ebx, eax
 sub ecx, 2
loop2:
 cmp ecx, 1
 jg loop1

 lea eax, [ebx+ecx]
 pop ecx
 pop ebx
 ret

Note that we have made one non-trivial rearrangement to create this assembler. Specifically, the loop containing a branch has been rotated so that the branch is the loop and, therefore, the entry point to the loop is now in the middle of the code. This removes one of the two branches from a naive translation.

Whereas our previous assembler and the C code both took 3.25s to compute fib(40), this new assembler takes just 2.52s. These results imply that GCC is applying some productive high-level optimizations but their benefit is lost in poor code generation. Concretely, the compiler successfully replaced one call with a jmp and, consequently, removed some stack operations but the remaining code was too inefficient for this to pay off.

Interestingly, applying the same optimization to equivalent F# code reduces the running time from 4.44s to 2.64s and in OCaml from 3.21s to 2.77s. Here is the resulting code:

let rec loop a n =
  if n<2 then a+n else
    loop (a + fib(n-1)) (n-2)
and fib n = loop 0 n

Compiling with gcc -O3 produces far more code but the running time is reduced down to 2.01s. The most productive optimization in the case of this function appears to be to unroll the recursive calls once and simplify, in particular reusing the result of the two calls to fib(n-3), resulting in this program:

let rec fib = function
  | 0 | 1 as n -> n
  | 2 -> 1
  | 3 -> 2
  | n -> fib(n-2) + 2*fib(n-3) + fib(n-4)

This reduces the running time to just 0.72s in F# and 0.39s in OCaml.

7 comments:

saynte said...

Your assembly isn't comparable to GCC's as it (appears) to be generating code that uses the cdecl calling convention, as it fetches the arguments from the stack.

I think this would introduce at least one or two extra instructions in your assembly, and give a better comparison (C-compliant code to C-compliant code).

Flying Frog Consultancy Ltd. said...

This problem does not require any specific calling convention.

saynte said...

No the problem doesn't, but you wrote one solution in C (thus using a particular calling convention) then asked GCC to compile it; C does require a specific calling convention.

Since you're hand-coding your solution you can throw away that requirement, but GCC can't take such liberties.

Jules said...

GCC chould compile a top level wrapper around fib that uses cdecl, and use a better one to do the recursion.

Flying Frog Consultancy Ltd. said...

@Saynte: As Jules said, there is nothing to stop GCC from taking such liberties.

saynte said...

@Jules, you're right, I suppose they could do that. However, these issues are not the ones brought up in the article, which is little more than: Baha, look at GCC generate longer code with weird instructions.

@Flying Frog, or you could just tell it to pass by register (fastcall convention). It just doesn't give much confidence in your method to leave such fundamental details (ie, one program is written in C) out in the wind. If you're going to compare the quality of assembly that GCC generates, why not compare it to other C compilers? Why compare it to hand-rolled assmebly, F#, and OCaml? You could have tried it with Clang, although on my machine it does worse than GCC.

Samuel A. Falvo II said...

He has a point. The extra time taken by gcc's code could be due to cache effects (line spills/fetches), so replicating calling conventions is one detail that should be replicated just to eliminate the possibility.

That being said, I wish I could've seen your version of F#'s generated assembly. I suspect it would have yielded valuable insights.

I'm a big fan of hand-coded assembly over generated code, but today's compilers usually gets things right, or "right enough."