Tristan Penman's Blog

Tail Call Optimisation and C / C++

08 December 2018

At a recent Melbourne C++ meetup, the topic of tail call optimisation was raised. Tail call optimisation (or TCO) is a fascinating optimisation technique that is commonly discussed in the functional programming community, because of its importance in the implementation of efficient recursive algorithms. However, it is often overlooked when discussing other programming languages. This post gives a brief overview of TCO, and looks at its use in C and C++ compilers.

So, what is tail call optimisation?

Call stack

A good place to start in understanding TCO is to look at the call stack.

Most of us have probably used a debugger at some point, and used a call stack to trace the execution of a program. In a call stack, each function call is represented by a stack frame, which contains the arguments that were passed to the function, and typically space for its local variables.

And let’s not forget a crucial element - a stack frame will also contain a return address. This is the address of the instruction that the CPU should jump to, when the function returns.

So when one function calls another, it allocates space for a new stack frame, pushes a return address and function arguments onto the stack, then jumps to the first instruction in the function that is being called. When that function returns, it pops all of these values off the stack, and jumps to the return address, allowing the first function to resume execution.

The exact details will depend on CPU architecture, calling conventions and the like, but that should give you the big picture.
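
As a rough illustration, consider the pair of functions below. This is only a sketch of the idea, since the exact frame layout depends on the calling convention, and arguments are often passed in registers rather than on the stack:

int add_one(int x) {
  // While this function runs, the call stack conceptually holds:
  //   - main's frame (its local variables)
  //   - the return address pointing back into main
  //   - add_one's frame (the argument x)
  return x + 1;
}

int main() {
  int result = add_one(41);  // a frame for add_one is created here
  return result - 42;        // by now, add_one's frame has been popped
}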

Stack overflows

One of the issues that can arise with this approach is that if the call stack becomes too deep, you might get a stack overflow. This is when there is not enough space on the stack to accommodate a new stack frame.

Normally, this is not a problem, as stack space is not at a premium like it used to be. However, it can become problematic when you write recursive algorithms. A recursive function is one that calls itself to solve a problem, reducing or altering its arguments until it eventually reaches a base case. A simple example is this function for computing factorials:

int factorial(int n) {
  if (n > 1) {
    // Recursive case
    return n * factorial(n - 1);
  } else {
    // Base case
    return 1;
  }
}

We can imagine a call stack for factorial(5):

- factorial(5)
 - factorial(4)
  - factorial(3)
   - factorial(2)
    - factorial(1)

The problem here is that the last step in the function, a multiplication by n, cannot complete until the recursive function call returns.

So the call to factorial(5) cannot return until factorial(4) returns, which in turn cannot return until factorial(3) returns, and so on. This can easily grow out of control for larger values of n.
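
To make the pending work explicit, the evaluation of factorial(5) looks something like this:

factorial(5)
  = 5 * factorial(4)
  = 5 * (4 * factorial(3))
  = 5 * (4 * (3 * factorial(2)))
  = 5 * (4 * (3 * (2 * factorial(1))))
  = 5 * (4 * (3 * (2 * 1)))
  = 120

Every one of those multiplications is deferred, so every stack frame has to stay alive until the innermost call returns.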

TCO

Tail call optimisation is a strategy for avoiding this, where a compiler can optimise out the creation of a new stack frame, provided that the recursive call is the last thing the function does before returning (a 'tail call').

Let’s re-write the function to make this possible:

int factorial(int accumulator, int value) {
  if (value < 1) {
    // Base case
    return accumulator;
  }

  return factorial(accumulator * value, value - 1);
}

The trick here is that we've re-written the algorithm so that the last step is no longer a multiplication, but the recursive function call itself.
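
One consequence of this rewrite is that callers now need to seed the accumulator themselves:

// Callers are responsible for the initial accumulator value
int result = factorial(1, 5);   // result is 120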

If we think about this at a slightly lower level (in a weird kind of low-level pseudo-code) we end up with something like this:

  • pop accumulator and value off the stack
  • assign accumulator to register AX
  • assign value to register CX
  • if CX < 1
    • pop return address off the stack
    • push register AX onto the stack
    • jump to return address (recursive function call reached the base case)
  • else
    • assign result of AX * CX to register AX
    • assign result of CX - 1 to register CX
    • allocate space for a new stack frame
    • push return address onto the stack
    • push AX and CX onto the stack
    • jump back to the top of the function (starting the recursive function call)
    • pop return value off the stack (this is the next instruction after each recursive function call returns)
    • pop return address off the stack
    • push return value back onto the stack
    • jump to return address

Those last four steps are unwinding the stack, but this is totally unnecessary when returning from recursive function calls. The top of the stack already contains the final return value, so we could skip unwinding the stack altogether.

And at this point, we have to ask… why use the stack at all? We could write this using a loop instead:

  • pop accumulator and value off the stack
  • assign accumulator to register AX
  • assign value to register CX
  • loop until CX < 1:
    • assign result of AX * CX to register AX
    • assign result of CX - 1 to register CX
    • jump back to the beginning of the loop
  • pop return address off the stack
  • push value of AX onto the stack
  • jump to return address

This is TCO in a nutshell - take a recursive algorithm, and restructure it as a loop that reuses a single stack frame.
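
Expressed back in C, the transformation is roughly equivalent to replacing the tail call with a jump back to the top of the function. This is only a sketch of the effect; it is not the code the compiler literally generates:

int factorial(int accumulator, int value) {
top:
  if (value < 1) {
    // Base case
    return accumulator;
  }

  // The tail call factorial(accumulator * value, value - 1) becomes
  // two assignments and a jump, reusing the current stack frame
  accumulator = accumulator * value;
  value = value - 1;
  goto top;
}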

TCO in C and C++

As a general rule, the C and C++ standards do not prescribe any particular behaviour with regard to optimisations (including TCO). However, we can show that tail call optimisation has been implemented in at least one compiler. I say ‘at least one’ because the analysis below was performed using gcc 8.2. Analysis with other compilers has been left as an exercise for the reader.

Using Compiler Explorer, we can see the instructions that are generated when we compile the TCO-friendly version of our factorial function:

factorial(int, int):
  push rbp
  mov rbp, rsp
  sub rsp, 16
  mov DWORD PTR [rbp-4], edi
  mov DWORD PTR [rbp-8], esi
  cmp DWORD PTR [rbp-8], 0
  jg .L2
  mov eax, DWORD PTR [rbp-4]
  jmp .L3
.L2:
  mov eax, DWORD PTR [rbp-8]
  lea edx, [rax-1]
  mov eax, DWORD PTR [rbp-4]
  imul eax, DWORD PTR [rbp-8]
  mov esi, edx
  mov edi, eax
  call factorial(int, int)
  nop
.L3:
  leave
  ret

This is compiled using gcc 8.2, with optimisation disabled (-O0). The key thing to notice here is the call factorial(int, int) line. This is the recursive function call we want to avoid.
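
(If you would like to reproduce this locally rather than in Compiler Explorer, compiling with g++ -S -O0 -masm=intel factorial.cpp should produce similar output, although the exact listing may vary between compiler versions.)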

Recompiling with optimisations enabled (-O1) gives us:

factorial(int, int):
  mov eax, edi
  test esi, esi
  jle .L5
  sub rsp, 8
  mov eax, esi
  lea esi, [rsi-1]
  imul edi, eax
  call factorial(int, int)
  add rsp, 8
  ret
.L5:
  ret

Shorter, but still recursive. Let’s try -O2:

factorial(int, int):
  mov eax, edi
  test esi, esi
  jle .L5
.L2:
  imul eax, esi
  sub esi, 1
  jne .L2
.L5:
  ret

We have TCO!

Good thing we wrote our function in a TCO-friendly style.

Hold up…

Now, an important question to ask is whether or not it was necessary to re-write the function to make this optimisation possible. If we inspect the original, non-TCO-friendly version of the factorial function, compiled with -O2, this is what we get:

factorial(int):
  mov eax, 1
  cmp edi, 1
  jle .L4
.L3:
  imul eax, edi
  sub edi, 1
  cmp edi, 1
  jne .L3
  ret
.L4:
  ret

Well, we see that TCO has been applied here too. The loop contains one additional instruction (a cmp, because this version counts down to 1 rather than 0), but otherwise the code is essentially the same. And we have the benefit of not having to supply an initial value for the accumulator.

Because we can…

Finally, I thought it would be interesting to look at the instructions that are generated for a purely iterative version of factorial, altered to use a while loop:

int factorial(int accumulator, int count) {
    while (count > 1) {
        accumulator *= count;
        count--;
    }
    return accumulator;
}

For the benefit of this comparison, this version includes the accumulator argument.

This is what we get with -O2:

factorial(int, int):
  mov eax, edi
  cmp esi, 1
  jle .L2
.L3:
  imul eax, esi
  sub esi, 1
  cmp esi, 1
  jne .L3
.L2:
  ret

This is virtually identical to the TCO-friendly version of our function.

Conclusion

The takeaway from this is that unless you have evidence that a function like this is a hotspot in your program, thinking about optimisations such as TCO is probably not the best use of your time!

We can just be thankful that most of the time, the compiler takes care of all of this for us.