Tail Call Optimisation and C / C++
At a recent Melbourne C++ meetup, the topic of tail call optimisation was raised. Tail call optimisation (or TCO) is a fascinating optimisation technique that is commonly discussed in the functional programming community because of its importance in the implementation of efficient recursive algorithms. However, it is often overlooked when discussing other programming languages. This post gives a brief overview of TCO, and looks at its use in C and C++ compilers.
So, what is tail call optimisation?
Call stack
A good place to start in understanding TCO is to look at the call stack.
Most of us have probably used a debugger at some point, and used a call stack to trace the execution of a program. In a call stack, each function call is represented by a stack frame, which consists of the arguments that have been applied to a function, and typically space for local variables.
And let’s not forget a crucial element - a stack frame will also contain a return address. This is the address of the instruction that the CPU should jump to when the function returns.
So when one function calls another, it allocates space for a new stack frame, pushes a return address and function arguments onto the stack, then jumps to the first instruction in the function that is being called. When that function returns, it pops all of these values off the stack, and jumps to the return address, allowing the first function to resume execution.
The exact details will depend on CPU architecture, calling conventions and the like, but that should give you the big picture.
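As a minimal sketch, consider a trivial call; the comments describe what conceptually happens to the stack (the exact layout will vary with the calling convention and the compiler):

// A minimal sketch: when main calls add, a new frame (conceptually) holds
// add's arguments, its local variable, and the return address back into main.
int add(int a, int b) {
    int sum = a + b;   // local variable lives in add's stack frame
    return sum;        // the frame is popped, and the CPU jumps to the saved return address
}

int main() {
    int result = add(2, 3);  // arguments and return address are pushed, then we jump to add
    return result;
}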
Stack overflows
One of the issues that can arise with this approach is that if the call stack becomes too deep, you might get a stack overflow. This is when there is not enough space on the stack to accommodate a new stack frame.
Normally, this is not a problem, as stack space is not at a premium like it used to be. However, it can become problematic when you write recursive algorithms. A recursive algorithm is one in which a function calls itself, reducing or altering its arguments until it eventually reaches a base case. A simple example is this function for computing factorials:
int factorial(int n) {
    if (n > 1) {
        // Recursive case
        return n * factorial(n - 1);
    } else {
        // Base case
        return 1;
    }
}
We can imagine a call stack for factorial(5):
- factorial(5)
- factorial(4)
- factorial(3)
- factorial(2)
- factorial(1)
The problem here is that the last step in the function, a multiplication by n, cannot complete until the recursive function call returns. So the call to factorial(5) cannot return until factorial(4) returns, which in turn cannot return until factorial(3), and so on. This can easily grow out of control for larger values of n.
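As a concrete (if contrived) illustration, consider a hypothetical sum_to function: no call can return until the call below it does, so one frame stays alive for every level of recursion.

// Hypothetical illustration: each call must wait for the inner call to
// return before it can add n, so a frame is kept alive for every level.
long sum_to(long n) {
    if (n == 0) {
        return 0;
    }
    return n + sum_to(n - 1);  // not a tail call: the addition happens after the call returns
}

// Without optimisation (-O0), something like sum_to(10000000) is likely
// to overflow the stack long before it finishes.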
TCO
Tail call optimisation is a strategy for avoiding this, where a compiler can optimise out the creation of a new stack frame, provided that the recursive function call is the last expression in a function.
Let’s re-write the function to make this possible:
int factorial(int accumulator, int value) {
    if (value < 1) {
        // Base case
        return accumulator;
    }
    return factorial(accumulator * value, value - 1);
}
The trick here is that we’ve rewritten the algorithm so that the last step is no longer a multiplication, but the recursive function call itself.
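As a quick usage sketch (assuming the two-argument factorial above, with the accumulator seeded to 1), the partial product is carried along in the accumulator argument:

// Usage sketch: seed the accumulator with 1; each step folds one factor
// into it before making the tail call:
//   factorial(1, 5) -> factorial(5, 4) -> factorial(20, 3)
//                   -> factorial(60, 2) -> factorial(120, 1)
//                   -> factorial(120, 0) -> 120
int main() {
    return factorial(1, 5) == 120 ? 0 : 1;  // exits with 0 if the trace above holds
}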
If we think about this at a slightly lower level (in a weird kind of low-level pseudo-code) we end up with something like this:
- pop accumulator and value off the stack
- assign accumulator to register AX
- assign value to register CX
- if CX < 1
  - pop return address off the stack
  - push register AX onto the stack
  - jump to return address (recursive function call reached the base case)
- else
  - assign result of AX * CX to register AX
  - assign result of CX - 1 to register CX
  - allocate space for a new stack frame
  - push return address onto the stack
  - push AX and CX onto the stack
  - jump back to the top of the function (starting the recursive function call)
  - pop return value off the stack (this is the next instruction after each recursive function call returns)
  - pop return address off the stack
  - push return value back onto the stack
  - jump to return address
Those last four steps are unwinding the stack, but this is totally unnecessary when returning from recursive function calls. The top of the stack already contains the final return value, so we could skip unwinding the stack altogether.
And at this point, we have to ask… why use the stack at all? We could write this using a loop instead:
- pop accumulator and value off the stack
- assign accumulator to register AX
- assign value to register CX
- loop until CX < 1:
  - assign result of AX * CX to register AX
  - assign result of CX - 1 to register CX
  - jump back to the beginning of the loop
- pop return address off the stack
- push value of AX onto the stack
- jump to return address
This is TCO in a nutshell - take a recursive algorithm, and restructure it as a loop.
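As a rough sketch of that transformation back in C++ (illustrative hand-written code, not what a compiler literally emits; the name factorial_looped is mine), the tail call simply becomes a jump back to the top of the function, reusing the current stack frame:

// Illustrative only: the tail-recursive factorial rewritten so the
// recursive call becomes a jump, which is roughly what TCO produces.
int factorial_looped(int accumulator, int value) {
top:
    if (value < 1) {
        // Base case: the answer is already in the accumulator
        return accumulator;
    }
    accumulator *= value;
    value -= 1;
    goto top;  // "recurse" by jumping, without creating a new stack frame
}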
TCO in C and C++
As a general rule, the C and C++ standards do not prescribe any particular behaviour with regard to optimisations (including TCO). However, we can show that tail call optimisation has been implemented in at least one compiler. I say ‘at least one’ because the analysis below was performed using gcc 8.2. Analysis with other compilers has been left as an exercise for the reader.
Using Compiler Explorer, we can see the instructions that are generated when we compile the TCO-friendly version of the factorial function:
factorial(int, int):
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
cmp DWORD PTR [rbp-8], 0
jg .L2
mov eax, DWORD PTR [rbp-4]
jmp .L3
.L2:
mov eax, DWORD PTR [rbp-8]
lea edx, [rax-1]
mov eax, DWORD PTR [rbp-4]
imul eax, DWORD PTR [rbp-8]
mov esi, edx
mov edi, eax
call factorial(int, int)
nop
.L3:
leave
ret
This is compiled using gcc 8.2, with optimisation disabled (-O0). The key thing to notice here is the call factorial(int, int) line. This is the recursive function call we want to avoid.
Recompiling with optimisations enabled (-O1) gives us:
factorial(int, int):
mov eax, edi
test esi, esi
jle .L5
sub rsp, 8
mov eax, esi
lea esi, [rsi-1]
imul edi, eax
call factorial(int, int)
add rsp, 8
ret
.L5:
ret
Shorter, but still recursive. Let’s try -O2:
factorial(int, int):
mov eax, edi
test esi, esi
jle .L5
.L2:
imul eax, esi
sub esi, 1
jne .L2
.L5:
ret
We have TCO!
Good thing we wrote our function in a TCO-friendly style.
Hold up…
Now, an important question to ask is whether or not it was necessary to re-write the function to make this optimisation possible. If we inspect the non-TCO-friendly version of the factorial function, compiled with -O2, this is what we get:
factorial(int):
mov eax, 1
cmp edi, 1
jle .L4
.L3:
imul eax, edi
sub edi, 1
cmp edi, 1
jne .L3
ret
.L4:
ret
Well, we see that TCO has been applied here too. The loop contains one additional instruction (an explicit cmp, since this version counts down to 1 rather than 0), but that makes little practical difference. And we have the benefit of not having to supply an initial value for the accumulator.
Because we can…
Finally, I thought it would be interesting to look at the instructions that are generated for a purely iterative version of factorial, altered to use a while loop:
int factorial(int accumulator, int count) {
    while (count > 1) {
        accumulator *= count;
        count--;
    }
    return accumulator;
}
For the benefit of this comparison, this version includes the accumulator argument. This is what we get with -O2:
factorial(int, int):
mov eax, edi
cmp esi, 1
jle .L2
.L3:
imul eax, esi
sub esi, 1
cmp esi, 1
jne .L3
.L2:
ret
This is virtually identical to the TCO-friendly version of our function.
Conclusion
The takeaway from this is that unless you have evidence that this is a hotspot in your program, thinking about optimisations such as TCO is probably not the best use of your time!
We can just be thankful that most of the time, the compiler takes care of all of this for us.