Tail Call Optimisation in Common Lisp Implementations

While the Common Lisp standard does not require implementations to eliminate tail calls, some do—thereby allowing a more functional programming style. Especially in the prototyping phase, it can be useful to know where this functional style can be employed and where it's likely to cause stack overflow. Since I couldn't find a single document outlining support for tail call optimisation, I've written it.

In what follows, I distinguish between Tail Call Optimisation (TCO)—where all calls in tail position are optimised into jumps—and Tail Self-Call Optimisation, a special case of TCO where only self-calls are optimised (this is usually what people want when they refer to TCO as it allows for the recursive style). TCO is sometimes referred to as proper tail recursion.

This document is still incomplete; feedback would be appreciated (especially where claims are just outright wrong!).

Table of Contents

1 Findings

The following implementations were found to provide full TCO:

  • CMUCL
  • SBCL
  • CCL
  • Allegro
  • LispWorks

…only optimise self-calls in tail position:

  • CLisp
  • GCL
  • ECL (when explicitly compiled)

…neither:

  • ABCL

General points:

  • Allegro's documentation is the most explicit.
  • Most (if not all) explanations refer to disassembly listings for demonstration. YMMV.

2 Implementation Details

2.1 DONE Cmucl

TCO
yes

2.1.1 Documentation

From http://common-lisp.net/project/cmucl/doc/cmu-user/debugger.html#toc98,

Both the compiler and the interpreter are ``properly tail recursive.'' If a function call is in a tail-recursive position, the stack frame will be deallocated at the time of the call, rather than after the call returns.

"Properly tail recursive" refers to all calls in tail position as is made clear below.

2.1.2 DONE Demonstration

Using: CMU Common Lisp 20b (20B Unicode)

Let's take a look at some disassembled functions. First up, tail recursion:

* (defun foo () (foo))

FOO
* (disassemble 'foo)
; Compiling LAMBDA NIL:
; Compiling Top-Level Form:

480B7900:       .ENTRY "LAMBDA NIL"()        ; (FUNCTION NIL *)
18:       POP       DWORD PTR [EBP-8]
1B:       LEA       ESP, [EBP-32]

1E:       TEST      ECX, ECX           ; [:NON-LOCAL-ENTRY]
20:       JNE       L0

22:       MOV       EAX, [#x480B78F8]  ; #<FDEFINITION object for FOO>
; No-arg-parsing entry point
; [:NON-LOCAL-ENTRY]
28:       XOR       ECX, ECX
2A:       PUSH      DWORD PTR [EBP-8]

;;; [3] (FOO)

2D:       JMP       DWORD PTR [EAX+5]  ; [:CALL-SITE]
30: L0:   BREAK     10                 ; Error trap
32:       BYTE      #x02
33:       BYTE      #x1A               ; INVALID-ARGUMENT-COUNT-ERROR
34:       BYTE      #x4F               ; ECX

Note the JMP DWORD PTR [EAX+5] in place of a CALL; self-calls are clearly optimised into jumps. Next, let's take a look at what happens when a second function calls foo:

* (defun bar () (foo))

BAR
* (disassemble 'bar)
; Compiling LAMBDA NIL:
; Compiling Top-Level Form:

48109720:       .ENTRY "LAMBDA NIL"()        ; (FUNCTION NIL *)
38:       POP       DWORD PTR [EBP-8]
3B:       LEA       ESP, [EBP-32]

3E:       TEST      ECX, ECX           ; [:NON-LOCAL-ENTRY]
40:       JNE       L0

42:       MOV       EAX, [#x48109718]  ; #<FDEFINITION object for FOO>
; No-arg-parsing entry point
; [:NON-LOCAL-ENTRY]
48:       XOR       ECX, ECX
4A:       PUSH      DWORD PTR [EBP-8]

;;; [3] (FOO)

4D:       JMP       DWORD PTR [EAX+5]  ; [:CALL-SITE]
50: L0:   BREAK     10                 ; Error trap
52:       BYTE      #x02
53:       BYTE      #x1A               ; INVALID-ARGUMENT-COUNT-ERROR
54:       BYTE      #x4F               ; ECX

Again, there's a JMP in place of the call; the presence of foo in tail-position does not create an additional frame.

Finally, a marginally more rich example; here we compare two functions in the body of baz:

* (defun baz () (format t "Hello") (bar))

BAZ
* (disassemble 'baz)
; Compiling LAMBDA NIL:
; Compiling Top-Level Form:

48162AF0:       .ENTRY "LAMBDA NIL"()        ; (FUNCTION NIL *)
B08:       POP       DWORD PTR [EBP-8]
B0B:       LEA       ESP, [EBP-32]

B0E:       TEST      ECX, ECX           ; [:NON-LOCAL-ENTRY]
B10:       JNE       L0

B12:       MOV       EBX, ESP           ; No-arg-parsing entry point
; [:NON-LOCAL-ENTRY]
B14:       SUB       ESP, 12
B17:       MOV       EDX, 671088679     ; T
B1C:       MOV       EDI, [#x48162AE0]  ; "Hello"
B22:       MOV       EAX, [#x48162AE4]  ; #<FDEFINITION object for FORMAT>
B28:       MOV       ECX, 8
B2D:       MOV       [EBX-4], EBP
B30:       MOV       EBP, EBX

;;; [3] (FORMAT T "Hello")

B32:       CALL      DWORD PTR [EAX+5]  ; [:CALL-SITE]
B35:       MOV       ESP, EBX           ; [:SINGLE-VALUE-RETURN]
B37:       MOV       EAX, [#x48162AE8]  ; #<FDEFINITION object for BAR>
B3D:       XOR       ECX, ECX
B3F:       PUSH      DWORD PTR [EBP-8]

;;; [4] (BAR)

B42:       JMP       DWORD PTR [EAX+5]  ; [:CALL-SITE]
B45:       NOP
B46:       NOP
B47:       NOP
B48: L0:   BREAK     10                 ; Error trap
B4A:       BYTE      #x02
B4B:       BYTE      #x1A               ; INVALID-ARGUMENT-COUNT-ERROR
B4C:       BYTE      #x4F               ; ECX

Notice how the format is called with a CALL instruction (CALL DWORD PTR [EAX+5]) while bar is once again JMP'd to.

2.2 DONE SBCL

2.2.1 Documentation

SBCL's documentation (http://www.sbcl.org/manual/index.html#Debug-Tail-Recursion) should be familiar:

The compiler is "properly tail recursive." If a function call is in a tail-recursive position, the stack frame will be deallocated at the time of the call, rather than after the call returns.

2.2.2 Discussion

A few notes:

Nikodemus Siivola on sbcl-help: http://permalink.gmane.org/gmane.lisp.steel-bank.general/3272,

Tail-merging is done if (> SPACE DEBUG) or (> SPEED DEBUG) – not by default.

and http://permalink.gmane.org/gmane.lisp.steel-bank.general/3279,

jsnell once gave me this recipe to ensure that SBCL will do TCO is:

(declaim (optimize #+sbcl (sb-c::merge-tail-calls 3) #+sbcl (sb-c::insert-debug-catch 0)))

Similarly http://c2.com/cgi/wiki?TailRecursion,

sbcl does this if you use the proper optimization declarations. I know for certain that a:

(declaim (optimize (debug 0) (safety 0) (speed 3)))

will enable the optimization of tail-calls into gotos globally.

2.2.3 Demonstration

(Using 1.0.51 on Darwin)

The following should come as no surprise given the discussion of CMUCL above. Note that no optimisation has been requested using declaim in these examples.

Tail recursion:

* (defun foo () (foo))

FOO
* (disassemble 'foo)

; disassembly for FOO
; 03D20574:       488B05A5FFFFFF   MOV RAX, [RIP-91]          ; #<FDEFINITION object for FOO>
; no-arg-parsing entry point
;       7B:       31C9             XOR ECX, ECX
;       7D:       FF7508           PUSH QWORD PTR [RBP+8]
;       80:       FF6009           JMP QWORD PTR [RAX+9]
;       83:       0F0B0A           BREAK 10                   ; error trap
;       86:       02               BYTE #X02
;       87:       18               BYTE #X18                  ; INVALID-ARG-COUNT-ERROR
;       88:       54               BYTE #X54                  ; RCX
NIL

Notice the JMP instruction at 03D20580.

Full TCO:

* (defun bar () (foo))

BAR
* (disassemble 'bar)

; disassembly for BAR
; 03D9BC34:       488B05A5FFFFFF   MOV RAX, [RIP-91]          ; #<FDEFINITION object for FOO>
; no-arg-parsing entry point
;       3B:       31C9             XOR ECX, ECX
;       3D:       FF7508           PUSH QWORD PTR [RBP+8]
;       40:       FF6009           JMP QWORD PTR [RAX+9]
;       43:       0F0B0A           BREAK 10                   ; error trap
;       46:       02               BYTE #X02
;       47:       18               BYTE #X18                  ; INVALID-ARG-COUNT-ERROR
;       48:       54               BYTE #X54                  ; RCX

Here the foo is JMP'd to, not a CALL'd.

Finally, a fuller example of the above,

* (defun baz () (format t "Hello~%"))
BAZ

* (defun qux () (baz) (bar))

QUX
* (disassemble 'qux)

; disassembly for QUX
; 03E355B4:       488D5424F0       LEA RDX, [RSP-16]          ; no-arg-parsing entry point
;       B9:       4883EC18         SUB RSP, 24
;       BD:       488B059CFFFFFF   MOV RAX, [RIP-100]         ; #<FDEFINITION object for BAZ>
;       C4:       31C9             XOR ECX, ECX
;       C6:       48892A           MOV [RDX], RBP
;       C9:       488BEA           MOV RBP, RDX
;       CC:       FF5009           CALL QWORD PTR [RAX+9]
;       CF:       480F42E3         CMOVB RSP, RBX
;       D3:       488B058EFFFFFF   MOV RAX, [RIP-114]         ; #<FDEFINITION object for BAR>
;       DA:       31C9             XOR ECX, ECX
;       DC:       FF7508           PUSH QWORD PTR [RBP+8]
;       DF:       FF6009           JMP QWORD PTR [RAX+9]
;       E2:       0F0B0A           BREAK 10                   ; error trap
;       E5:       02               BYTE #X02
;       E6:       18               BYTE #X18                  ; INVALID-ARG-COUNT-ERROR
;       E7:       54               BYTE #X54                  ; RCX
NIL

Execution of baz is now performed with a CALL, but bar is optimised into a JMP as it sits in tail position.

2.3 TODO CLisp

Tail self-call optimisation
yes
TCO
seems not [1/2]
  • [X] Find a clear mention (it's not in the docs).
  • [ ] Are there any available declarations that provide full TCO?
    • So far, it seems not. JMPTAIL is used for self-calls in tail position.

The best entry point I found is http://eli.thegreenplace.net/2007/06/26/sicp-section-121/,

Indeed, CLISP implements the tail call optimization. To see one difference between them, we can run the two factorial functions on some large input (say 4000). While factorial-rec will fail with a stack overflow, factorial-iter won’t – this is because the stack does not grow.

Note that this refers to self-call TCO, not the more general form.

2.3.2 Discussion

Self-call elimination in tail position is performed by the simplify function defined in compiler.lisp:

http://clisp.hg.sourceforge.net/hgweb/clisp/clisp/file/6afcecc6bb2c/src/compiler.lisp#l8992

2.3.3 Demonstration

As of 2.49, we get a jump if we recurse from tail position (actually, the special JMPTAIL bytecode instruction),

[1]> (defun foo () (foo))
FOO
[2]> (disassemble 'foo)

Disassembly of function FOO
0 required arguments
0 optional arguments
No rest parameter
No keyword parameters
3 byte-code instructions:
0     (JMPTAIL 0 1 L4)
4     L4
4     (JMPTAIL 0 1 L4)
NIL

but not when performing a tail call,

[1]> (defun bar () 5)
BAR
[2]> (defun baz () (bar))
BAZ
[3]> (disassemble 'baz)

Disassembly of function BAZ
(CONST 0) = BAR
0 required arguments
0 optional arguments
No rest parameter
No keyword parameters
2 byte-code instructions:
0     (CALL0 0)                           ; BAR
2     (SKIP&RET 1)
NIL

The lack of general TCO is evident when we write a 'trampoline' call:

[1]> (defun foo () (format t "Hi~%") (bar))
[2]> (defun bar () (format t "There~%") (foo))
[3]> (foo)
Hi
There
[...]
*** - Program stack overflow. RESET

The above definition of foo disassembles as follows,

[4]> (disassemble 'foo)

Disassembly of function FOO
(CONST 0) = #<COMPILED-FUNCTION FOO-1>
(CONST 1) = *STANDARD-OUTPUT*
(CONST 2) = BAR
0 required arguments
0 optional arguments
No rest parameter
No keyword parameters
reads special variable: *STANDARD-OUTPUT*
5 byte-code instructions:
0     (CONST&PUSH 0)                      ; #<COMPILED-FUNCTION FOO-1>
1     (GETVALUE&PUSH 1)                   ; *STANDARD-OUTPUT*
3     (CALLSR 1 21)                       ; FUNCALL
6     (CALL0 2)                           ; BAR
8     (SKIP&RET 1)
NIL

The (CALL0 2) instruction calls consts[2], bar, with no arguments before returning.

2.4 DONE Clozure Common Lisp

TCO
yes

2.4.1 Documentation

The CCL optimisation documentation, http://trac.clozure.com/ccl/wiki/DeclareOptimize, states:

For instance, a tail-call is optimized (the caller's stack frame is reused) if the POLICY.ALLOW-TAIL-RECURSION-ELIMINATION function in the current policy returns true.

This is a tad ambiguous. The name of this policy suggest that only self-calls are eliminated; this is not in fact the case.

2.4.2 Demonstration

Using Clozure Common Lisp Version 1.7 (DarwinX8664),

? (defun foo () (foo))
FOO
? (disassemble 'foo)
;;; (defun foo () (foo))
L0
[0] (leaq (@ (:^ L0) (% rip)) (% fn))
[7] (testl (% nargs) (% nargs))
[9] (jne L21)
[11] (pushq (% rbp))
[12] (movq (% rsp) (% rbp))

;;; (foo)
L15
[15] (jmp L15)

;;; #<no source text>
L21
[21] (uuo-error-wrong-number-of-args)
NIL

So tail recursion elimination works as advertised: jmp L15. How about general tail call optimisation?

? (defun bar () (foo))
BAR
? (disassemble 'bar)
;;; (defun bar () (foo))
L0
[0] (leaq (@ (:^ L0) (% rip)) (% fn))
[7] (testl (% nargs) (% nargs))
[9] (jne L33)
[11] (pushq (% rbp))
[12] (movq (% rsp) (% rbp))

;;; (foo)
[15] (xorl (% nargs) (% nargs))
[17] (movq (@ 'FOO (% fn)) (% temp0))
[24] (leaveq)
[25] (jmpq (@ 10 (% temp0)))

;;; #<no source text>
L33
[33] (uuo-error-wrong-number-of-args)
NIL

Again, foo is jumped to: (jmpq (@ 10 (% temp0))).

Fuller example: compare baz and bar in the following,

? (defun baz () (format t ""))
BAZ
? (defun qux () (baz) (bar))
QUX
? (disassemble 'qux)
;;; (defun qux () (baz) (bar))
L0
[0] (leaq (@ (:^ L0) (% rip)) (% fn))
[7] (testl (% nargs) (% nargs))
[9] (jne L53)
[11] (pushq (% rbp))
[12] (movq (% rsp) (% rbp))

;;; (baz)
[15] (xorl (% nargs) (% nargs))
[17] (movq (@ 'BAZ (% fn)) (% temp0))
[24] (nop)
[26] (callq (@ 10 (% temp0)))
[29] (leaq (@ (:^ L0) (% rip)) (% fn))

;;; (bar)
[36] (xorl (% nargs) (% nargs))
[38] (movq (@ 'BAR (% fn)) (% temp0))
[45] (leaveq)
[46] (jmpq (@ 10 (% temp0)))

;;; #<no source text>
L53
[53] (uuo-error-wrong-number-of-args)
NIL

Here baz is executed with callq while bar, being in tail position, is jumped to.

2.5 DONE LispWorks

TCO
yes

2.5.1 Documentation

See http://www.lispworks.com/documentation/lw60/LW/html/lw-110.htm#pgfId-889221,

In 64-bit LispWorks and on x86 platforms the compiler optimizes tail calls unless

  1. The compiler optimize quality debug is 3, or
  2. There is something with dynamic scope on the stack, such as a special binding, a catch or dynamic-extent allocation (so it is not really a tail call)

On all other platforms the compiler optimizes tail calls unless 1.) or 2.) above apply, or

  • The call has more than 4 arguments and this is more than the number of fixed (not &optional / &rest / &key ) parameters in the calling function.
  • The call has more than 4 arguments and the calling function has &rest / &key parameters.

Consider, http://www.lispworks.com/documentation/lcl50/aug/aug-51.html:

Unless directed otherwise, the Compiler optimizes self-recursive function calls, tail calls, and self-tail calls. In particular, self-tail calls are automatically compiled as loops. While these function calls are efficient, they can be difficult to trace because they do not appear on the stack.

and

When the Compiler compiles either a tail call or a self-tail call, it reuses the calling function's stack frame rather than creating a new stack frame. This optimization is called tail merging.

2.5.2 DONE Demonstration

Self-calls in tail position,

CL-USER 1 > (defun foo () (foo))
FOO

CL-USER 2 > (disassemble 'foo)
200C8C1A:
0:       89E6             move  esi, esp
2:       81CEFCFF0F00     or    esi, FFFFC
8:       3966D0           cmp   [esi-30], esp
11:      7311             jnb   L1
13:      80FD00           cmpb  ch, 0
16:      750C             jne   L1
18:      55               push  ebp
19:      89E5             move  ebp, esp
21:      B500             moveb ch, 0
23:      C9               leave
24:      FF25F8867221     jmp   [217286F8]       ; FOO
L1: 30:  E8B5580500       call  2011E4F2         ; #<Function RUNTIME:BAD-ARGS-OR-STACK 2011E4F2>
35:      90               nop
36:      90               nop
37:      90               nop
NIL

…and general TCO,

CL-USER 3 > (defun bar () (format t "") (foo))
BAR

CL-USER 4 > (disassemble 'bar)
21ACF682:
0:       89E6             move  esi, esp
2:       81CEFCFF0F00     or    esi, FFFFC
8:       3966D0           cmp   [esi-30], esp
11:      7323             jnb   L1
13:      80FD00           cmpb  ch, 0
16:      751E             jne   L1
18:      55               push  ebp
19:      89E5             move  ebp, esp
21:      6803400020       push  20004003         ; T
26:      B502             moveb ch, 2
28:      B88B540A20       move  eax, 200A548B    ; ""
33:      FF158C950320     call  [2003958C]       ; LW-XP:INTERNAL-FORMAT
39:      B500             moveb ch, 0
41:      C9               leave
42:      FF25F8867221     jmp   [217286F8]       ; FOO
L1: 48:  E83BEE64FE       call  2011E4F2         ; #<Function RUNTIME:BAD-ARGS-OR-STACK 2011E4F2>
53:      90               nop
NIL

2.6 DONE Allegro

TCO
yes

2.6.1 Documentation

Tail merging, http://www.franz.com/support/documentation/current/doc/compiling.htm#tail-merge-disc-2

The two switches tail-call-self-merge-switch and tail-call-non-self-merge-switch control tail merging.

http://www.franz.com/support/documentation/current/doc/variables/compiler/tail-call-self-merge-switch.htm

If true, compiler will perform self-tail-merging (for functions which call themselves). This switch will affect code only when true at the beginning of the function being self-called (and no other time)….

Initially true when speed is greater than 0.

http://www.franz.com/support/documentation/current/doc/variables/compiler/tail-call-non-self-merge-switch.htm

If true, compiler will perform non-self-tail-merging (for functions in the tail position different than the function being executed). This switch will affect code only when true at the point of the call.

Initially true if speed is greater than 1 and debug less than 2.

2.6.2 Demonstration

Allegro 8.2 64-bit, OSX.

By default alisp does not produce the fastest possible code in the interpreter,

CL-USER> (defun foo () (foo))
FOO
CL-USER> (disassemble 'foo)
;; disassembly of #<Function (:ANONYMOUS-LAMBDA 1) @ #x2129335a>
;; formals:
;; constant vector:
0: FOO

;; code start: #x2129330c:
   0: 55             pushl      ebp
   1: 8b ec         movl        ebp,esp
   3: 83 ec 38      subl        esp,$56
   6: 89 75 fc      movl        [ebp-4],esi
   9: 89 5d e4      movl        [ebp-28],ebx
  12: 39 63 1a      cmpl        [ebx+26],esp
  15: 76 02         jbe         19
  17: cd 65         int         $101          ; SYS::TRAP-STACK-OVFL
  19: e3 02         jcxz        23
  21: cd 61         int         $97           ; SYS::TRAP-ARGERR
  23: 80 7f cb 00   cmpb        [edi-53],$0 ; SYS::C_INTERRUPT-PENDING
  27: 74 02         jz          31
  29: cd 64         int         $100          ; SYS::TRAP-SIGNAL-HIT
  31: 8b 5e 12      movl        ebx,[esi+18]  ; FOO
  34: b1 00         movb        cl,$0
  36: ff d7         call        *edi
  38: c9            leave
  39: 8b 75 fc      movl        esi,[ebp-4]
  42: c3            ret
  43: 90            nop
; No value

While (compile 'foo) will optimise the above, it does not have the desired effect across the board:

CL-USER> (defun qux () (foo))
QUX
CL-USER> (disassemble 'qux)
;; disassembly of #<Function (:ANONYMOUS-LAMBDA 3) @ #x207655fa>
;; formals:
;; constant vector:
0: FOO

;; code start: #x207655ac:
   0: 55             pushl      ebp
   1: 8b ec          movl       ebp,esp
   3: 83 ec 38       subl       esp,$56
   6: 89 75 fc       movl       [ebp-4],esi
   9: 89 5d e4       movl       [ebp-28],ebx
  12: 39 63 1a       cmpl       [ebx+26],esp
  15: 76 02          jbe        19
  17: cd 65          int        $101          ; SYS::TRAP-STACK-OVFL
  19: e3 02          jcxz       23
  21: cd 61          int        $97           ; SYS::TRAP-ARGERR
  23: 80 7f cb 00    cmpb       [edi-53],$0   ; SYS::C_INTERRUPT-PENDING
  27: 74 02          jz         31
  29: cd 64          int        $100          ; SYS::TRAP-SIGNAL-HIT
  31: 8b 5e 12       movl       ebx,[esi+18]  ; FOO
  34: b1 00          movb       cl,$0
  36: ff d7          call       *edi
  38: c9             leave
  39: 8b 75 fc       movl       esi,[ebp-4]
  42: c3             ret
  43: 90             nop
; No value
CL-USER> (compile 'qux)
QUX
NIL
NIL
CL-USER> (disassemble 'qux)
;; disassembly of #<Function QUX>
;; formals:
;; constant vector:
0: FOO

;; code start: #x205d50bc:
   0: 55            pushl       ebp
   1: 8b ec         movl        ebp,esp
   3: 83 ec 38      subl        esp,$56
   6: 89 75 fc      movl        [ebp-4],esi
   9: 89 5d e4      movl        [ebp-28],ebx
  12: 39 63 1a      cmpl        [ebx+26],esp
  15: 76 02         jbe         19
  17: cd 65         int         $101          ; SYS::TRAP-STACK-OVFL
  19: e3 02         jcxz        23
  21: cd 61         int         $97           ; SYS::TRAP-ARGERR
  23: 80 7f cb 00   cmpb        [edi-53],$0   ; SYS::C_INTERRUPT-PENDING
  27: 74 02         jz          31
  29: cd 64         int         $100          ; SYS::TRAP-SIGNAL-HIT
  31: 8b 5e 12      movl        ebx,[esi+18]  ; FOO
  34: b1 00         movb        cl,$0
  36: ff d7         call        *edi
  38: c9            leave
  39: 8b 75 fc      movl        esi,[ebp-4]
  42: c3            ret
  43: 90            nop
; No value
CL-USER>

Note the call to foo even though that call sits in tail position. This behaviour is a result of the default optimisation settings (as indicated on startup),

Optimization settings: safety 1, space 1, speed 1, debug 2

In order to get the fastest possible code, the following form can be evaluated,

CL-USER> (proclaim '(optimize (speed 3) (safety 1) (space 0) (debug 0)))
T

thereby yielding the expected results,

CL-USER> (defun bar () (bar))
BAR
CL-USER> (disassemble 'bar)
;; disassembly of #<Function (:ANONYMOUS-LAMBDA 2) @ #x212f493a>
;; formals:
;; constant vector:
0: BAR

;; code start: #x212f4904:
   0: e3 02         jcxz        4
   2: cd 61         int         $97            ; SYS::TRAP-ARGERR
   4: 80 7f cb 00   cmpb        [edi-53],$0    ; SYS::C_INTERRUPT-PENDING
   8: 74 02         jz          12
  10: cd 64         int         $100           ; SYS::TRAP-SIGNAL-HIT
  12: 8b 5e 12      movl        ebx,[esi+18]   ; BAR
  15: b1 00         movb        cl,$0
  17: ff e7         jmp         *edi
  19: 90            nop
; No value
CL-USER> (defun baz () (bar))
BAZ
CL-USER> (disassemble 'baz)
;; disassembly of #<Function (:ANONYMOUS-LAMBDA 3) @ #x21318aca>
;; formals:
;; constant vector:
0: BAR

;; code start: #x21318a94:
   0: e3 02         jcxz        4
   2: cd 61         int         $97            ; SYS::TRAP-ARGERR
   4: 80 7f cb 00   cmpb        [edi-53],$0    ; SYS::C_INTERRUPT-PENDING
   8: 74 02         jz          12
  10: cd 64         int         $100           ; SYS::TRAP-SIGNAL-HIT
  12: 8b 5e 12      movl        ebx,[esi+18]   ; BAR
  15: b1 00         movb        cl,$0
  17: ff e7         jmp         *edi
  19: 90            nop
; No value
CL-USER>

2.7 TODO GNU Common Lisp

  • Still TODO [0/2]
    • [ ] Pick apart the disassembly listings,
    • [ ] Clarify whether general TCO is ever performed i.e., not just self-calls.

2.7.1 Demonstration

Ubuntu 10.10, GCL (GNU Common Lisp) 2.6.7 CLtL1.

Note that this section is still rough

Self-calls in tail position are eliminated and replaced with JMP,

>(defun foo () (foo))

FOO

>(disassemble 'foo)

Compiling /tmp/gazonk_608_0.lsp.
End of Pass 1.

;; Note: Tail-recursive call of FOO was replaced by iteration.
End of Pass 2.
OPTIMIZE levels: Safety=0 (No runtime error checking), Space=0, Speed=3
Finished compiling /tmp/gazonk_608_0.lsp.

#include "gazonk_608_0.h"
void init_code(){do_init((void *)VV);}
/*      function definition for FOO     */

static void L1()
{register object *base=vs_base;
        register object *sup=base+VM1; VC1
        vs_check;
        vs_top=sup;
        goto TTL;
TTL:;
        goto TTL;
}

#(
#((system::%init . #((system::mf (lisp::quote user::foo) 0))))
)

static void L1();
#define VC1
#define VM1 0
static void * VVi[1]={
#define Cdata VV[0]
(void *)(L1)
};
#define VV (VVi)

/tmp/gazonk_608_0.o:     file format elf32-i386


Disassembly of section .text:

00000000 <init_code>:
init_code():
/tmp/gazonk_608_0.c:5166
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   83 ec 18                sub    $0x18,%esp
   6:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
   d:   e8 fc ff ff ff          call   e <init_code+0xe>
  12:   c9                      leave
  13:   c3                      ret
  14:   8d b6 00 00 00 00       lea    0x0(%esi),%esi
  1a:   8d bf 00 00 00 00       lea    0x0(%edi),%edi

00000020 <L1>:
L1():
/tmp/gazonk_608_0.c:5170
  20:   55                      push   %ebp
  21:   89 e5                   mov    %esp,%ebp
  23:   53                      push   %ebx
  24:   83 ec 04                sub    $0x4,%esp
/tmp/gazonk_608_0.c:5172
  27:   a1 00 00 00 00          mov    0x0,%eax
  2c:   3b 05 00 00 00 00       cmp    0x0,%eax
/tmp/gazonk_608_0.c:5170
  32:   8b 1d 00 00 00 00       mov    0x0,%ebx
/tmp/gazonk_608_0.c:5172
  38:   73 08                   jae    42 <L1+0x22>
/tmp/gazonk_608_0.c:5173
  3a:   89 1d 00 00 00 00       mov    %ebx,0x0
  40:   eb fe                   jmp    40 <L1+0x20>
/tmp/gazonk_608_0.c:5172
  42:   e8 fc ff ff ff          call   43 <L1+0x23>
  47:   eb f1                   jmp    3a <L1+0x1a>
T

Consider the following where format is called before recursing,

>(defun baz () (format t "") (baz))

BAZ

>(disassemble 'baz)

Compiling /tmp/gazonk_654_0.lsp.
End of Pass 1.

;; Note: Tail-recursive call of BAZ was replaced by iteration.
End of Pass 2.
OPTIMIZE levels: Safety=0 (No runtime error checking), Space=0, Speed=3
Finished compiling /tmp/gazonk_654_0.lsp.

#include "gazonk_654_0.h"
void init_code(){do_init((void *)VV);}
/*      function definition for BAZ     */

static void L1()
{register object *base=vs_base;
        register object *sup=base+VM1; VC1
        vs_check;
        vs_top=sup;
        goto TTL;
TTL:;
        base[0]= Ct;
        base[1]= ((object)VV[0]);
        vs_top=(vs_base=base+0)+2;
        Lformat();
        vs_top=sup;
        goto TTL;
}

#(
#("" (system::%init . #((system::mf (lisp::quote user::baz) 0))))
)

static void L1();
#define VC1
#define VM1 2
static void * VVi[2]={
#define Cdata VV[1]
(void *)(L1)
};
#define VV (VVi)

/tmp/gazonk_654_0.o:     file format elf32-i386


Disassembly of section .text:

00000000 <init_code>:
init_code():
/tmp/gazonk_654_0.c:5166
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   83 ec 18                sub    $0x18,%esp
   6:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
   d:   e8 fc ff ff ff          call   e <init_code+0xe>
  12:   c9                      leave
  13:   c3                      ret
  14:   8d b6 00 00 00 00       lea    0x0(%esi),%esi
  1a:   8d bf 00 00 00 00       lea    0x0(%edi),%edi

00000020 <L1>:
L1():
/tmp/gazonk_654_0.c:5170
  20:   55                      push   %ebp
  21:   89 e5                   mov    %esp,%ebp
  23:   57                      push   %edi
  24:   56                      push   %esi
  25:   53                      push   %ebx
  26:   83 ec 0c                sub    $0xc,%esp
29:     8b 1d 00 00 00 00       mov    0x0,%ebx
/tmp/gazonk_654_0.c:5172
  2f:   a1 00 00 00 00          mov    0x0,%eax
  34:   3b 05 00 00 00 00       cmp    0x0,%eax
/tmp/gazonk_654_0.c:5171
  3a:   8d 73 08                lea    0x8(%ebx),%esi
/tmp/gazonk_654_0.c:5172
  3d:   73 29                   jae    68 <L1+0x48>
/tmp/gazonk_654_0.c:5173
  3f:   8d 7b 04                lea    0x4(%ebx),%edi
  42:   8d b6 00 00 00 00       lea    0x0(%esi),%esi
/tmp/gazonk_654_0.c:5177
  48:   a1 00 00 00 00          mov    0x0,%eax
/tmp/gazonk_654_0.c:5176
  4d:   c7 03 00 00 00 00       movl   $0x0,(%ebx)
/tmp/gazonk_654_0.c:5178
  53:   89 1d 00 00 00 00       mov    %ebx,0x0
  59:   89 35 00 00 00 00       mov    %esi,0x0
/tmp/gazonk_654_0.c:5177
  5f:   89 07                   mov    %eax,(%edi)
/tmp/gazonk_654_0.c:5179
  61:   e8 fc ff ff ff          call   62 <L1+0x42>
  66:   eb e0                   jmp    48 <L1+0x28>
/tmp/gazonk_654_0.c:5172
  68:   e8 fc ff ff ff          call   69 <L1+0x49>
  6d:   8d 76 00                lea    0x0(%esi),%esi
  70:   eb cd                   jmp    3f <L1+0x1f>
T

GCL doesn't seem to provide generalised TCO though:

>(defun bar () (foo))

BAR

>(disassemble 'bar)

Compiling /tmp/gazonk_608_0.lsp.
End of Pass 1.
End of Pass 2.
OPTIMIZE levels: Safety=0 (No runtime error checking), Space=0, Speed=3
Finished compiling /tmp/gazonk_608_0.lsp.

#include "gazonk_608_0.h"
void init_code(){do_init((void *)VV);}
/*      function definition for BAR     */

static void L1()
{register object *base=vs_base;
        register object *sup=base+VM1; VC1
        vs_check;
        vs_top=sup;
        goto TTL;
TTL:;
        vs_base=vs_top;
        (void) (*Lnk0)();
        return;
}
static void LnkT0(){ ;} /* FOO */

#(
#(user::foo (system::%init . #((system::mf (lisp::quote user::bar) 0))))
)

static void L1();
#define VC1
#define VM1 0
static void * VVi[2]={
#define Cdata VV[1]
(void *)(L1)
};
#define VV (VVi)
static void LnkT0();
static void (*Lnk0)() = LnkT0;

/tmp/gazonk_608_0.o:     file format elf32-i386


Disassembly of section .text:

00000000 <init_code>:
init_code():
/tmp/gazonk_608_0.c:5166
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   83 ec 18                sub    $0x18,%esp
   6:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
   d:   e8 fc ff ff ff          call   e <init_code+0xe>
  12:   c9                      leave
  13:   c3                      ret
  14:   8d b6 00 00 00 00       lea    0x0(%esi),%esi
  1a:   8d bf 00 00 00 00       lea    0x0(%edi),%edi

00000020 <L1>:
L1():
/tmp/gazonk_608_0.c:5170
  20:   55                      push   %ebp
  21:   89 e5                   mov    %esp,%ebp
  23:   53                      push   %ebx
  24:   83 ec 04                sub    $0x4,%esp
/tmp/gazonk_608_0.c:5172
  27:   a1 00 00 00 00          mov    0x0,%eax
  2c:   3b 05 00 00 00 00       cmp    0x0,%eax
/tmp/gazonk_608_0.c:5170
  32:   8b 1d 00 00 00 00       mov    0x0,%ebx
/tmp/gazonk_608_0.c:5172
  38:   73 1e                   jae    58 <L1+0x38>
/tmp/gazonk_608_0.c:5177
  3a:   a1 08 00 00 00          mov    0x8,%eax
/tmp/gazonk_608_0.c:5173
  3f:   89 1d 00 00 00 00       mov    %ebx,0x0
/tmp/gazonk_608_0.c:5176
  45:   89 1d 00 00 00 00       mov    %ebx,0x0
/tmp/gazonk_608_0.c:5179
  4b:   83 c4 04                add    $0x4,%esp
  4e:   5b                      pop    %ebx
  4f:   5d                      pop    %ebp
/tmp/gazonk_608_0.c:5177
  50:   ff e0                   jmp    *%eax
  52:   8d b6 00 00 00 00       lea    0x0(%esi),%esi
/tmp/gazonk_608_0.c:5172
  58:   e8 fc ff ff ff          call   59 <L1+0x39>
  5d:   eb db                   jmp    3a <L1+0x1a>
  5f:   90                      nop

00000060 <LnkT0>:
LnkT0():
/tmp/gazonk_608_0.c:5180
  60:   55                      push   %ebp
  61:   89 e5                   mov    %esp,%ebp
  63:   83 ec 18                sub    $0x18,%esp
  66:   a1 00 00 00 00          mov    0x0,%eax
  6b:   c7 44 24 04 08 00 00    movl   $0x8,0x4(%esp)
  72:   00
  73:   89 04 24                mov    %eax,(%esp)
  76:   e8 fc ff ff ff          call   77 <LnkT0+0x17>
  7b:   c9                      leave
  7c:   c3                      ret
T

2.8 DONE Embeddable Common Lisp

Tail self-call optimisation
yes, when explicitly compiled
TCO
no

2.8.1 Documentation

Tail-call elimination in the recursion case is mentioned on http://ecls.sourceforge.net/new-manual/ch27s04.html.

2.8.2 Demonstration

Under ECL (Embeddable Common-Lisp) 10.3.1, Ubuntu 10.10, tail self-calls are not optimised into jumps by default:

> (defun foo () (foo))

FOO
> (disassemble 'foo)

Name:               FOO
Documentation:      NIL
Declarations:       ((C-GLOBAL))
0   NOMORE
1   CALLG   0,FOO
4   EXIT
NIL

Compiling functions, however, does yield self-call elimination in tail position. Consider the following transcript,

ECL (Embeddable Common-Lisp) 10.3.1
Copyright (C) 1984 Taiichi Yuasa and Masami Hagiya
Copyright (C) 1993 Giuseppe Attardi
Copyright (C) 2000 Juan J. Garcia-Ripoll
ECL is free software, and you are welcome to redistribute it
under certain conditions; see file 'Copyright' for details.
Type :h for Help.
Top level in: #<process SI:TOP-LEVEL 09858fc0>.
> (defun foo () (bar))

FOO
> (defun bar () (foo))

BAR
> (foo)

C-STACK overflow at size 4227072. Stack can probably be resized.

Available restarts:

1. (CONTINUE) Extend stack size
2. (RESTART-TOPLEVEL) Go back to Top-Level REPL.

Broken at BAR. In: #<process SI:TOP-LEVEL 09858fc0>.
>> (compile 'foo)

;;; OPTIMIZE levels: Safety=2, Space=0, Speed=3, Debug=0
;;;
;;; End of Pass 1.
;;; Note:
;;;   Invoking external command:
;;;   i686-linux-gnu-gcc "-I/usr/include/"  -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -g -O2 -fPIC  -D_THREAD_SAFE -Dlinux -O -w -c "/tmp/ECL001QMn7Ix.c" -o "/tmp/ECL001QMn7Ix.o"
;;;
;;; Note:
;;;   Invoking external command:
;;;   i686-linux-gnu-gcc -o "/tmp/ECL001QMn7Ix.fas" -L"/usr/lib/" "/tmp/ECL001QMn7Ix.o"   -shared -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions  -lecl  -lgmp -lpthread -ldl  -lm
;;;
FOO
NIL
NIL
>> (compile 'bar)

;;; OPTIMIZE levels: Safety=2, Space=0, Speed=3, Debug=0
;;;
;;; End of Pass 1.
;;; Note:
;;;   Invoking external command:
;;;   i686-linux-gnu-gcc "-I/usr/include/"  -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -g -O2 -fPIC  -D_THREAD_SAFE -Dlinux -O -w -c "/tmp/ECL002Bmf0vt.c" -o "/tmp/ECL002Bmf0vt.o"
;;;
;;; Note:
;;;   Invoking external command:
;;;   i686-linux-gnu-gcc -o "/tmp/ECL002Bmf0vt.fas" -L"/usr/lib/" "/tmp/ECL002Bmf0vt.o"   -shared -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions  -lecl  -lgmp -lpthread -ldl  -lm
;;;
BAR
NIL
NIL
>> (foo)

Debugger received error: C-STACK overflow at size 4227072. Stack can probably be resized.
Error flushed.
>> (defun baz () (baz))

BAZ
>> (baz)

;;;
;;; Stack overflow.
;;; Jumping to the outermost toplevel prompt
;;;


Broken at BAR. In: #<process SI:TOP-LEVEL 09858fc0>.
>> (compile 'baz)

;;; OPTIMIZE levels: Safety=2, Space=0, Speed=3, Debug=0
;;;
;;; End of Pass 1.
;;; Note:
;;;   Invoking external command:
;;;   i686-linux-gnu-gcc "-I/usr/include/"  -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -g -O2 -fPIC  -D_THREAD_SAFE -Dlinux -O -w -c "/tmp/ECL0035PQrQP.c" -o "/tmp/ECL0035PQrQP.o"
;;;
;;; Note:
;;;   Invoking external command:
;;;   i686-linux-gnu-gcc -o "/tmp/ECL0035PQrQP.fas" -L"/usr/lib/" "/tmp/ECL0035PQrQP.o"   -shared -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions  -lecl  -lgmp -lpthread -ldl  -lm
;;;
BAZ
NIL
NIL
>> (baz)
[...runs indefinitely...]

Once compiled, baz doesn't terminate. Note that both before and after compilation, trampoline code blows the stack.

2.9 DONE Armed Bear Common Lisp

TCO
no
Tail Self-Call Optimisation
no

2.9.1 Discussion

  • http://blog.sol42.com/yacb/20100805-armed-bear-common-lisp

    This is a well known JVM limitation (you have to resort to loops and trampolines). The Common Lisp standard does not require TCO, but every other Lisp seems to do it, and I, coming from ML and Erlang, am used to programming in a very recursive style. But this is just me, orthodox lispers will tell you that you have to use loops.

2.9.2 Documentation

From the Armed Bear trac, http://trac.common-lisp.net/armedbear/wiki/SystemExecution,

As described elsewhere, the LispObject is the base component of which the Lisp world in ABCL has been built. Everything is a LispObject. LispObject has 10 methods which might get called when evaluating that "Lisp object": they're all called execute and implement the zero - nine argument forms and the array argument form (which allows for non-predetermined numbers of arguments).

When an object is evaluated one of the appropriate forms of the execute method is called. Code deeply nests calls to execute() methods, because the evaluation of a function inherently calls execute() methods to call other functions.

This system leads to large stack sizes: Java doesn't allow tail call elimination.


Date: 2011-10-14 12:14:23 BST

Author: Marc Simpson

Org version 7.7 with Emacs version 24

Validate XHTML 1.0