Based on these notes by Ben Lerner. Adding mutable variables, changes related to concrete syntax, and simplified to skip static typing and ANF.
Standard reminder: Every time we enhance our source language, we need to consider several things:
Its impact on the concrete syntax of the language
Examples using the new enhancements, so we build intuition of them
Its impact on the abstract syntax and semantics of the language
Any new or changed transformations needed to process the new forms
Executable tests to confirm the enhancement works as intended
We’ll extend our treatment of pairs with a way to assign each component:
The two new expression forms allow us to modify the first or second item of the pair given by the first subexpression, and set it to the value of the second subexpression. We need to decide on what these expressions will return; let’s choose to make them return the tuple itself.
(let (x (tup 4 5)) (set-fst x 10)) # ==> (10, 5) (let (y (tup 3 2)) (set-fst (set-snd y 8) 6)) # ==> (6, 8)
Flesh out the semantics of these two new expressions. What new error conditions might there be?
We can add these expression forms to our AST easily enough. This time, the new expressions will be binary primitives:
type expr = ... | Pair of expr * expr type prim1 = ... | Fst | Snd type prim2 = ... | Setfst | Setsnd
Given our representation of pairs (from Adding pairs to our language), it’s
relatively straightforward to compile these two new primitives. We’ll focus on
set-fst; the other is nearly identical. We first compile both
subexpressions, and load them into two registers
ensure that the first argument is a pair by checking its tag, and if so then we
untag the value by subtracting the tag. Now the content of the pair begins at
[RAX], and the first item of the pair is located at offset zero from
RAX: we simply move the second value into that location. However, our
intended semantics is to return the pair value, which means we need to restore
RAX to point to the pair: we must tag the value again.
mov RAX, <the pair> mov RDX, <the new first value> mov RCX, RAX ;; \ and RCX, 0x7 ;; | check to ensure cmp RCX, 0x1 ;; | RAX is a pair jne not_a_pair ;; / sub RAX, 0x1 ;; untag the value into a raw pointer mov [RAX + 8 * 0], RDX ;; perform the mutation add RAX, 0x1 ;; tag the pointer back into a value
set-snd, we do the same thing except use an offset
[RAX + 8 * 1].
The core ideas above carry through unchanged, we just need to modify the syntax to specify the index at which the tuple must be mutated:
The first expression should compute to a tuple, the second to a number, and the third to the value to assign.
Do not forget: what might go wrong? what should be checked dynamically?
Since we’re at it, we might want to consider adding mutable variables. Certain languages like OCaml do without: instead, mutable references are used. What would such mutable references look like? In essence, they are 1-tuple.
Apart from the syntax, is it convenient to support mutable references with tuples? What would be a more efficient way?
If we want to add support for mutable variables, also known as assignables, we simply
need to extend the syntax with a form
(set! x e), where
x is a variable name (which should
obviously be in scope!), and
e is the expression whose value we want to assign to
How would you compile
(set! x e)?
That’s right, it’s simple. We already have a way to access a variable (its location, be it in a register or on the stack),
so we just need to generate the appropriate
Now that we have mutation, we need to reconsider the ergonomics of our language. It’s rare that assigning to a field of a tuple or to a variable should be the only thing we want to compute: we likely want to mutate a field and keep going with our computation. These mutations therefore fit better into our language as statements to be executed for their side-effects, rather than as expressions to be evaluated for their answer. To achieve this, we might want to express the sequencing of multiple expressions, such that our program evaluates them all in order, but only returns the final result. We can add such concrete syntax easily:
How might we implement support for this feature? Which phases of the compiler necessarily change, and which could we avoid changing?
We have two options for how to proceed from here: either we add support for sequencing in the
compiler (adding a new
Seq constructor to expressions, and its compilation to assembly),
or we desugar sequencing into let-bindings.
Since we said that our language should evaluate subexpressions and then evaluate to the result of the last one,
(seq e1 e2)
intuitively should mean the same thing as
(let (DONT_CARE e1) e2)
Rather than create an explicit expression form, perhaps we could reuse the
Let form and use our intuitive definition as the actual
Technically, there are several ways to do this as well: we could make
in the parser directly, thereby avoiding the need to even declare that
we have a
Seq constructor. Another approach is to add the
Seq constructor to expressions,
and add a
desugar : expr -> expr function that takes care of all desugaring, e.g. replacing
Seq with the
Of course, desugaring must recur throughout the program, as all of our passes do, and
rewrite any sequence expressions it encounters. The advantage of this approach
is that our subsequent passes can completely ignore the new sequencing
expression form (they can throw an
InternalCompilerException instead), and
therefore no other part of our language is affected.
Alternatively, we could maintain two type declarations: one for source-level constructs (which would
include all forms accepted by the parser) and one for core constructs that are actually
compiled to assembly. Then the desugaring function would have type
desugar : src-expr -> core-expr.
One problem of encoding
letreserves stack space for its bound variable, while for
seqwe don’t need it. How could we avoid this?
To encode arbitrary-sized data such as lists using pairs, we need a way to mark the end of the list.
Many languages, like Scheme, use a
nil value for this.
A better way to provide recursive, arbitrary-sized data is to support
algebraic data types, such as OCaml’s type definitions, but a full treatment would take us to far for this course.
So for now, we’ll consider adding the
nil value to your language,
despite knowing that it is a
How would you represent it?
The best is to simply choose a
NULL pointer from C,
but tag it as a tuple. This will allow us to easily evaluate
(istuple nil) == true, while also ensuring that
guaranteed to be distinct from any valid tuple value in our language.
Now that we’ve added
nil, what new problems do we have to handle in our compilation of tuples everywhere? Do it.
Finally, now that we have (potentially-recursive) mutable data, write a few examples of programs that do something interesting, like sorting a list in place, or building a binary tree, etc. Rejoice in your newfound mastery over simple programs! What features should we add to our language next?