Mutation

8.12

Mutation🔗

Based on these notes by Ben Lerner. Adding mutable variables, changes related to concrete syntax, and simplified to skip static typing and ANF.

1 Mutation🔗

Standard reminder: Every time we enhance our source language, we need to consider several things:

Its impact on the concrete syntax of the language
Examples using the new enhancements, so we build intuition of them
Its impact on the abstract syntax and semantics of the language
Any new or changed transformations needed to process the new forms
Executable tests to confirm the enhancement works as intended

2 Mutating pairs in our language🔗

2.1 Syntax and examples🔗

We’ll extend our treatment of pairs with a way to assign each component:

‹expr›: ... | ( set-fst ‹expr› ‹expr› ) | ( set-snd ‹expr› ‹expr› )

The two new expression forms allow us to modify the first or second item of the pair given by the first subexpression, and set it to the value of the second subexpression. We need to decide on what these expressions will return; let’s choose to make them return the tuple itself.

(let (x (tup 4 5)) (set-fst x 10))             # ==> (10, 5)
(let (y (tup 3 2)) (set-fst (set-snd y 8) 6))  # ==> (6, 8)

Do Now!
Flesh out the semantics of these two new expressions. What new error conditions might there be?

We can add these expression forms to our AST easily enough. This time, the new expressions will be binary primitives:

type expr = ...
  | Pair of expr * expr
type prim1 = ... | Fst | Snd
type prim2 = ... | Setfst | Setsnd

2.2 Compilation🔗

Given our representation of pairs (from Adding pairs to our language), it’s relatively straightforward to compile these two new primitives. We’ll focus on set-fst; the other is nearly identical. We first compile both subexpressions, and load them into two registers RAX and RDX. We ensure that the first argument is a pair by checking its tag, and if so then we untag the value by subtracting the tag. Now the content of the pair begins at [RAX], and the first item of the pair is located at offset zero from RAX: we simply move the second value into that location. However, our intended semantics is to return the pair value, which means we need to restore RAX to point to the pair: we must tag the value again.

mov RAX, <the pair>
mov RDX, <the new first value>
mov RCX, RAX           ;; \
and RCX, 0x7           ;; | check to ensure
cmp RCX, 0x1           ;; | RAX is a pair
jne not_a_pair         ;; /
sub RAX, 0x1           ;; untag the value into a raw pointer
mov [RAX + 8 * 0], RDX ;; perform the mutation
add RAX, 0x1           ;; tag the pointer back into a value

To compile set-snd, we do the same thing except use an offset address of [RAX + 8 * 1].

3 Generalizing to tuples🔗

The core ideas above carry through unchanged, we just need to modify the syntax to specify the index at which the tuple must be mutated:

‹expr›: ... | ( set ‹expr› ‹expr› ‹expr› )

The first expression should compute to a tuple, the second to a number, and the third to the value to assign.

Do Now!
Do not forget: what might go wrong? what should be checked dynamically?

4 Mutable variables🔗

Since we’re at it, we might want to consider adding mutable variables. Certain languages like OCaml do without: instead, mutable references are used. What would such mutable references look like? In essence, they are 1-tuple.

Do Now!
Apart from the syntax, is it convenient to support mutable references with tuples? What would be a more efficient way?

If we want to add support for mutable variables, also known as assignables, we simply need to extend the syntax with a form (set! x e), where x is a variable name (which should obviously be in scope!), and e is the expression whose value we want to assign to x.

Do Now!
How would you compile (set! x e)?

That’s right, it’s simple. We already have a way to access a variable (its location, be it in a register or on the stack), so we just need to generate the appropriate mov instruction.

5 Sequencing🔗

Now that we have mutation, we need to reconsider the ergonomics of our language. It’s rare that assigning to a field of a tuple or to a variable should be the only thing we want to compute: we likely want to mutate a field and keep going with our computation. These mutations therefore fit better into our language as statements to be executed for their side-effects, rather than as expressions to be evaluated for their answer. To achieve this, we might want to express the sequencing of multiple expressions, such that our program evaluates them all in order, but only returns the final result. We can add such concrete syntax easily:

‹expr›: ... | ( seq ‹expr› ... ‹expr› )

Do Now!
How might we implement support for this feature? Which phases of the compiler necessarily change, and which could we avoid changing?

We have two options for how to proceed from here: either we add support for sequencing in the compiler (adding a new Seq constructor to expressions, and its compilation to assembly), or we desugar sequencing into let-bindings.

Since we said that our language should evaluate subexpressions and then evaluate to the result of the last one,

(seq e1 e2)

intuitively should mean the same thing as

(let (DONT_CARE e1) e2)

Rather than create an explicit expression form, perhaps we could reuse the existing Let form and use our intuitive definition as the actual definition.

Technically, there are several ways to do this as well: we could make seq disappear in the parser directly, thereby avoiding the need to even declare that we have a Seq constructor. Another approach is to add the Seq constructor to expressions, and add a desugar : expr -> expr function that takes care of all desugaring, e.g. replacing occurrences of Seq with the Let-based expansion.

Of course, desugaring must recur throughout the program, as all of our passes do, and rewrite any sequence expressions it encounters. The advantage of this approach is that our subsequent passes can completely ignore the new sequencing expression form (they can throw an InternalCompilerException instead), and therefore no other part of our language is affected.

Alternatively, we could maintain two type declarations: one for source-level constructs (which would include all forms accepted by the parser) and one for core constructs that are actually compiled to assembly. Then the desugaring function would have type desugar : src-expr -> core-expr.

Do Now!
One problem of encoding seq using let is that let reserves stack space for its bound variable, while for seq we don’t need it. How could we avoid this?

6 Arbitrary-sized data and the infamous nil🔗

To encode arbitrary-sized data such as lists using pairs, we need a way to mark the end of the list. Many languages, like Scheme, use a nil value for this.

A better way to provide recursive, arbitrary-sized data is to support algebraic data types, such as OCaml’s type definitions, but a full treatment would take us to far for this course. So for now, we’ll consider adding the nil value to your language, despite knowing that it is a very expensive mistake.

Do Now!
How would you represent it?

The best is to simply choose a NULL pointer from C, but tag it as a tuple. This will allow us to easily evaluate (istuple nil) == true, while also ensuring that nil is guaranteed to be distinct from any valid tuple value in our language.

Do Now!
Now that we’ve added nil, what new problems do we have to handle in our compilation of tuples everywhere? Do it.

Exercise
Finally, now that we have (potentially-recursive) mutable data, write a few examples of programs that do something interesting, like sorting a list in place, or building a binary tree, etc. Rejoice in your newfound mastery over simple programs! What features should we add to our language next?

1	Mutation
2	Mutating pairs in our language
3	Generalizing to tuples
4	Mutable variables
5	Sequencing
6	Arbitrary-sized data and the infamous nil