Abstract syntax trees are an ubiquitous data structure in language developement. However many times I ended up having to duplicate type definitions to fit the needs of different parts of the compiler. I recently started writing a proof assistant in Rust and quickly realized I needed different AST types for parsing, type checking, term unification, etc. This is known as the expression problem.
The (simplified) "base" AST looks like this:
enum Term {
// variable
Var(String),
// function application
App(Box<Term>, Box<Term>),
// function definition
Fun {
argument: String,
body: Box<Term>,
},
}
But very quickly arises the need for an AST containing type information:
enum TypedTermValue {
Var(String),
App(Box<TypedTerm>, Box<TypedTerm>),
Fun {
argument: String,
body: Box<TypedTerm>,
},
}
struct TypedTerm {
value: TypedTermValue,
type_: Type,
}
Or even an AST specialized for unification, in which terms can be inference variables:
enum UnificationTerm {
InferenceVar(usize),
Var(String),
App(Box<UnificationTerm>, Box<UnificationTerm>),
Fun {
argument: String,
body: Box<UnificationTerm>,
},
}
This duplication can become quite cumbersome when you change your AST, as you have to change many almost-equal definitions with only slight adjustments from the base AST. Notice how he main pain point comes from the recursive-ness of the type, which means we cannot just compose Term
in another type to add more information, as that would only add data to the root.
A simple and elegant solution would be to use higher kinded types, which would allow us to specify the type of subterms.
You can think of kinds as "types for types". Concrete types such as String
have kind type
but generic types can be thought of as functions taking types and returning types, for instance Vec: type -> type
(Vec<T>
) or HashMap: (type, type) -> type
(HashMap<K, V>
). This feature exists in Haskell but is (still) missing in Rust. Higher kinded types allow us to take generic types as generic type arguments.
We could then make Term
generic over an argument Rec
of kind type -> type
.
enum Term<Rec> {
Var(String),
// notice how we specialize `Rec` with `Term<Rec>`
App(Box<Rec<Term<Rec>>>, Box<Rec<Term<Rec>>>),
Fun {
argument: String,
body: Box<Rec<Term<Rec>>>,
},
}
Sadly, this is not currently possible in Rust. If we try to do this, the compiler will complain:
error[E0109]: type arguments are not allowed on type parameter `Rec`
--> src/main.rs:3:17
|
3 | App(Box<Rec<Term<Rec>>>, Box<Rec<Term<Rec>>>),
| --- ^^^^^^^^^ type argument not allowed
| |
| not allowed on type parameter `Rec`
|
note: type parameter `Rec` defined here
--> src/main.rs:1:11
|
1 | enum Term<Rec> {
| ^^^
Luckily, Rust still has a feature in its type system we can use: trait associated types. We define a trait that will act as a type family:
trait TermRec {
type Rec;
}
We can then define our base AST like this, taking our original base AST and replacing recursive fields with R::Rec
:
enum RawTerm<R: TermRec> {
Var(String),
App(Box<R::Rec>, Box<R::Rec>),
Fun {
argument: String,
body: Box<R::Rec>,
},
}
However now our RawTerm
type is not directly usable, we have to specialize it. To do that we introduce a (zero sized) marker type, implement TermRec
on it with the current Rec
associated type:
// our marker type
struct Base;
impl TermRec for Base {
type Rec = RawTerm<Base>;
}
type Term = RawTerm<Base>;
Let's break down what happens when we specialize RawTerm
with Base
. We can replace R::Rec
with RawTerm<Base>
:
enum RawTerm<Base> {
Var(String),
App(Box<RawTerm<Base>>, Box<RawTerm<Base>>),
Fun {
argument: String,
body: Box<RawTerm<Base>>,
},
}
But recall we have introduced a type alias, so we can use that to make the definition more readable:
enum Term {
Var(String),
App(Box<Term>, Box<Term>),
Fun {
argument: String,
body: Box<Term>,
},
}
That's exactly our first definition of Term
! We can use this trick to define our typed AST, where we add an extra piece of data to every node of the tree:
struct Typed;
impl TermRec for Typed {
type Rec = TypedTerm;
}
struct TypedTerm {
value: RawTerm<Typed>,
type_: Type,
}
Let's define an AST specialized for unification, our base AST with a variant added representing inference variables:
struct Unifying;
impl TermRec for Unifying {
type Rec = UnificationTerm;
}
enum UnificationTerm {
Var(usize),
Const(RawTerm<Unifying>),
}
Our base AST is now defined only once and the difference between Term
, TypedTerm
, UnificationTerm
is much clearer.
where
clauses can be used to assert that a given type implements a given trait by adding a where
clause to a non-generic function.
This is known as trivial constraints.
For instance, this code compiles because each type implements its corresponding trait:
fn _test()
where String: Clone,
Vec<i32>: Default {}
However if you try putting an unsatisfied contraint:
fn _test()
where String: Clone,
Vec<i32>: Default,
Box<i32>: Copy {}
The snippet will no longer compile:
error[E0277]: the trait bound `Box<i32>: Copy` is not satisfied
--> src/main.rs:4:11
|
4 | Box<i32>: Copy {}
| ^^^^^^^^^^^^^^ the trait `Copy` is not implemented for `Box<i32>`
|
= help: see issue #48214
We'll use this in the next part.
You may notice I never put #[derive(...)]
in any snippet before. This is not just to make the snippet simpler.
Let's try adding a #[derive(Clone)]
attribute to RawTerm<R>
and then assert Term
implements Clone
using our trick from earlier:
#[derive(Clone)]
enum RawTerm<R: TermRec> {
...
}
...
fn _test()
where Term: Clone {}
But as it turns out, derive
attributes currently do not mix well with associated types:
--> src/main.rs:26:11
|
26 | where Term: Clone
| ^^^^^^^^^^^ the trait `Clone` is not implemented for `Base`
|
note: required for `RawTerm<Base>` to implement `Clone`
--> src/main.rs:7:10
|
7 | #[derive(Clone)]
| ^^^^^ unsatisfied trait bound introduced in this `derive` macro
= help: see issue #48214
help: consider annotating `Base` with `#[derive(Clone)]`
|
17 + #[derive(Clone)]
18 | struct Base;
|
The compiler wants us to implement Clone
on the marker type, in that case Base
.
This is because #[derive(Clone)]
generates an impl
that looks like this:
impl<R: TermRec + Clone> Clone for RawTerm<R> { ... }
However this is not correct. This is an unfortunate limitation of this method,
derive
macros don't work and we need to implement all traits manually.
To do that, we need to add a trait bound to the associated type:
trait TermRec {
type Rec: Clone;
}
And then implement Clone
manually:
impl<R: TermRec> Clone for RawTerm<R> {
fn clone(&self) -> Self {
match self {
Self::Var(arg0) => Self::Var(arg0.clone()),
Self::App(arg0, arg1) => Self::App(arg0.clone(), arg1.clone()),
Self::Fun { argument, body } => Self::Fun { argument: argument.clone(), body: body.clone() },
}
}
}
This is a bit annoying since you will need to update "obvious" trait implementations when changing the AST definition. However, you can do that pretty quickly thanks to rust-analyzer by just writing the header:
impl<R: TermRec> Clone for RawTerm<R> {
}
And then using the "implement missing members" code action in your IDE.
We now have a way to deduplicate AST definitions! Arguably, this approach to the expression problem introduces a bit of complexity to your types, but I personally think the overall benefit of having very concise derived ASTs outweights this initial mental overhead.