This is a semi-formal specification of the toy language used in this course, named Truss.

Lexically and syntactically, it’s somewhat similar to Rust, but semantically it’s a little more like Java.

Table of Contents


Lexical structure

The source code is assumed to be encoded in UTF-8; that means the alphabet of the lexical grammar is Unicode code points.

The source code is made up of tokens (which are meaningful) and whitespace (which is not). A lexer should produce as output only the meaningful tokens and ignore the whitespace.

The lexical grammar is as follows, using typical metalanguage conventions, and <angle brackets> for hand-wavey bits:

Source: (Whitespace? Token)* Whitespace? Eof
Eof:    <the actual end of the source file>

Whitespace:  ' ' | '\t' | '\n' | Comment
Comment:     "//" CommentChar* CommentEnd
CommentChar: <any character except '\n' or Eof>
CommentEnd:  ('\n' | Eof)<non-consuming>

Token:   Keyword | Symbol | Id | IntLit | StrLit

Keyword: "and" | "bool" | "break" | "else" | "false" |
         "fn" | "for" | "if" | "in" | "int" | "let" |
         "new" | "not" | "or" | "return" | "string" |
         "struct" | "true" | "while"

Symbol:  "=" | "+" | "-" | "*" | "/" | "%" | "<<" | ">>" |
         "<" | ">" | "<=" | ">=" | "==" | "!=" | "(" | ")" |
         "{" | "}" | "[" | "]" | ";" | ":" | "," | "."

Id:      IdStart IdCont*
IdStart: Alphabetic | '_' | '$'
IdCont:  IdStart | DecDigit

IntLit:    DecIntLit | HexIntLit | BinIntLit
DecIntLit: DecDigit (DecDigit | '_')*
HexIntLit: '0' ('x'|'X') HexDigit (HexDigit | '_')*
BinIntLit: '0' ('b'|'B') BinDigit (BinDigit | '_')*

StrLit:    '"' (StrChar | StrEscape)* '"'
StrChar:   <any character except '"', '\n', or Eof>
StrEscape: "\\" | "\"" | "\r" | "\n" | "\t"

DecDigit:   <'0' through '9' inclusive>
HexDigit:   DecDigit | <'a' through 'f' inclusive> | <'A' through 'F' inclusive>
BinDigit:   '0' | '1'
Alphabetic: <anything with the Unicode "Alphabetic" class>

Syntax

The syntactic grammar is as follows. The alphabet of this grammar is tokens, and tokens are informally represented either as 'strings' or as <Token Names>.

Program:    Decl*
Decl:       VarDecl | FuncDecl | StructDecl

VarDecl:    'let' Id '=' Exp ';'

FuncDecl:   'fn' Id '(' FuncArgs? ')' (':' Type)? BlockStmt
FuncArgs:   FuncArg (',' FuncArg)*
FuncArg:    Id ':' Type

StructDecl: 'struct' Id (':' Id)? '{' FieldDecls FuncDecl* '}'
FieldDecls: FieldDecl (',' FieldDecl)* ','?
FieldDecl:  Id ':' Type

Stmt:        BlockStmt | IfStmt | WhileStmt | ForStmt | ExpStmt |
             AssignStmt | ReturnStmt | LetStmt
BlockStmt:  '{' Stmt* '}'
IfStmt:     'if' Exp BlockStmt ('else' (BlockStmt | IfStmt))?
WhileStmt:  'while' Exp BlockStmt
ForStmt:    'for' Id 'in' Exp ',' Exp BlockStmt
ExpStmt:    Exp ';'
AssignStmt: Exp '=' Exp ';'
ReturnStmt: 'return' Exp? ';'
LetStmt:    VarDecl

Exp:       Term (BinOp Term)*
Term:      UnOp* Primary PostfixOp*
Primary:   Id | IntLit | StrLit | TrueLit | FalseLit | Ctor | '(' Exp ')'

BinOp:     '+'|'-'|'*'|'/'|'%'|'<<'|'>>'|'<'|'>'|'<='|'>='|'=='|'!='|'and'|'or'
UnOp:      '-'|'not'

PostfixOp: FuncCall | Field
FuncCall:  '(' (Exp (',' Exp)*)? ')'
Field:     '.' Id

Id:        <TokenKind::Id>
IntLit:    <TokenKind::IntLit>
StrLit:    <TokenKind::StrLit>
TrueLit:   'true'
FalseLit:  'false'
Ctor:      'new' Id '(' (Exp (',' Exp)*)? ')'

Type:       VoidType | 'bool' | 'int' | 'string' | FuncType | StructType
VoidType:   '(' ')'
FuncType:   'fn' '(' (Type (',' Type)*)? ')' ':' Type
StructType: Id

Operator precedence and associativity

All binary operators are left-associative. Assignment = is not a binary operator and cannot appear within expressions.

All unary operators (negation and logical not) are right-associative. All postfix operators (calls and field access) are left-associative.

Precedence is as follows, with 1 being the highest precedence (evaluated first):

Operators Precedence
f(x), f.x 1
-x, not x 2
* / % 3
+ - 4
<< >> 5
< <= > >= == != 6
and 7
or 8

Syntactic enforcement of return statements

To simplify the semantic analysis phases, Truss requires that non-void functions return values using a syntactic check. There are three parts to this:

  1. A statement is said to be a returning statement if:
    • it is a return statement itself
    • it is a { block statement } whose last statement is a returning statement
    • it is a while or for statement whose code block is a returning statement
    • it is an if statement, with an else clause, and both the “then” side and the “else” side are returning statements.
  2. If a returning statement appears within a { block statement }, it must be the last statement in the block.
    • e.g. { return; print("hi"); } is syntactically invalid code.
  3. Any function whose syntactic return type is anything other than () must have a code block that is a returning statement.
    • functions without a declared return type are also considered ().

These rules may give false errors, but they will never allow a function that must return a value to fail to do so. For example, this would be flagged as invalid code and would require a return statement after the loop:

fn bad(): int {
    while true {
        if someCondition() {
            return 0;
        }
    }
}

Examples of returning statements:

// return statement
return;

// block statement whose last statement is returning
{ return; }

// ditto
{
    print(5);
    print(10);
    return;
}

// ditto
{
    {
        {
            return;
        }
    }
}

// while statement whose code is returning
while condition() {
    print(5);
    return;
}

// for statement whose code is returning
for i in 0, 10 {
    return;
}

// if-else where both clauses are returning
if condition() {
    return 5;
} else {
    return 10;
}

// any depth of nesting is fine
if condition() {
    while condition() {
        if condition() {
            return 5;
        } else {
            return 10;
        }
    }
} else {
    print(5);
    return 15;
}

These are not returning statements:

// block statement whose last statement is not returning
{ print(5); }

// if with no else (even though the "then" clause is returning)
if condition() {
    return 10;
}

// if-else where only one side is returning
if condition() {
    print(5);
} else {
    return;
}

// while statement that contains no returning statements
while condition() {
    if x {
        return;
    }

    doSomething();
}

Semantics

A Truss source code file consists of a sequence of function, global variable, and structure declarations. This area outside any function is called “top-level.”

Global variables

A variable declared at top-level is a global variable.

let glob = 0;

Global variables’ initializers must be syntactically constant values. So 0 is okay, but 1 + 1 is not, even though it could be statically evaluated to a constant.

At runtime, global variables are mutable.

Local variables

A variable declared within a function is a local variable.

fn test() {
    let x = 10;
}

Local variables’ initializers can be any expression as they are evaluated at runtime. They are also mutable at runtime.

Scoping

The top-level comprises a global scope; the order of declarations does not matter, and any function can access anything from the global scope.

fn test() {
    // it's fine to access a global variable declared later
    print(glob);
}

fn caller() {
    // calling another function from the global scope
    test()
}

let glob = 10;

Local variables are visible from the statement after their declaration, until the closing brace of the enclosing { block statement }.

It is an error for two local variables in the same { block statement } to have the same name:

fn bad() {
    let x = 10;
    let x = 20; // error
}

However, it is not an error to declare a local variable of the same name in a nested block statement:

fn good() {
    let x = 10;

    while true {
        let x = 20; // a new variable x; shadows the other
        print(x);   // prints 20
    }
}

It is also possible to declare a local variable of the same name as a global variable; in that case, the global variable is shadowed by the local.

Name resolution

All names are resolved starting at the scope in which the reference occurs and moving outward to the enclosing scopes, ending with the top-level. The first name found in this traversal is what the name refers to.

This applies even in cases where the name could otherwise unambiguously refer to a specific name, such as this case:

struct Point { x: int, y: int }
fn bad() {
    let Point = 10;      // shadows the global Point
    let p = new Point(); // error: Point is not a type
}

Type system

A summary of types and their descriptions:

Type Name Description
() “void” type; absence of a value
bool boolean truth value; can be true or false
int signed 2’s complement 32-bit integer
string immutable sequence of codepoints; by reference
fn(): () function pointer type; by reference
S user-defined structure type; by reference

Below are more detailed descriptions as well as what operations can be performed on variables of these types.

Typing rules


Functions

Functions can appear in two places: at top-level and within struct declarations. The former are called free functions; the latter are called methods and have some extra features which are discussed in the section on structures. All functions otherwise behave similarly.

Function calls work as in any other mainstream language. There is a single call stack onto which an activation record is pushed when the function is called. That activation record contains the values of the local variables declared within that function, local to that particular activation. This is so recursive functions can work without separate activations overwriting each others’ local variables. When the function returns (completes execution), its activation record is removed from the call stack.

Functions can take any number of arguments, and they are passed by value; that is, functions cannot modify local variables of the caller (or any other function on the call stack). Functions can optionally return a value.

There is no ad-hoc function overloading. It is an error to have two functions of the same name.

It is possible to have a variable which is a function pointer. For example:

fn test() {
    println("test");
}

fn main() {
    test();       // call it directly
    let p = test; // p is a function pointer
    p();          // calls test indirectly
}

Since test is of type fn(): (), so is p; therefore p could be reassigned to point to any function of the same type.


Structures

Structures are user-defined compound types that support a simple form of inheritance and subtype polymorphism. Despite their name, they have much more in common with Java’s classes than Rust’s structs. Sue me.

References

All structures are dynamically allocated on the heap, and all variables of structure type are references to these heap-allocated objects.

Fields

A structure can have 0 or more fields. These are the “instance variables” that are allocated per-instance of the structure.

A structure may not have a field named constructor, as that would interfere with the constructor mechanism.

All structures have one hidden field that is used to point to the virtual method table and runtime type identification info. The interpretation of this field is described in the runtime representation section. Because of this hidden field, a struct with 0 user-defined fields is not 0-sized.

Allocation and initialization

Struct instances are allocated with a new S() expression. There are four steps to this:

  1. Enough space to hold the instance’s fields is allocated on the heap.
  2. The instance’s fields are initialized to all 0 bits.
    • This means any fields of reference type will contain a special value, “null.”
    • Performing any operation on a null value is a runtime error and will cause the program to halt.
    • (There is no way to write a null value in this language, which is a weird omission. FUTURE WORK.)
  3. If a method named constructor exists in the struct, it will be called on the new instance using the arguments passed in the parentheses (see below about how methods work).
    • If there is no such method, this step is skipped.
  4. The expression evaluates to the pointer to the instance that was just allocated.

Methods

A structure can have 0 or more methods. These are functions declared within the struct body. They are different from free functions in one important regard: they have a hidden argument named this whose type is the structure the method appears within.

So in this code, within the method meow, this refers to the instance of Cat on which the method was called:

struct Cat {
    fn meow() {}
}

In this code, the object c will be passed as the this argument to meow:

let c = new Cat();
c.meow();

If a method is named constructor, it takes on a special meaning: it will be called when a new instance of this struct is allocated using new. For example:

struct S {
    name: string,

    fn constructor(name: string) {
        this.name = name;
    }
}

fn main() {
    let s = new S("Bob");
}

The last line does something like this invalid code:

    let s = allocate S;
    s.constructor("Bob");

Struct inheritance

A struct can optionally inherit from another struct, called its base type. When it does so, it becomes a derived type. For example:

struct Animal {
    species: string
}

// Cat is a derived type of Animal
struct Cat: Animal {
    num_whiskers: int
}

A derived type:

The last point means structs have virtual methods which are explained in detail in a subsequent section.

It is an error if the inheritance hierarchy is anything other than a tree; i.e. no cycles are allowed:

// bad!
struct S: T {}
struct T: S {}

Inherited fields

Derived structs inherit all the data fields of their base type. There are not multiple hidden fields; instead the derived struct’s hidden field replaces the base type’s.

It is an error for the derived struct to have a field of the same name of any of its base types. All field names must be unique.

Inherited methods and Virtual methods

Derived structs inherit all the methods of their base type. However, those methods will still operate on instances of the derived struct as if they are an instance of the base struct.

It is an error for the derived struct to have a method of the same name but different type signature as any of its base types. However, if a method has the same name AND type signature, this is then considered an overridden virtual method.

When an overridden virtual method is called on an instance of a base type, the actual implementation to use is decided at runtime using the struct instance’s hidden field. This is called dynamic dispatch. Consider this code:

struct Base {
    fn print() { println("base print"); }
}

struct D1: Base {
    fn print() { println("D1 print"); }
}

struct D2: Base {
    fn print() { println("D2 print"); }
}

fn print(b: Base) {
    b.print(); // performs dynamic dispatch
}

fn main() {
    print(new Base()); // prints "Base print"
    print(new D1());   // prints "D1 print"
    print(new D2());   // prints "D2 print"
}

Evaluation rules

These are the rules for how every piece of code is evaluated (or executed) at runtime.

Expression evaluation

Statement evaluation

Expression Statements (f();)

The expression is evaluated for its side effects. The expression must have type (). That is, it is an error for an expression that gives a value to be used as an expression statement.

fn test(): int { return 5; }

fn main() {
    "hello"; // invalid; has type string
    test();  // invalid; has type int
}

Assignment Statements (x = y;)

In the above, x is the destination expression and y is the source expression.

It is an error if the destination is not one of:

Evaluation is as follows:

  1. The destination is evaluated up to the last step.
  2. The source is evaluated fully.
  3. The value from the source is assigned into the destination by finishing evaluation of the destination.

This sounds a bit confusing so here is an example. Given this code:

f(x, y, z).w = g();

The steps occur in this order, where the $name variables are temporaries:

$dst = f(x, y, z);
$src = g();
$dst.w = $src;

Let Statements (let x = y;)

These work just like an assignment, but since the destination is just a simple variable, there is nothing to evaluate before evaluating the source.

Block Statements ({ ... })

The statements in the block (if any exist) are executed sequentially, from top to bottom.

If Statements (if x { ... })

First, the condition is evaluated.

If the condition is true, the “then” block (which comes right after the condition) is executed.

If the condition is false and there is an else statement, that is executed instead.

If the condition is false and there is no else statement, execution is complete.

While Statements (while x { ... })

  1. the condition is evaluated.
  2. if the condition is false, execution is complete.
  3. if the condition is true, the body is executed, and return to step 1.

For Statements (for i in 0, 10 { ... })

  1. the lower bound is evaluated and assigned into the counter variable.
  2. the upper bound is evaluated and assigned into a temporary (call it upper).
  3. if the counter variable ≥ upper, execution is complete.
  4. otherwise:
    • the body is executed
    • the counter is incremented by 1
    • return to step 3

Return Statements (return;, return x;)

It is an error for the type of value being returned to differ from the return type of the function it appears in.

  1. if there is a value to return, it is evaluated.
  2. the current function’s execution is terminated and control returns to the caller.

Runtime representation

This documents the reference implementation’s runtime representation; other implementations are free to change the details of this as long as the resulting program still behaves identically.

Memory

There are three areas of memory:

Heap memory management

The heap is ostensibly garbage-collected using a tracing algorithm. However the reference implementation uses a null garbage collector: no memory is ever freed and reused. While clearly inefficient, it is correct, trivial to implement, and it’s just fine for a little toy language that’s gonna be used to write tiny toy programs and nothing else okay

Structures

Each structure has a memory layout that looks like:

Fields are offset from the beginning of the struct and are aligned as follows:

The size of the entire struct is padded up to the next multiple of the largest alignment of any field.

So for these structures:

struct S    { x: int }
struct T: S { y: bool, z: bool }
struct U: T { w: string }

Their layouts are as follows, assuming a pointer is 4 bytes:

S’s fields Offsets
(hidden) 0
x 4
(struct size) 8
T’s fields Offsets
(hidden) 0
x 4
y 8
z 9
(struct size) 12
U’s fields Offsets
(hidden) 0
x 4
y 8
z 9
w 12
(struct size) 16

Virtual Method tables (vtbls)

The hidden field of each struct instance points to a virtual method table (vtbl) which is statically constructed by the compiler. This table lists all methods in the struct, in the order they were declared. (Static method dispatch is an optimization which this implementation does not perform.)

Derived structs’ vtbls are prefixed by their base struct’s vtbl. This way, the same methods appear at the same offsets in the vtbls.

If a derived struct overrides a base struct’s implementation of a method, its vtbl will replace that method’s entry, rather than adding a new entry to the vtbl.

So for these structures:

struct B {
    fn constructor() {}
    fn toString(): string { return "B"; }
}

struct D: B {
    fn scream() { println("AHHHHH!"); }
}

struct O: B {
    fn toString(): string { return "O"; }
}

Their vtables would be as follows (where Struct.name means “the method name as declared in Struct).

B’s vtbl:

Index Method
0 B.constructor
1 B.toString

D’s vtbl:

Index Method
0 B.constructor
1 B.toString
2 D.scream

O’s vtbl:

Index Method
0 B.constructor
1 O.toString

Method calls

The code o.method() performs a function call (as described in the next section), but with two extra caveats:

  1. First the method is looked up in the vtbl
  2. Then the method is called, but the receiver is passed as a hidden first argument

The lookup works as follows:

  1. the hidden field is loaded from the receiver
  2. the appropriate method is indexed out of the vtbl

So o.method() will do something like:

$receiver = o; // o could be any arbitrary expression, so it's evaluated once
$method = $receiver.vtbl[index]; // based on the name of the method
$method($receiver, other arguments);

Function call ABI (MIPS implementation)

The ABI is simplified from the standard MIPS ABI to make codegen simpler.

All arguments are passed on the stack. Methods receive their this argument as their first argument.

All locals are also stored on the stack.

The return value is always returned in register v0.

The following registers are considered scratch registers usable for any purpose: v1, a0-a3, t0-t9. Their values may change across a function call.

The following registers are considered saved registers: s0-s7. Their values must remain the same across a function call. Callees are responsible for saving and restoring their values, if they wish to use them.

1. Caller calls a function

  1. The caller makes space on the stack for the function’s arguments.
    • it does this by subtracting 4 * (the number of arguments) from sp.
  2. It evaluates the arguments, storing their results into those stack slots.
    • The arguments are stored in ascending order based on sp
    • so the first argument is in 0(sp), the second in 4(sp), the third in 8(sp) etc.
  3. it calls the function directly or indirectly:
    • when calling free functions directly, jal is used.
    • when calling struct methods (where the method is indexed out of the vtbl), or when calling function pointer variables, jalr is used.

2. Function prologue

When the callee begins execution, the arguments are already on the stack thanks to the caller, so sp is pointing to the first argument.

  1. The current value of fp (the frame pointer) is stored at -4(sp).
  2. The current value of ra (the return address) is stored at -8(sp).
  3. The current value of sp is moved into fp.
  4. The current values of any s registers the function uses are stored at negative offsets from fp below any locals.
  5. Last, space is made on the stack for the locals and saved registers by decrementing sp like so:
    • each local gets 1 stack slot.
    • fp and ra each get 1 stack slot.
    • each s register needed to be saved gets 1 stack slot.
    • so, sp is decremented by the 4 * (the sum of all needed stack slots).
      • e.g. if there are 3 locals and 2 s registers are used, plus the 2 slots for fp and ra, then sp is decremented by 4 * (3 + 2 + 2) = 28.

This sets up the stack frame as follows (assuming a function with 3 arguments, 3 locals, and 2 saved registers):

 Value       Accessed as...
-----------+-----------------
 arg 3     | 8(fp)
 arg 2     | 4(fp)
 arg 1     | 0(fp)  <------------- fp
 stored fp | -4(fp)
 stored ra | -8(fp)
 local 3   | -12(fp)
 local 2   | -16(fp)
 local 1   | -20(fp)
 stored s1 | -24(fp)
 stored s0 | -28(fp)  <------------- sp

After this, the function’s code can execute.

fp is used for both because it does not move, simplifying codegen. sp will move during execution of the function, such as when calling other functions.

3. Function epilogue

If a return statement is executed (and the function returns a value), it will place that value in v0.

No matter how the function exits, the prologue will then execute:

  1. Any s registers are loaded from -24(fp), -28(fp) etc.
  2. ra is loaded from -8(fp).
  3. fp is loaded from -4(fp).
  4. sp is incremented enough to completely clean the stack, including the arguments.
    • so in this example, it will be incremented by 40 bytes.
  5. jr ra is executed to return to the caller.

4. Caller resumes execution

Because the callee cleaned the stack, the caller does not have to do anything other than use the return value in v0 (if there is one).