In this project, you’ll make (the important parts of) a lexer for a simple C/Rust-like language I’m gonna call Truss because it’s like Rust but rearranged, ha ha ha

Since most of you are brand new to Rust, I’ve already done all the boring parts - defining the tokens and the errors, writing the tests, writing the code around the lexer - and now you’ll write the bits that… lex.

Quick start guide:

  1. If you don’t have one, create an account on GitHub.
    • If you already have an account, you don’t need to make a new one.
  2. You should have received a GitHub invitation link in your email. Click it!
    • Now you should have access to your proj1 repo.
  3. git clone it to your computer.
    • If you don’t have git installed, GitHub has tutorials for that.
  4. Edit the src/lexer.rs file to make the lexer recognize the specified tokens. (Read below.)
  5. As you work, make use of git commits to keep track of your progress.
    • Stage your changed files with git add
      • then commit them with a descriptive message: git commit -m "message"
    • I like to think of commits as “checkpoints”. Work on one step, finish it, commit, repeat.

    You can also use git push origin main at any time to ensure that there’s a safe copy of your project on the GitHub servers, or to transfer your code between computers (by pushing it from one and pulling it on the other).

  6. When you’re done (or when you’re not done but it’s due):
    1. Update the README.txt to include your name and username and any information you think will be useful to the grader.
    2. Also update this line in Cargo.toml:
      • authors = ["YOUR NAME <YOUR PITT EMAIL>"]
    3. Commit those changes, and…
    4. git push origin main to submit it.
      • the last commit before the due date is what we will grade.

The starting point

If you cargo run right now, you’ll get some warnings about unused methods (that’s fine). Then you’ll be given a prompt to type code in. Right now, the lexer only recognizes three things:

Any other characters will give an error.

This interactive lexer functionality is implemented in src/main.rs; you won’t have to change that file but it might be a useful program for you when testing your code.

A tour of the rest of the source:

Running the tests

In src/lex/tests.rs is a set of test functions. Each function there marked with #[test] can be run with the command cargo test. Run that now, and you’ll see something like:

$ cargo test
...
running 11 tests
test lex::tests::comments ... ok
test lex::tests::all_together ... FAILED
test lex::tests::complex_symbols ... FAILED
test lex::tests::bad_numbers ... FAILED
test lex::tests::good_numbers ... FAILED
test lex::tests::identifiers ... FAILED
test lex::tests::invalid_chars ... ok
test lex::tests::simple_symbols ... FAILED
test lex::tests::whitespace ... ok
test lex::tests::string_literals ... FAILED
test lex::tests::keywords ... FAILED

followed by detailed error messages about the tests that failed.

Your goal is to make all the tests happy! Yay!

Reading the tests and their errors

Let’s look at the complex_symbols test:

#[test]
fn complex_symbols() {
    test_ok("= == === !===", vec![Assign, Eq, Eq, Assign, NotEq, Eq, Eof]);
    test_ok("< << <= <== > >> >= >==", vec![
            Less, Shl, LessEq, LessEq, Assign,
            Greater, Shr, GreaterEq, GreaterEq, Assign, Eof]);
    test_err("!", LexErrorKind::InvalidChar, (1, 1));
}

test_ok means it’s supposed to lex successfully and give that sequence of tokens.

test_err means it’s supposed to fail and give an error at the given location: here, (1, 1) means “line 1, column 1”, which is the very first character.

Now let’s run this test. You can run a single test by naming it on the command line:

$ cargo test complex_symbols

It produces this output:

---- lex::tests::complex_symbols stdout ----
thread 'lex::tests::complex_symbols' panicked at '
on '= == === !===', expected tokens '[Assign, Eq, Eq, Assign, NotEq, Eq, Eof]', but your code gave an error:
	LexError { loc: Location { filename: "<string>", line: 1, col: 1 }, kind: InvalidChar }', src/lex/tests.rs:153:13

So you can see that it failed on the code that looks like '= == === !==='; it expected your code to give a sequence of tokens, but instead it got an InvalidChar error at line 1, column 1.

All the test failure output looks like this (showing the input that failed, the expectation, and the reality) so you can narrow down the problem more easily.

The helper functions I’ve given you

Have a look at src/lex/lexer.rs:

The Lexer object is simple: it has an iterator over the input source code (self.iter) and the output tokens (self.tokens). But you won’t have to touch these for the most part. Instead there are some helper methods:

You will not change any of those, or these methods I’ve given you: lex, new, lex_all or eat_whitespace_and_comments. Instead, you will complete the last four methods: next_token, lex_ident, lex_string, and lex_integer.


Let’s implement Lexer::next_token!

First, read lex_all, as this is the method that calls next_token. You’ll see it basically works like:

Now, read eat_whitespace_and_comments, because your lexing code is going to look similar (but without the loop). It does match self.cur() to see what the current character is, and the first case is:

' ' | '\t' | '\n' => { self.next(); }

This line says:

Now, look at what little code is in next_token’s match statement.

// EOF
'\0' => Ok(TokenKind::Eof),

// TODO: every other kind of token

// anything else is invalid.
_ => Err(LexError::invalid_char(self.iter.loc()))

The first case shows how to return a valid token. You put Ok(the value you want to return).

The second case shows how to report errors. You put Err(LexError::whatever(location)). Here it’s using the current location (self.iter.loc()) but not every error will use the current location.

Below are tables of the tokens you will implement, in the order I recommend you implement them.


Simple symbols (test: simple_symbols)

Try implementing + by adding a case inside that match:

'+' => { self.next(); Ok(TokenKind::Plus) }

It’s important to move past the character with self.next() or else lex_all will complain about being stuck!

You can test this out by doing cargo run and using the interactive prompt to type some plus signs:

>> +++
Tokens:
   Token { span: 0..1, kind: Plus }
   Token { span: 1..2, kind: Plus }
   Token { span: 2..3, kind: Plus }
   Token { span: 3..3, kind: Eof }

Neat!

Now implement the rest of these symbols the same way:

Token Enum value
+ TokenKind::Plus
- TokenKind::Minus
* TokenKind::Times
/ TokenKind::Divide
% TokenKind::Modulo
( TokenKind::LParen
) TokenKind::RParen
{ TokenKind::LBrace
} TokenKind::RBrace
[ TokenKind::LBracket
] TokenKind::RBracket
; TokenKind::Semi
: TokenKind::Colon
, TokenKind::Comma
. TokenKind::Dot

After implementing these, the simple_symbols test should now pass. Remember you can run just that test with cargo test simple_symbols.


Harder symbols (test: complex_symbols)

These are symbols which can be multiple characters, or where there is ambiguity unless you look at the second character.

Token Enum value
= TokenKind::Assign
== TokenKind::Eq
!= TokenKind::NotEq
< TokenKind::Less
<= TokenKind::LessEq
<< TokenKind::Shl
> TokenKind::Greater
>= TokenKind::GreaterEq
>> TokenKind::Shr

Identifiers (test: identifiers)

Identifiers are names of variables, functions, classes, whatever. Here are the rules:

These are what the is_ident_start and is_ident_cont functions test for.

These are a bit more complex to implement, but not by much.


Keywords (test: keywords)

Keywords are the words that have special meaning to the programming language.

Once you’ve implemented identifiers, keywords are easy. In Rust, you can match on strings as well as characters.

You’ll modify your lex_ident method. At the end where you construct and return the TokenKind::Id, replace it with this:

Ok(match that_string.as_str() {
    "if" => ...
    ...
    _ => TokenKind::Id(that_string)
})

So, it will check if the string you built up is a keyword, and if not (_ =>), it must be an identifier.

Here are the keywords you have to support:

Token Enum value
if TokenKind::If
else TokenKind::Else
for TokenKind::For
in TokenKind::In
fn TokenKind::Fn
let TokenKind::Let
while TokenKind::While
break TokenKind::Break
return TokenKind::Return
int TokenKind::Int
bool TokenKind::Bool
string TokenKind::String
and TokenKind::And
or TokenKind::Or
not TokenKind::Not
true TokenKind::True
false TokenKind::False
struct TokenKind::Struct
new TokenKind::New

Strings (test: string_literals)

Okay, things are getting more difficult now, because now we have to deal with a few error cases!

String literals are "double quoted" like in most languages. So when you see a '"', call your lex_string method; in there, skip the quote and start building up a string character-by-character in a loop, just like with the identifiers.


Integers (tests: good_numbers, bad_numbers)

Final boss time. Who would have thought that integers would be the hardest part?

First have a look at the good_numbers and bad_numbers tests to get an intuitive idea of what makes a “good” or “bad” integer.

Here are the formal grammar rules for integers:

IntLit:    DecIntLit | HexIntLit | BinIntLit
DecIntLit: DecChar (DecChar | '_')*
HexIntLit: '0' ('x'|'X') HexChar (HexChar | '_')*
BinIntLit: '0' ('b'|'B') BinChar (BinChar | '_')*
DecChar:   '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9'
HexChar:   DecChar|'a'|'b'|'c'|'d'|'e'|'f'|'A'|'B'|'C'|'D'|'E'|'F'
BinChar:   '0'|'1'

Here is an informal description of them:

To help you approach this:

Errors: there are a number of ways integers can be incorrect:


Victory lap (test: all_together)

If everything else is working, the all_together test should work too. Woo! You did it!


Submission and Grading

Project submissions will only be accepted through GitHub. If you are having trouble using git, please get in touch sooner rather than at 11:58 PM on the due date.

The project is due at midnight on the due date, but there is some wiggle room there. (We can see the time that you pushed to GitHub.)

You can turn it in for late credit until midnight on the day after the due date. You’ll get 10% off your grade.

Grading rubric: