In this project, you’ll make (the important parts of) a lexer for the simple C/Rust-like language I seem to be designing in the lectures. I’ve already done all the boring parts - defining the tokens and the errors, writing the tests, writing the code around the lexer - and now you’ll write the bits that… lex.

Quick start guide:

  1. If you don’t have one, create an account on GitHub.
    • If you already have an account, you don’t need to make a new one.
  2. You should have received a GitHub invitation link in your email. Click it!
    • Now you should have access to your proj1 repo.
  3. git clone it to your computer.
    • If you don’t have git installed, GitHub has tutorials for that.
  4. Edit the src/lexer.rs file to make the lexer recognize the specified tokens. (Read below.)
  5. As you work, make use of git commits to keep track of your progress.
    • Stage your changed files with git add
      • then commit them with a descriptive message: git commit -m "message"
    • I like to think of commits as “checkpoints”. Work on one step, finish it, commit, repeat.
  6. When you’re done (or when you’re not done but it’s due):
    1. Update the README.txt to include your name and username and any information you think will be useful to the grader.
    2. Also update this line in Cargo.toml:
      • authors = ["YOUR NAME <YOUR PITT EMAIL>"]
    3. Commit those changes, and…
    4. git push origin master to submit it.
      • the last commit before the due date is what we will grade.

You can also use git push origin master at any time to ensure that there’s a safe copy of your project on the GitHub servers, or to transfer your code between computers (by pushing it from one and pulling it on the other).


The starting point

If you cargo run right now, you’ll get some warnings about unused methods (that’s fine). Then you’ll be given a prompt to type code in. Right now, the lexer only recognizes three things: spaces, tabs, and newlines.

Any other characters will give an error.

This interactive lexer functionality is implemented in src/bin/main.rs; you probably won’t have to change that file.

There are three important files for you:

Running the tests

At the bottom of src/lexer.rs is a set of test functions. Every function there marked with #[test] will be run by the cargo test command. Run that now, and you’ll see something like:

$ cargo test
...
running 11 tests
test lexer::tests::bad_numbers ... FAILED
test lexer::tests::complex_symbols ... FAILED
test lexer::tests::all_together ... FAILED
test lexer::tests::comments ... FAILED
test lexer::tests::good_numbers ... FAILED
test lexer::tests::identifiers ... FAILED
test lexer::tests::invalid_chars ... ok
test lexer::tests::keywords ... FAILED
test lexer::tests::simple_symbols ... FAILED
test lexer::tests::string_literals ... FAILED
test lexer::tests::whitespace ... ok

followed by detailed error messages about the tests that failed.

Your goal is to make all the tests happy! Yay!

If you see a failure message like this:

---- lexer::tests::simple_symbols stdout ----
thread 'lexer::tests::simple_symbols' panicked at 'called `Result::unwrap()` on an `Err` value:
 LexError { pos: 0, kind: InvalidChar }', src/lexer.rs:114:9

That means “the test expected your lexer to succeed, but it returned an error instead.” It shows you what error your lexer returned. You can have a look at the source of that test (in this case, fn simple_symbols) to see what it’s passing to your lexer.

The helper functions I’ve given you

The Lexer object is simple: it contains the input string broken up into a vector of characters, and keeps the current position in self.pos. Normally, though, you won’t need to access self.input or self.pos directly. Instead, there are methods:

You don’t have to change new, ensure_eof_at_end, or lex. Instead, you’ll do this:


Let’s implement Lexer::next_token!

Have a look at that method. You can see the code that handles spaces, tabs, and newlines.

The newline match case looks like this:

'\n' => { self.next_char(); return Ok(Token::Newline);  }

This says, “when we see a '\n' character, move past it with next_char, and return a newline token.” We have to wrap that token in Ok() to indicate that the function returned successfully. Most of your lexing rules will look like this.

The last case is the “default”:

_ => return Err(invalid_char(self.pos)),

When you want to indicate an error, you return Err(...) using one of the error-construction helper functions. self.pos makes the error message point at the current position. You won’t always use the current position as the location though!

Below are tables of the tokens you will implement, in the order I recommend you implement them.


Simple symbols (test: simple_symbols)

Follow the pattern of '\n' to implement them; there’s a short sketch after the table.

Token   Enum value        Notes
+       Token::Plus
-       Token::Minus
*       Token::Times
/       Token::Divide
%       Token::Modulo
(       Token::LParen     yeah there’s…
)       Token::RParen
{       Token::LBrace
}       Token::RBrace
[       Token::LBracket
]       Token::RBracket   not much to say here lol
;       Token::Semi
,       Token::Comma
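
For example, the first couple of arms look just like the '\n' case above (a sketch; any equivalent way of writing them is fine):

'+' => { self.next_char(); return Ok(Token::Plus); }
'-' => { self.next_char(); return Ok(Token::Minus); }
// ...and so on, one arm for each symbol in the table.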

Harder symbols (test: complex_symbols)

These are symbols that can be more than one character long, or that are ambiguous until you look at the second character.

Token   Enum value         Notes
=       Token::Assign
==      Token::Eq
!=      Token::NotEq       ! is not a valid token! Give an invalid_char error at its position.
<       Token::Less
<=      Token::LessEq
>       Token::Greater
>=      Token::GreaterEq
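
Here’s one possible shape for the '=' and '!' cases. I’m writing self.cur() as a stand-in for whatever current-character method the starter code actually gives you, so swap in the real one:

'=' => {
    self.next_char();
    if self.cur() == '=' {       // self.cur() is a stand-in helper
        self.next_char();
        return Ok(Token::Eq);    // saw ==
    }
    return Ok(Token::Assign);    // just =
}
'!' => {
    let start = self.pos;        // remember where the ! was
    self.next_char();
    if self.cur() == '=' {
        self.next_char();
        return Ok(Token::NotEq); // saw !=
    }
    return Err(invalid_char(start)); // a lone ! is an error, at its position
}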

Identifiers (test: identifiers)

These are what I’ve called Var in the example: the names of variables, functions, classes, whatever. Here are the rules:

These are a bit more complex to implement, but not by much.
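
Here’s a rough sketch of the shape. I’m assuming the usual letters-digits-underscores rule (the identifiers test is the real authority on the character set), and self.cur() is again a stand-in for the starter code’s current-character helper:

c if c.is_alphabetic() || c == '_' => {
    let mut s = String::new();
    while self.cur().is_alphanumeric() || self.cur() == '_' {
        s.push(self.cur());
        self.next_char();
    }
    // then: match s against the keywords (next section), and if it
    // isn't one, return the identifier token with s as its value.
}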


Keywords (test: keywords)

Keywords are the words that have special meaning to the programming language.

Once you’ve implemented identifiers, keywords are easy. You can match on strings as well as characters.

In the same match case where you lex identifiers, after the loop where you build up the String, do a match on it like:

match that_string.as_str() {
    "if" => return Ok(Token::If),
    ...
    _ => ... // must be an identifier
}

Token    Enum value       Notes
if       Token::If
else     Token::Else
for      Token::For
in       Token::In
fn       Token::Fn
let      Token::Let
while    Token::While
break    Token::Break
int      Token::Int
bool     Token::Bool
string   Token::String    idk why I have this notes column

Comments (test: comments)

Let’s take a detour and handle comments first.

There are only line comments, which start with // and extend until the newline character ('\n').

The comments are not tokens, but the newline at the end of them still counts and should be a Token::Newline token.

So go back to your '/' => case and expand it, kind of like the other “complex symbols.” If you see two / in a row, loop until the end of the line or the end of the input (the '\0' character) and return the appropriate token; the comments test checks both endings. There’s a sketch below.
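
Here’s one way to structure it. self.cur() is my stand-in for the starter code’s current-character helper; having next_token call itself after skipping the comment means the existing '\n' and end-of-input cases produce the right token for you:

'/' => {
    self.next_char();
    if self.cur() == '/' {
        // a comment: skip everything up to, but not including, the
        // '\n' or '\0' that ends it...
        while self.cur() != '\n' && self.cur() != '\0' {
            self.next_char();
        }
        // ...then lex that terminator normally.
        return self.next_token();
    }
    return Ok(Token::Divide);  // just a single /
}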


Strings (test: string_literals)

Okay, things are getting more difficult now, because now we have to deal with a few error cases!

String literals are "double quoted" like in most languages. So when you see the '"', skip it, and start building up a string character-by-character in a loop, just like with the identifiers. When you see the closing '"', skip it and break out of the loop. Then you can return a Token::StrLit() with the string’s value.

Strings cannot span multiple lines, so if you see a '\n' character before the closing ", that’s an error. The same goes for end-of-input ('\0'). You should report that as an unclosed_string error at the opening quote’s position, not at the newline’s! So you have to store that position in a local variable beforehand.

Finally, strings can have escape sequences like many other languages. They start with a backslash '\', and can be one of the following:

If the character after a backslash is anything else, report it as an unknown_escape at the position of the backslash.

For each of these, you can push that literal character into the string, like s.push('\\'). These escape sequences are part of Rust too, so it just works!
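
Putting it all together, here’s a sketch of the whole string case. Only the one escape mentioned above is filled in; self.cur() is still a stand-in helper, and I’m assuming unclosed_string and unknown_escape take a position the same way invalid_char does:

'"' => {
    let start = self.pos;   // the opening quote's position, for errors
    self.next_char();       // skip the opening "
    let mut s = String::new();
    loop {
        match self.cur() {
            '"' => { self.next_char(); break; }
            '\n' | '\0' => return Err(unclosed_string(start)),
            '\\' => {
                let esc = self.pos; // the backslash's position, for errors
                self.next_char();
                match self.cur() {
                    '\\' => s.push('\\'),
                    // ...the other escape sequences...
                    _ => return Err(unknown_escape(esc)),
                }
                self.next_char();
            }
            c => { s.push(c); self.next_char(); }
        }
    }
    return Ok(Token::StrLit(s));
}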


Integers (tests: good_numbers, bad_numbers)

Final boss time. Who would have thought that integers would be the hardest part?

First have a look at the good_numbers and bad_numbers tests. Here are the rules for integers:

To help you approach this:

Errors: there are a number of ways integers can be incorrect:

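Whatever the exact rules turn out to be, the basic shape is a digit-collecting loop like the identifier one. A very rough sketch follows; Token::IntLit and bad_number are placeholder names I made up, not the real ones, so check the starter code:

c if c.is_ascii_digit() => {
    let start = self.pos;
    let mut digits = String::new();
    while self.cur().is_ascii_digit() {
        digits.push(self.cur());
        self.next_char();
    }
    match digits.parse::<i64>() {
        Ok(n) => return Ok(Token::IntLit(n)),    // placeholder name!
        Err(_) => return Err(bad_number(start)), // placeholder helper!
    }
}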

Victory lap (test: all_together)

If everything else is working, the all_together test should work too. Woo! You did it!


Submission and Grading

Project submissions will only be accepted through GitHub. If you are having trouble using git, please get in touch sooner rather than at 11:58 PM on the due date.

The project is due at midnight on the due date, but there is some wiggle room there. (We can see the time that you pushed to GitHub.)

You can turn it in for late credit until midnight on the day after the due date. You’ll get 10% off your grade.

Grading rubric: