Problem 1: Not understanding what a lexer is.
I thought a lexer was called a parser. So, my lexer did both the tokenization and the evaluation. Not good. What's more, the "parser" was complete garbage; it looked absolutely disgusting.
The lexer is responsible for producing tokens from source code. The parser structures those tokens into an abstract syntax tree (AST). Then the interpreter traverses the AST and is responsible for code execution.
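To make that division of labour concrete, here is a minimal sketch of a lexer that only produces tokens and evaluates nothing. The token names, the Token layout, and the lex function are all hypothetical and chosen for illustration; they are not taken from my actual code.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds for a tiny expression language. */
typedef enum { TOK_NUMBER, TOK_PLUS, TOK_GREATER, TOK_GREATER_EQUAL, TOK_EOF } TokenType;

typedef struct {
    TokenType type;
    const char *start; /* points into the source; not NUL-terminated */
    size_t length;     /* number of characters in the lexeme */
} Token;

/* Writes up to max tokens into out and returns how many were written,
   including the final TOK_EOF. Unrecognised characters are skipped. */
size_t lex(const char *src, Token *out, size_t max) {
    size_t n = 0;
    const char *p = src;
    while (*p != '\0' && n + 1 < max) {
        if (isspace((unsigned char)*p)) { p++; continue; }
        const char *start = p;
        TokenType type;
        if (isdigit((unsigned char)*p)) {
            while (isdigit((unsigned char)*p)) p++;
            type = TOK_NUMBER;
        } else if (*p == '+') {
            p++;
            type = TOK_PLUS;
        } else if (*p == '>') {
            p++;
            /* One character of lookahead decides between > and >=. */
            if (*p == '=') { p++; type = TOK_GREATER_EQUAL; }
            else type = TOK_GREATER;
        } else {
            p++; /* skip anything this sketch does not recognise */
            continue;
        }
        out[n++] = (Token){ type, start, (size_t)(p - start) };
    }
    if (n < max) out[n++] = (Token){ TOK_EOF, p, 0 };
    return n;
}
```

Nothing here knows what + or >= mean. That knowledge belongs to the parser and the interpreter, which would consume the token array afterwards.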
Problem 2: Unfamiliarity with C.
- I forgot that *foo and foo[0] are the same thing.
    void print_token(Scanner *scanner) {
        Token token = *scanner->tokens;
        for (int i = 0; token.type != END_OF_FILE; i++) {
            ...
            printf("%s ('%c') at line %i\n", name, *lexeme, token.line);
            token = scanner->tokens[i];
        }
    }
I created a variable for the current token using Token token = *scanner->tokens; printed it inside the loop, and then used i to get the next element. What's the problem? *scanner->tokens is the same as scanner->tokens[0], and on the first iteration i is also 0, so the assignment at the end of the loop body reloads the token I had just printed. This means that I print the value at index 0 twice. As a result, I thought something was wrong with my lexer when it was completely fine. It was a logic error in a completely different location.
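One way to avoid the double print is to index the array directly in the loop, so no separate "current token" copy can fall out of step with i. This is only a sketch: the Token and Scanner definitions below are hypothetical stand-ins for my real ones, and it prints the numeric token type because the code that computes name is elided above.

```c
#include <stdio.h>

/* Hypothetical minimal types; the real Token and Scanner have more to them. */
typedef enum { PLUS, MINUS, END_OF_FILE } TokenType;

typedef struct {
    TokenType type;
    const char *lexeme;
    int line;
} Token;

typedef struct {
    Token *tokens; /* array terminated by an END_OF_FILE token */
} Scanner;

/* Prints every token before END_OF_FILE exactly once and
   returns how many tokens were printed. */
int print_tokens(const Scanner *scanner) {
    int count = 0;
    for (int i = 0; scanner->tokens[i].type != END_OF_FILE; i++) {
        const Token *token = &scanner->tokens[i];
        printf("type %d ('%s') at line %i\n",
               (int)token->type, token->lexeme, token->line);
        count++;
    }
    return count;
}
```

Because the loop condition and the print both read scanner->tokens[i], index 0 is visited once like every other element.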
Problem 3: General carelessness [1]
- I forgot about the order of operations.
When allocating an array to store my tokens, I did it like this:
    scanner->tokens = malloc(sizeof(Token) * length_of_src + 1);
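The problem is that multiplication binds tighter than addition, so this allocates space for length_of_src Tokens plus 1 more byte, not length_of_src + 1 Tokens. There is no room for the extra Token that the + 1 was meant for. This oversight caused so many issues down the line.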
    scanner->tokens = malloc(sizeof(Token) * (length_of_src + 1));
The fix was to add brackets so that length_of_src + 1 is evaluated before the multiplication.
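The difference between the two expressions is easy to check by computing the sizes without allocating anything. The sketch below assumes a made-up Token layout; buggy_size, fixed_size, and alloc_tokens are hypothetical helper names for illustration.

```c
#include <stdlib.h>

/* Hypothetical Token layout, only here so that sizeof(Token) is larger than 1. */
typedef struct {
    int type;
    const char *lexeme;
    int line;
} Token;

/* What the original expression computes: n Tokens plus a single byte. */
size_t buggy_size(size_t length_of_src) {
    return sizeof(Token) * length_of_src + 1;
}

/* What was intended: room for n + 1 whole Tokens. */
size_t fixed_size(size_t length_of_src) {
    return sizeof(Token) * (length_of_src + 1);
}

/* calloc takes the element count and the element size as separate arguments,
   so there is no precedence to get wrong; it also zeroes the memory. */
Token *alloc_tokens(size_t length_of_src) {
    return calloc(length_of_src + 1, sizeof(Token));
}
```

For any length, the buggy size falls short of the intended one by sizeof(Token) - 1 bytes. Using calloc is an alternative to the bracketed malloc, not a requirement; the caller still has to check for NULL and free the result.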
- char* foo vs char foo. [2]
At some point I had to add multi-character tokens to the lexer. For example, >= is considered one token, not > and = separately. When I printed the tokens, this is how I did it:
    char *lexeme = token.lexeme;
    printf("%s ('%c') at line %i\n", name, *lexeme, token.line);
First, char *lexeme is a C string. This is fine because token.lexeme is of type char *. At that point, my lexer only supported single-character lexemes. As such, I thought of char * as a pointer to a single character, not a pointer to the first element of an array of characters. This caused me a lot of trouble, as I kept wondering how to handle lexemes that were one character long differently from those that were several characters long.
Arrays can have no elements, a single element, or as many as you want. Therefore, char *lexeme = "c" just means that lexeme points to a string with one character in it (followed by the terminating '\0'). Then I came to the realisation that I can just print the string in its entirety, which resulted in this:
    printf("%s ('%s') at line %i\n", name, lexeme, token.line);
This completely fixed my issues.
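The difference between the two format specifiers can be seen side by side. In this sketch, format_first_char and format_lexeme are hypothetical helpers that write into a buffer with snprintf instead of printing, purely so the results are easy to compare.

```c
#include <stdio.h>

/* The old approach: %c with *lexeme only ever shows the first character. */
int format_first_char(char *out, size_t size, const char *lexeme, int line) {
    return snprintf(out, size, "('%c') at line %i", *lexeme, line);
}

/* The fix: %s prints the whole NUL-terminated string, whether it holds
   one character or several. */
int format_lexeme(char *out, size_t size, const char *lexeme, int line) {
    return snprintf(out, size, "('%s') at line %i", lexeme, line);
}
```

With the lexeme ">=" on line 3, the first helper produces ('>') at line 3 and the second produces ('>=') at line 3. For a single-character lexeme such as "+", both produce the same text, which is why the %c version looked correct for so long.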