Strategies for Fast Lexers

https://news.ycombinator.com/rss Hits: 5
Summary

In this blog post I’ll explain strategies I used to make the purple garden lexer really fast.purple-garden is an s-expr based language I am currently developing for myself. Its my attempt at building a language I like, with a battery included approach, while designing it with performance in mind.This doesn’t mean all approaches are feasible for your use case, architecture and design. I tried to bring receipts for my performance claims, so watch out for these blocks at the end of chapters:Info - BenchmarkIntroduction to LexingA lexer (often also referred to as a tokeniser) is the easiest part of any compilation and language pipeline. The idea is to convert a list of characters into a list of tokens in which each token conveys some meaning. This list of tokens can then be used by the parser to generate an abstract syntax tree (AST), which the compiler consumes, converting it to bytecode, which the vm executes.A compilation pipelineAs an overview:┌───────────────────┐ │ │ │ Lexer │ <- we are here │ │ └─────────┬─────────┘ │ │ Tokens <- and here │ ┌─────────▼─────────┐ │ │ │ Parser │ │ │ └─────────┬─────────┘ │ │ AST │ ┌─────────▼─────────┐ │ │ │ Compiler │ │ │ └─────────┬─────────┘ │ │ Bytecode │ ┌─────────▼─────────┐ │ │ │ Virtual Machine │ │ │ └───────────────────┘For a list of characters, lets say (@std.fmt.println "my pi is: " 3.1415):Input to the lexer:TEXT1(@std.fmt.println "my pi is: " 3.1415)As characters:TEXT 1( 2@ 3s 4t 5d 6. 7f 8m 9t 10// .... 11)As pseudo tokens:TEXT1( 2@std 3. 4fmt 5. 6println 7"my pi is: " 83.1415 9)The above is just an example and I’ll go into detail below:Defining purple garden’s TokensA token is not only a set characters it can be mapped to, but it also holds:A token type, to easily distinguish between tokensPositional information:start pointend or lengthline numberPurple garden keeps it as minimal as possible and just has a String and a token type:C 1typedef enum { 2 // ( 3 T_DELIMITOR_LEFT = 1, 4 // + 5 T_PLUS = 2, 6 // - 7 T_MINUS =...

First seen: 2025-07-14 16:00

Last seen: 2025-07-14 20:00