A high-throughput parser for the Zig programming language

https://news.ycombinator.com/rss Hits: 15
Summary

Accelerated Zig Parser

A high-throughput tokenizer and parser (soon™️) for the Zig programming language.

The mainline Zig tokenizer uses a deterministic finite state machine. Those are pretty good for some applications, but tokenizing can often use other techniques for added speed.

Two tokenizer implementations are provided:

1. A version that produces a few bitstrings per 64-byte chunk and uses those to skip over continuation-character matching. I gave two talks on this subject. (Currently this code has gone poof, but I will resurrect it for comparison's sake within 3 months, when I give my final Utah-Zig talk on the subject of the Zig Tokenizer in July.)

2. A version that produces bitstrings for EVERYTHING we want to do within a 64-byte chunk, and uses vector compression to find the extents of all tokens simultaneously. See this animation. I also gave a talk (really more of a rant) about my grand plans here. Unfortunately it did not turn out how I had hoped because I got sick before I had time to give it the love it deserves. But my next talk shall knocketh thy socks off, guaranteed!

The test bench as it sits on my computer right now prints this out when I run it:

    Read in files in 26.479ms (1775.63 MB/s) and used 47.018545MB memory with 3504899722 lines across 3253 files
    Legacy Tokenizing took 91.419ms (0.51 GB/s, 38.34B loc/s) and used 40.07934MB memory
    Tokenizing with compression took 33.301ms (1.41 GB/s, 105.25B loc/s) and used 16.209284MB memory

That's 2.75x faster and 2.47x less memory than the mainline implementation! And I still have more optimization plans >:D !!! Stay tuned!

See my article on the new tokenizer here: https://validark.dev/posts/deus-lex-machina/

Tokenizer 1

Everything beneath this notice was written with regard to Tokenizer 1. The information is a little out of date, but the optimization strategies are still applicable. Click here to see my latest work.

Results

Currently the UTF-8 validator is turned off!
I did a lot of perfo...
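
To make the bitstring idea from the summary concrete, here is a minimal sketch in Rust (not the project's Zig code, and not its actual implementation). It builds one 64-bit mask per 64-byte chunk that marks identifier bytes, then uses bit tricks to find where each identifier starts and how long it runs. Everything here is an assumption made for illustration: the function names `pad_chunk`, `ident_mask`, `ident_run_len`, and `ident_extents` are hypothetical, the scalar loop stands in for a SIMD compare, the loop over set bits stands in for vector compression, and the sketch ignores runs that cross a chunk boundary as well as the rule that identifiers cannot start with a digit.

```rust
/// Copy up to 64 bytes of source into a space-padded 64-byte chunk.
fn pad_chunk(src: &[u8]) -> [u8; 64] {
    let mut chunk = [b' '; 64];
    let n = src.len().min(64);
    chunk[..n].copy_from_slice(&src[..n]);
    chunk
}

/// Build a bitstring for one chunk: bit i is set when chunk[i] is an
/// identifier byte (ASCII letter, digit, or underscore).
/// A real SIMD tokenizer would get this mask from vector compares;
/// this loop is a scalar stand-in.
fn ident_mask(chunk: &[u8; 64]) -> u64 {
    let mut mask = 0u64;
    for (i, &b) in chunk.iter().enumerate() {
        if b.is_ascii_alphanumeric() || b == b'_' {
            mask |= 1u64 << i;
        }
    }
    mask
}

/// Length of the run of identifier bytes beginning at `pos` (0..=63).
/// Shifting moves bit `pos` down to bit 0; inverting turns the run of
/// ones into a run of zeros that a single count-trailing-zeros measures,
/// so no per-byte continuation check is needed.
fn ident_run_len(mask: u64, pos: u32) -> u32 {
    (!(mask >> pos)).trailing_zeros()
}

/// Find (start, length) for every identifier run inside the chunk.
fn ident_extents(chunk: &[u8; 64]) -> Vec<(u32, u32)> {
    let mask = ident_mask(chunk);
    // A run starts where a bit is set and the previous bit is clear.
    let mut starts = mask & !(mask << 1);
    let mut out = Vec::new();
    while starts != 0 {
        let pos = starts.trailing_zeros();
        out.push((pos, ident_run_len(mask, pos)));
        starts &= starts - 1; // clear the lowest set bit
    }
    out
}
```

For the input `const foo_1 = bar;`, `ident_extents(&pad_chunk(b"const foo_1 = bar;"))` yields the three runs `const`, `foo_1`, and `bar` as (start, length) pairs. The point of the design is that the per-byte classification happens once per chunk, and everything after that is a handful of integer operations per token.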

First seen: 2025-04-16 16:18

Last seen: 2025-04-17 06:24