Adding lookbehinds to rust-lang/regex

Summary

An annotated guide to adding captureless lookbehinds to the Rust linear-time regex engine.

In a previous blog post, Erik wrote about how he implemented the linear-time matching algorithm from [RegElk+PLDI24] in the popular regex engine RE2. In this one, we look at how to do the same thing for the official regex engine of the Rust language. Namely, we add support for unbounded captureless lookbehinds.

First, let's look at the newly supported feature and its limitations. A lookbehind lets a regex assert that something precedes some part of the pattern without making it part of the match. Similarly, a negative lookbehind, which we also support, asserts that something does not precede it. Consider the regex (?<=Title:\s+)\w+ which matches the following strings (matches underlined with ~):

    Title: HelloWorld
           ~~~~~~~~~~
    Title: Title: foo
           ~~~~~  ~~~

But it does not match:

    No heading
    title: bad case
    Title:nospace

As the example shows, lookbehind expressions can be unbounded: we do not know ahead of time how many characters a lookbehind will match. This is an improvement over many existing regex engines, which support lookbehinds only if they are of bounded length (like the ubiquitous PCRE) or sometimes even fixed length (like Python's re). As a downside, our lookbehinds cannot contain capture groups, the feature that extracts the substring matched by a part of the regex pattern.

The actual implementation can be found in the PR to the rust-lang/regex repository.

Architecture of the regex crate

In the Rust ecosystem, published libraries are called crates. The language team maintains a handful of official crates, one of which is called "regex" and provides exactly what you would expect: a regex matching engine.
Under the hood, the regex crate is only a thin wrapper around the much more elaborate "regex-automata" crate, which provides several different engine implementations to match regexes...
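To make the semantics of the (?<=Title:\s+)\w+ example from earlier in this summary concrete, here is a minimal sketch that emulates that one pattern by hand using only the Rust standard library. It is not the implementation from the PR and it does not use the regex crate: the function names `is_word`, `lookbehind_holds` and `find_matches` are hypothetical, `\w` is approximated as "alphanumeric or underscore", and the scan is not linear-time. It only illustrates that the lookbehind constrains where a match may start without becoming part of the match.

```rust
/// Rough stand-in for `\w`: alphanumeric characters and the underscore.
/// (Assumption: the real Unicode-aware class covers a few more characters.)
fn is_word(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Returns true if the text ending at byte offset `end` matches `Title:\s+`,
/// i.e. if the lookbehind assertion holds at that position.
fn lookbehind_holds(haystack: &str, end: usize) -> bool {
    let before = &haystack[..end];
    let trimmed = before.trim_end_matches(char::is_whitespace);
    // `\s+` needs at least one whitespace character, and `Title:` must sit
    // directly in front of that run of whitespace.
    trimmed.len() < before.len() && trimmed.ends_with("Title:")
}

/// Finds all leftmost, non-overlapping matches of `(?<=Title:\s+)\w+`.
/// This naive scan re-checks the lookbehind at every candidate position,
/// so it is quadratic in the worst case.
fn find_matches(haystack: &str) -> Vec<&str> {
    let mut matches = Vec::new();
    let mut pos = 0;
    while pos < haystack.len() {
        let rest = &haystack[pos..];
        // Byte length of the longest run of word characters starting at `pos`.
        let word_len = rest.find(|c: char| !is_word(c)).unwrap_or(rest.len());
        if word_len > 0 && lookbehind_holds(haystack, pos) {
            // Only the `\w+` part is reported; the lookbehind text is not.
            matches.push(&rest[..word_len]);
            pos += word_len;
        } else {
            // No match here: advance by one character (not one byte).
            pos += rest.chars().next().map_or(1, char::len_utf8);
        }
    }
    matches
}
```

Calling `find_matches("Title: Title: foo")` yields the two matches `Title` and `foo`, while inputs such as `Title:nospace` yield none because `\s+` requires at least one whitespace character before the word.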
