Introduction

With modern, overengineered, and over-obfuscated websites, we at SerpApi face increasing challenges in extracting data from them. Besides the usual HTML parsing, sometimes we're literally forced to fall back to good ol' regular expressions, e.g. for extracting embedded JS data. And while regexps do the trick, they can come at a cost. Onigmo, the default regexp engine in Ruby, was substantially updated in Ruby 3.2, but it still has weak points that can really hurt scan time and add latency to our search requests. Let's find out what alternatives are available in the wild and how they compare to Ruby.

Contenders

re2

It's developed by Google and widely used in various Google products. Under the hood it uses what they call "an on-the-fly deterministic finite-state automaton algorithm based on Ken Thompson's Plan 9 grep". re2 is stated to have been designed with the explicit goal of handling regular expressions from untrusted sources, i.e. of being resistant to ReDoS attacks. There is a well-maintained Ruby bindings gem.

rust/regex

The native regex engine in Rust. According to rebar, it's one of the fastest engines overall, and it uses the same approach as re2 of building the DFA at search time. There are no up-to-date, ready-to-use Ruby bindings, so I've created a simple PoC for this comparison.

pcre2

One of the best-known regex engines, thanks to wide adoption across many commercial and open-source products, as well as languages like PHP and R, where it's the default engine. It supports a separate JIT mode that improves search time significantly in most cases. Unfortunately, the Ruby bindings are outdated and do not work properly. For instance, the JIT mode mentioned above cannot be enabled with the latest binaries, so the engine is not worth including in the comparison.

Benchmarks

The benchmarks presented here are variations of the rebar ones, specifically those that are validated with the count and count-spans models.
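To make the two models more concrete, here is a minimal sketch of what they measure, using only Ruby's built-in engine and the standard `Benchmark` module. It assumes that "count" means the number of matches and "count-spans" means the summed byte length of all matches, which is my reading of rebar's models. The haystack, the pattern, and the helper names `count_matches` and `count_spans` are made up for illustration and are not part of the actual benchmark suite.

```ruby
require "benchmark"

# Hypothetical haystack: a page fragment with embedded JS data, repeated
# to get a measurable scan time. Not taken from the real benchmarks.
HAYSTACK = ('<script>var data = {"id": 123, "title": "foo"};</script>' * 1_000).freeze

# Hypothetical pattern: a quoted key followed by a number or a quoted string.
PATTERN = /"(\w+)":\s*(\d+|"[^"]*")/

# "count" model (assumed meaning): how many matches the engine finds.
def count_matches(pattern, haystack)
  count = 0
  haystack.scan(pattern) { count += 1 }
  count
end

# "count-spans" model (assumed meaning): total byte length of all matches.
def count_spans(pattern, haystack)
  total = 0
  haystack.scan(pattern) { total += Regexp.last_match(0).bytesize }
  total
end

# Time both models; the elapsed time depends on the machine and Ruby version.
elapsed_count = Benchmark.realtime { count_matches(PATTERN, HAYSTACK) }
elapsed_spans = Benchmark.realtime { count_spans(PATTERN, HAYSTACK) }

puts format("count:       %.6fs", elapsed_count)
puts format("count-spans: %.6fs", elapsed_spans)
```

Checking the returned count or span total against a known value is what lets a benchmark confirm that every engine did the same work before their timings are compared.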
The following results were gathered using...