Language Support for Marginalia Search

https://news.ycombinator.com/rss Hits: 11
Summary

One of the big ambitions for the search engine this year has been to enable searching in more languages than English, and a pilot project for this has just been completed, allowing experimental support for German, French and Swedish.

These changes are now live for testing, but with an extremely small corpus of documents.

As the search engine has up to this point been built with English in mind, some Anglo-centric assumptions made it into its code. A lot of the research on search engines generally seems to embed similar assumptions.

As this is a domain rife with unknown unknowns, the ambition for this pilot was to implement support for just a few additional languages in order to get a feel for how much work would be required to support more languages in general, as well as to assess how much the index grows when this is done.

It was fully understood upfront that supporting all languages in one go is unrealistic, as some languages are more different than others and require significant additional work. Human language is surprisingly disparate.

A language like Japanese, for example, not only has multiple writing systems, but also has character width encoded in Unicode (full-width and half-width forms); on top of that, the language doesn’t put spaces between words. As such, the language requires special normalization.

Latin, on the other hand, has dozens of forms for each word, and the words can often be reordered without significantly changing the meaning of a sentence. On the one hand, this makes the grammatical analysis of the language somewhat easier, since the words announce their function in the sentence fairly unambiguously; on the other, you probably need to store the text in a lemmatized form, and then strongly de-prioritize word order when matching.

Google’s bungled handling of Russian was supposedly why Yandex was able to eke out a foothold in that market.

What needs changing

The search engine’s language processing chain is fairly long, but the most salient parts go something like this:

Text is extracted from the...
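
To make the point about Japanese character width concrete, here is a minimal sketch of width normalization using Unicode NFKC from Python's standard library. It is not taken from Marginalia's code, and the helper name `normalize_width` is hypothetical. It only shows that full-width Latin characters and half-width katakana are distinct code points that can be folded to a common form before indexing.

```python
import unicodedata


def normalize_width(text: str) -> str:
    """Fold full-width and half-width compatibility forms to their
    standard counterparts using Unicode NFKC normalization.

    Hypothetical helper for illustration only. NFKC also folds other
    compatibility characters (ligatures, circled digits, and so on),
    which may or may not be desirable in a real indexing pipeline.
    """
    return unicodedata.normalize("NFKC", text)


# Full-width Latin letters and digits (U+FF21..., U+FF11...).
print(normalize_width("\uff21\uff22\uff23\uff11\uff12\uff13"))  # ABC123

# Half-width katakana KA followed by the half-width voiced sound mark;
# NFKC maps both and then composes them into the single character GA.
print(normalize_width("\uff76\uff9e") == "\u30ac")  # True
```

Width folding is only one piece of the normalization. The lack of spaces between words is a separate problem that needs a word segmenter, which this sketch does not attempt.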

First seen: 2025-10-21 08:08

Last seen: 2025-10-21 18:11