On Wednesday, Wikimedia Deutschland announced a new database that will make Wikipedia’s wealth of knowledge more accessible to AI models. Called the Wikidata Embedding Project, the system applies a vector-based semantic search — a technique that helps computers understand the meaning and relationships between words — to the existing data on Wikipedia and its sister platforms, consisting of nearly 120 million entries. Combined with new support for the Model Context Protocol (MCP), a standard that helps AI systems communicate with data sources, the project makes the data more accessible to natural language queries from LLMs. The project was undertaken by Wikimedia’s German branch in collaboration with the neural search company Jina.AI and DataStax, a real-time training-data company owned by IBM. Wikidata has offered machine-readable data from Wikimedia properties for years, but the pre-existing tools only allowed for keyword searches and SPARQL queries, a specialized query language. The new system will work better with retrieval-augmented generation (RAG) systems that allow AI models to pull in external information, giving developers a chance to ground their models in knowledge verified by Wikipedia editors. The data is also structured to provide crucial semantic context. Querying the database for the word “scientist,” for instance, will produce lists of prominent nuclear scientists as well as scientists who worked at Bell Labs. There are also translations of the word “scientist” into different languages, a Wikimedia-cleared image of scientists at work, and extrapolations to related concepts like “researcher” and “scholar.” The database is publicly accessible on Toolforge. Wikidata is also hosting a webinar for interested developers on October 9th. Techcrunch event San Francisco | October 27-29, 2025 The new project comes as AI developers are scrambling for high-quality data sources that can be used to fine-tune models. The training systems themselves have become more...
First seen: 2025-10-01 08:41
Last seen: 2025-10-01 15:43