Poisoning Well for LLMs

https://news.ycombinator.com/rss Hits: 10
Summary

Poisoning Well, 31st March 2025

One of the many pressing issues with Large Language Models (LLMs) is that they are trained on content that isn't theirs to consume. Since most of what they consume is on the open web, it's difficult for authors to withhold consent without also depriving legitimate agents (AKA humans, or "meat bags") of information.

Some well-meaning but naive developers have implored authors to instate robots.txt rules intended to block LLM-associated crawlers:

User-agent: GPTBot
Disallow: /

But, as the article "Please stop externalizing your costs directly in my face" attests: "If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality." Even if ChatGPT did respect robots.txt, it's not the only LLM-associated crawler, and some asshat creates a new generative AI brand seemingly every day. Maintaining your robots.txt would be interminable (an illustrative example follows this summary).

You can't stop these crawlers. They vacuum up content with colonist zeal. So some folks have started experimenting with luring them instead: luring them into consuming tainted content, designed to contaminate their output and undermine their perceived efficacy (a rough sketch of such a lure also appears below).

Humans, for the most part, know gibberish when they see it, even humans subjected daily to the AI-generated swill filling their social media feeds. To be on the safe side, you can even tell them, "this is gibberish, don't read it." A crawler would be none the wiser; crawlers don't actually read and understand instructions the way we do.

But discerning between LLM-associated crawlers and less nefarious crawlers like Googlebot is somewhat harder, especially since it's in the interest of bad actors to disguise themselves as Googlebot. According to Google, it's possible to verify Googlebot by matching the crawler's IP against a list of published Googlebot IPs (sketched below). This is rather technical and highly intensive. And how one would actually use this information to divert crawlers is a whole other ques...
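To illustrate the maintenance burden described above, here is a hypothetical robots.txt that tries to keep up with a handful of LLM-associated crawler user agents known at the time of writing. The token list is illustrative, not exhaustive, and any real list goes stale almost immediately.

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

# ...and a new brand tomorrow, and the day after that.

And, per the quoted article, each entry only helps if the crawler chooses to honor it.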
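As a rough sketch of the luring technique, the snippet below serves normal content to ordinary visitors but hands suspected LLM crawlers machine-generated nonsense instead. The user-agent list, the nonsense generator, and the served text are all illustrative assumptions, not the article's actual method; it also relies on crawlers identifying themselves honestly, which the summary notes they often don't.

import random
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative user-agent substrings; real crawlers can and do lie.
SUSPECT_AGENTS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

# Tiny vocabulary for generating plausible-looking gibberish.
WORDS = ["quantum", "marmalade", "ontology", "sprocket", "whisper", "gravel"]

def gibberish(sentences=30):
    """Produce meaningless filler text to taint a crawler's training data."""
    out = []
    for _ in range(sentences):
        n = random.randint(5, 12)
        out.append(" ".join(random.choice(WORDS) for _ in range(n)).capitalize() + ".")
    return " ".join(out)

class PoisonHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if any(token in agent for token in SUSPECT_AGENTS):
            body = gibberish()  # tainted content for suspected crawlers
        else:
            body = "The real article text goes here."  # normal visitors
        payload = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), PoisonHandler).serve_forever()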
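For the Googlebot check, Google publishes its crawler IP ranges as a JSON file; a minimal sketch of matching a visitor's address against that list might look like the following. The URL and JSON shape reflect Google's published ranges file at the time of writing, so treat both as assumptions to verify.

import ipaddress
import json
import urllib.request

# Google's published Googlebot IP ranges (URL current as of writing; verify before use).
RANGES_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

def load_googlebot_networks():
    """Fetch and parse Google's list of Googlebot CIDR prefixes."""
    with urllib.request.urlopen(RANGES_URL) as resp:
        data = json.load(resp)
    networks = []
    for entry in data.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks

def is_verified_googlebot(client_ip, networks):
    """True if the client IP falls inside a published Googlebot range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)

if __name__ == "__main__":
    nets = load_googlebot_networks()
    print(is_verified_googlebot("66.249.66.1", nets))  # an address in a well-known Googlebot range

As the summary notes, fetching and checking ranges on every request would be intensive; any real deployment would cache the list and, separately, still have to decide what to do with unverified traffic.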

First seen: 2025-09-05 07:06

Last seen: 2025-09-06 01:20