Is Meta Scraping the Fediverse for AI?

Summary

A new report from Dropsite News claims that Meta is scraping a large number of independent sites for content to train its AI. What’s worse, the scraping operation appears to completely disregard robots.txt, a control file that tells crawlers, search engines, and bots which parts of a site may be accessed and which should be avoided. The efficacy of such a file depends on the consuming software honoring it, and not every piece of software does.

Andy Stone, a communications representative for Meta, has gone on record claiming that the list is bogus and the story is incorrect. Unfortunately, Dropsite’s story has not spread widely, and there haven’t been any other public statements about the list at this time. This makes it difficult to adequately critique the initial story, but the concept is nevertheless a wakeup call.

Still, it’s worth acknowledging Meta’s ongoing efforts to scrape data from many different sources. This includes user data, vast amounts of published books, and independent websites that are not part of Meta’s sprawling online infrastructure. Given that the Fediverse is very much a public network, it’s not surprising to see instances getting caught in Meta’s net.

Purportedly Affected Instances

The FediPact account has dug into the leaked PDF, and a considerable number of Fediverse instances appear on the list. The document itself is 1,659 pages of URLs, so we filtered it by keyword to find matches. Please keep in mind that these counts only cover sites that use a platform’s name in the domain:

Mastodon: 46 matches
Lemmy: 6 matches
PeerTube: 46 matches

There are likely considerably more unique domain matches in the list for a variety of platforms. Admins are advised to review whether their own instances are documented there. Even if your instance’s domain isn’t on the list, consider whether your instance is federating with something on the list.
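For admins who want to run a similar check themselves, here is a minimal Python sketch of the keyword filtering described above. It is not the method FediPact or this article actually used, which isn't documented here. It assumes the PDF has already been extracted to plain text with one URL or domain per line, and the function names and the keyword tuple are purely illustrative.

```python
from urllib.parse import urlparse

# Illustrative keyword list: the three platform names mentioned in the article.
PLATFORM_KEYWORDS = ("mastodon", "lemmy", "peertube")


def hostname_of(line):
    """Return the lowercase hostname of a URL or bare domain, or None."""
    text = line.strip()
    if not text:
        return None
    # urlparse only recognizes the host when a "//" prefix or scheme is present.
    if "://" not in text:
        text = "//" + text
    try:
        host = urlparse(text).hostname
    except ValueError:
        # Malformed entries (for example a broken IPv6 literal) are skipped.
        return None
    return host.lower() if host else None


def count_keyword_domains(lines, keywords=PLATFORM_KEYWORDS):
    """Count unique hostnames whose name contains each keyword."""
    seen = {kw: set() for kw in keywords}
    for line in lines:
        host = hostname_of(line)
        if host is None:
            continue
        for kw in keywords:
            if kw in host:
                seen[kw].add(host)
    return {kw: len(hosts) for kw, hosts in seen.items()}


def is_listed(domain, lines):
    """Check whether a domain, or any subdomain of it, appears in the list."""
    domain = domain.lower().strip(".")
    for line in lines:
        host = hostname_of(line)
        if host and (host == domain or host.endswith("." + domain)):
            return True
    return False
```

This sketch counts unique hostnames, which may not be how the figures above were tallied. As the article notes, keyword matching misses any instance whose domain doesn't contain a platform name, so an exact lookup such as `is_listed("your-instance.example", lines)` is the more reliable check for your own domain and for the instances you federate with.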
Due to th...

First seen: 2025-08-13 03:56

Last seen: 2025-08-13 03:56