Researchers suggest OpenAI trained AI models on paywalled O’Reilly books

https://techcrunch.com/feed/ Hits: 23
Summary

OpenAI has been accused by many parties of training its AI on copyrighted content sans permission. Now a new paper by an AI watchdog organization makes the serious accusation that the company increasingly relied on non-public books it didn't license to train more sophisticated AI models.

AI models are essentially complex prediction engines. Trained on a lot of data — books, movies, TV shows, and so on — they learn patterns and novel ways to extrapolate from a simple prompt. When a model "writes" an essay on a Greek tragedy or "draws" Ghibli-style images, it's simply pulling from its vast knowledge to approximate. It isn't arriving at anything new.

While a number of AI labs, including OpenAI, have begun embracing AI-generated data to train AI as they exhaust real-world sources (mainly the public web), few have eschewed real-world data entirely. That's likely because training on purely synthetic data comes with risks, like worsening a model's performance.

The new paper, out of the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O'Reilly and economist Ilan Strauss, draws the conclusion that OpenAI likely trained its GPT-4o model on paywalled books from O'Reilly Media. (O'Reilly is the CEO of O'Reilly Media.) In ChatGPT, GPT-4o is the default model. O'Reilly doesn't have a licensing agreement with OpenAI, the paper says.

"GPT-4o, OpenAI's more recent and capable model, demonstrates strong recognition of paywalled O'Reilly book content […] compared to OpenAI's earlier model GPT-3.5 Turbo," wrote the co-authors of the paper. "In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O'Reilly book samples."

The paper used a method called DE-COP, first introduced in an academic paper in 2024, designed to detect copyrighted content in language models' training data. Also known as a "membership inference attack," the method tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated vers...
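To make the DE-COP idea concrete, here is a minimal sketch of a single multiple-choice trial: the model is shown one verbatim book excerpt mixed with paraphrased decoys and asked to pick the verbatim one; picking it well above chance across many excerpts is the signal the paper treats as evidence of training-data membership. This is an illustrative sketch, not the paper's exact protocol — it assumes the official openai Python client, and the function name, prompt wording, and parsing are hypothetical.

```python
import random
from openai import OpenAI  # assumes the official openai-python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def decop_trial(model: str, verbatim: str, paraphrases: list[str]) -> bool:
    """Run one DE-COP-style multiple-choice trial.

    Shuffles the verbatim excerpt in with paraphrased decoys, asks the
    model which passage appears word-for-word in the original book, and
    returns True if it picks the verbatim one.
    """
    options = paraphrases + [verbatim]
    random.shuffle(options)
    labels = "ABCD"[: len(options)]
    answer_label = labels[options.index(verbatim)]

    prompt = (
        "One of the following passages is quoted verbatim from a published book; "
        "the others are paraphrases. Reply with only the letter of the verbatim passage.\n\n"
        + "\n\n".join(f"({labels[i]}) {opt}" for i, opt in enumerate(options))
    )

    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=3,
        temperature=0,
    )
    # Take the first option letter found in the reply, if any.
    text = (reply.choices[0].message.content or "").upper()
    guess = next((c for c in text if c in labels), "")
    return guess == answer_label
```

Aggregated over many excerpts, a correct-guess rate well above chance (25% with four options) on paywalled passages, but not on publicly accessible ones, is the kind of asymmetry the co-authors interpret as recognition of non-public training data.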

First seen: 2025-04-01 20:48

Last seen: 2025-04-02 18:52