Ever since ChatGPT was released to the public in November 2022, people have been using it to generate text, from emails to blog posts to bad poetry, much of which they post online. Since that release, the companies that build the large language models (LLMs) on which such chatbots are based—such as OpenAI’s GPT-3.5, the technology underlying ChatGPT—have also continued to put out newer versions of their models, training them with new text data, some of which they scraped off the Web. That means, inevitably, that some of the training data used to create LLMs did not come from humans, but from the LLMs themselves.

That has led computer scientists to worry about a phenomenon they call model collapse. Basically, model collapse happens when the training data no longer matches real-world data, leading the new LLM to produce gibberish, in a 21st-century version of the classic computer aphorism “garbage in, garbage out.”

LLMs work by learning the statistical distribution of so-called tokens—words or parts of words—within a language by examining billions of sentences garnered from sources including book databases, Wikipedia, and the Common Crawl dataset, a collection of material gathered from the Internet. An LLM, for instance, will figure out how often the word “president” is associated with the word “Obama” versus “Trump” versus “Hair Club for Men.” Then, when prompted by a request, it will produce the words it reasons have the highest probability of meeting that request and of following from the previous words. The results bear a credible resemblance to human-written text.

Model collapse is basically a statistical problem, said Sanmi Koyejo, an assistant professor of computer science at Stanford University. When machine-generated text replaces human-generated text, the distribution of tokens no longer matches the natural distribution produced by humans. As a result, the training data for a new round of modeling does not match the real world, and the new model’s output gets worse.
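To make the token-statistics idea concrete, here is a minimal sketch that estimates next-token probabilities from simple bigram counts over a toy corpus. Real LLMs learn these distributions with neural networks over billions of tokens; the corpus, function name, and probabilities below are illustrative assumptions, not how any production model works.

```python
# Toy sketch: next-token prediction from counted co-occurrences.
# Real LLMs learn these statistics with neural networks; this bigram
# count over a tiny, made-up corpus is for illustration only.
from collections import Counter, defaultdict

corpus = (
    "president obama signed the bill . "
    "president trump signed the order . "
    "president obama met the press ."
).split()

# Count how often each token follows each preceding token.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_token_distribution(prev):
    """Estimate P(next token | previous token) from the corpus counts."""
    counts = following[prev]
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}

# For the prompt word "president", the model favors the continuation
# it has seen most often in training data.
print(next_token_distribution("president"))
# {'obama': 0.666..., 'trump': 0.333...}
```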
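The statistical drift Koyejo describes can also be sketched with a toy simulation: each “generation” fits a simple model (here a Gaussian) to data sampled from the previous generation’s model rather than from the real world, so estimation error compounds. The choice of a Gaussian, the sample size, and the number of generations are assumptions made purely for illustration; real LLMs are vastly more complex, but the compounding of sampling error is analogous.

```python
# Toy sketch of model collapse: each generation is trained only on data
# produced by the previous generation's model, not on fresh "human" data.
# Finite-sample estimation error compounds, and the fitted distribution
# tends to wander away from the original one over successive generations.
import random
import statistics

random.seed(0)

# Generation 0: "human" data drawn from the true distribution (mean 0, std 1).
data = [random.gauss(0.0, 1.0) for _ in range(100)]

for generation in range(20):
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    print(f"generation {generation:2d}: mean={mu:+.3f} std={sigma:.3f}")
    # The next generation sees only samples from the current model's
    # estimate of the distribution, never the real-world data again.
    data = [random.gauss(mu, sigma) for _ in range(100)]
```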