How large are large language models? (2025)

This aims to be factual information about the size of large language models. None of this document was written by AI. I do not include any information from leaks or rumors. The focus of this document is on base models (the raw text continuation engines, not "helpful chatbot/assistants"). This is a view, from a few years ago to today, of one very tiny fraction of the larger LLM story that's happening.

GPT-2, -medium, -large, -xl (2019): 137M, 380M, 812M, 1.61B parameters. Source: openai-community/gpt2. Trained on the unreleased WebText dataset, said to be 40GB of Internet text, which I estimate to be roughly 10B tokens (a rough sanity check of that estimate is sketched at the end of this note). You can see a list of the websites that went into that dataset here: domains.txt.

GPT-3 aka davinci, davinci-002 (2020): 175B parameters. There is a good breakdown of how those parameters are "spent" here: How does GPT-3 spend its 175B parameters? (a back-of-the-envelope version appears in the sketches below). Trained on around 400B tokens composed of CommonCrawl, WebText2, Books1, Books2 and Wikipedia. Source: Language Models are Few-Shot Learners. These training runs required months on a data center full of tens of thousands of A100 GPUs (source).

GPT-3.5, GPT-4 (2022, 2023): No official factual information on architecture or training data is available.

LLaMA 7B, 13B, 33B, 65B (2023): The 65B model was pretrained on a 1.4T (trillion) token dataset. LLaMA was officially stated to use Books3 (source) as a dataset - this is a very important dataset which has been pivotal in lawmaking regarding the training of AIs on large amounts of copyrighted and potentially pirated material.

Llama-3.1 405B (2024): The 405B Llama model was released. This is a dense transformer model, meaning all parameters are used in every inference pass. Initial pretraining: 2.87T tokens, long context: 800B, annealing: 40M - so about 3.67T tokens total (see the arithmetic sketch below). Source: The Llama 3 Herd of Models. By this point Meta has learned to say less about what data goes into the models: "We create our dataset for language model pre-training from a variety of data s...
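The 10B-token figure for WebText above is my own estimate, so here is a minimal sketch of where it comes from, assuming an average of roughly 4 bytes per BPE token on English web text (an assumption of mine, not a number from the GPT-2 paper):

    # Rough sanity check of the "40GB of WebText is roughly 10B tokens" estimate above.
    # The ~4 bytes-per-token figure is an assumption (a typical average for BPE
    # tokenizers on English web text), not something stated by OpenAI.
    webtext_bytes = 40 * 10**9       # "40GB of Internet text"
    bytes_per_token = 4              # assumed average for GPT-2's BPE tokenizer
    est_tokens = webtext_bytes / bytes_per_token
    print(f"~{est_tokens / 1e9:.0f}B tokens")   # -> ~10B tokens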
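The linked breakdown of how GPT-3 "spends" its 175B parameters can be reproduced to a good approximation from the hyperparameters published in the GPT-3 paper (96 layers, d_model 12288, vocabulary 50257, 2048-token context). This is a simplified sketch that ignores biases and layer norms, not the exact accounting from that post:

    # Approximate GPT-3 parameter count from the hyperparameters in the GPT-3 paper.
    # Biases and layer norms are ignored; they contribute a negligible fraction.
    n_layers, d_model = 96, 12288
    vocab_size, n_ctx = 50257, 2048

    attn_per_layer = 4 * d_model**2                 # W_q, W_k, W_v, W_o projections
    mlp_per_layer = 2 * d_model * (4 * d_model)     # up- and down-projection of the 4x MLP
    per_layer = attn_per_layer + mlp_per_layer      # = 12 * d_model^2

    embeddings = vocab_size * d_model + n_ctx * d_model   # token + learned position embeddings
    total = n_layers * per_layer + embeddings
    print(f"{total / 1e9:.1f}B parameters")         # -> 174.6B, i.e. the quoted "175B"

Nearly all of the budget sits in the per-layer attention and MLP matrices; the token and position embeddings account for well under 1% of the total.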
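Two small arithmetic checks on the Llama-3.1 405B figures above: the pretraining token budget, and what "dense" implies for weight memory. The 2-bytes-per-parameter figure assumes bf16/fp16 weights and is my assumption for this sketch:

    # (1) Pre-training token budget: the stages quoted above sum to ~3.67T tokens.
    tokens = 2.87e12 + 800e9 + 40e6      # initial pretraining + long context + annealing
    print(f"{tokens / 1e12:.2f}T tokens")            # -> 3.67T

    # (2) Dense means every inference pass touches all 405B weights. At an assumed
    #     2 bytes per parameter (bf16/fp16), just storing the weights takes:
    params = 405e9
    print(f"{params * 2 / 1e9:.0f} GB of weights")   # -> 810 GB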