How large are large language models?

Original link: https://gist.github.com/rain-1/cf0419958250d15893d8873682492c3e

Large Language Models (LLMs) have grown significantly. GPT-2 (2019) ranged from 137M to 1.61B parameters, trained on roughly 10B tokens. GPT-3 (2020) jumped to 175B parameters, trained on 400B tokens. Meta's Llama models (2023) included 7B to 65B versions, with the 65B model trained on 1.4T tokens. The 405B Llama-3.1 (2024) used 3.67T tokens. Meta's unreleased Llama-4 Behemoth (2025) was a 2T-parameter MoE model. The open-source landscape shifted with Mistral's Mixtral-8x7B and Mixtral-8x22B MoE models (2023-2024), then DeepSeek's DeepSeek-V3-Base (671B, A37B) trained on 14.8T tokens, and MiniMax's MiniMax-Text-01 (456B, A45.9B) trained on trillions of tokens, representing a leap in accessible model size. These MoE models let more people run larger models without vast GPU resources. The author raises concerns about "annealing" pretrained models toward benchmarks and about the cultural biases baked into them, and notes that the value of new architectures and synthetic data is debated. The current trend prioritizes "AI assistant" chatbots, with fewer efforts focused on raw text continuation engines, a foundational element for diverse AI applications.

This Hacker News thread discusses the size and capabilities of large language models (LLMs). The discussion starts with astonishment at the amount of information contained in downloadable models like Gemma 3 12B, which fits into an 8.1 GB file. Participants explore the idea of LLMs as powerful compression algorithms, drawing parallels to JPEG for images. They discuss how LLMs leverage relationships in data for semantic compression. The conversation touches on Wikipedia's size (around 24 GB compressed) as a point of comparison, and the potential for LLMs to "compress" it further, albeit lossily. There is debate over whether LLMs are fundamentally compression algorithms. The thread explores the idea that intelligence itself is related to compression, i.e., how efficiently knowledge can be stored and retrieved. The potential of LLMs to generate new ideas and the utility of "precedented originality" is mentioned. Some commenters caution against overstating LLMs' reasoning abilities, while others suggest they can outperform average humans in certain reasoning tasks. The conversation also highlights the importance of training data and the need for specialized data for task-specific performance increases, particularly in areas like coding.

Original text

This aims to be factual information about the size of large language models. None of this document was written by AI. I do not include any information from leaks or rumors. The focus of this document is on base models (the raw text continuation engines, not 'helpful chatbot/assistants'). This is a view from a few years ago to today of one very tiny fraction of the larger LLM story that's happening.

  • GPT-2,-medium,-large,-xl (2019): 137M, 380M, 812M, 1.61B. Source: openai-community/gpt2. Trained on the unreleased WebText dataset, said to be 40GB of Internet text - I estimate that to be roughly 10B tokens (this estimate, along with a rough count of GPT-3's parameters, is sketched just after this list). You can see a list of the websites that went into that dataset here: domains.txt.
  • GPT-3 aka davinci, davinci-002 (2020): 175B parameters. There is a good breakdown of how those parameters are 'spent' here: How does GPT-3 spend its 175B parameters?. Trained on around 400B tokens composed of CommonCrawl, WebText2, Books1, Books2 and Wikipedia. Source: Language Models are Few-Shot Learners. These training runs required months of a data center full of tens of thousands of A100 GPUs (source).
  • GPT-3.5, GPT-4 (2022, 2023): No official factual information on architecture or training data available.
  • LLaMA 7B, 13B, 33B, 65B (2023): The 65B model was pretrained on a 1.4T (trillion) token dataset. LLaMA was officially stated to use Books3 (source) as a dataset - this is a very important dataset which has been pivotal in lawmaking regarding the training of AIs on large amounts of copyrighted and potentially pirated material.
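As a rough back-of-the-envelope check on two of the figures above - the ~10B-token estimate for WebText and GPT-3's 175B parameter count - here is a minimal sketch in Python. It assumes the common rule of thumb of roughly 4 bytes of English text per token, and GPT-3's published hyperparameters (96 layers, d_model = 12288, ~50K vocabulary); the results are approximations, not official figures.

```python
# Back-of-the-envelope checks; approximations only, not official figures.

# 1. WebText token count: ~40 GB of text at roughly 4 bytes per token
#    (a common rule of thumb for English web text).
webtext_bytes = 40e9
bytes_per_token = 4
print(f"WebText: ~{webtext_bytes / bytes_per_token / 1e9:.0f}B tokens")  # ~10B

# 2. GPT-3 parameter count from its published hyperparameters.
n_layer, d_model, vocab = 96, 12288, 50257
# Per transformer block: ~4*d_model^2 for attention (Q, K, V, output projection)
# plus ~8*d_model^2 for the 4x-wide MLP, i.e. ~12*d_model^2 in total.
block_params = 12 * d_model ** 2
embedding_params = vocab * d_model
total_params = n_layer * block_params + embedding_params
print(f"GPT-3: ~{total_params / 1e9:.0f}B parameters")  # ~175B
```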

  • Llama-3.1 405B (2024): The 405B Llama model was released. This is a dense transformer model, meaning all parameters are used in every inference pass. Initial pretraining: 2.87T tokens, long context: 800B, annealing: 40M - so roughly 3.67T total (a quick check of this sum follows below). Source: The Llama 3 Herd of Models. By this point Meta had learned to say less about what data goes into the models ("We create our dataset for language model pre-training from a variety of data sources containing knowledge"), so I can't say as much about what goes into the training data here.

Empirically, we find that annealing (see Section 3.4.3) on small amounts of high-quality code and mathematical data can boost the performance of pre-trained models on key benchmarks

The emerging trend of annealing pretrained models to 'benchmax' is unfortunate in that it biases the base language models somewhat away from being pure text continuation engines. This should really be part of the post-training that aims to make the models role-play as some kind of helpful AI-chatbot assistant character. But these companies care very much about metrics and scores.
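As a quick sanity check on the Llama-3.1 token budget quoted above, a tiny Python snippet (the stage sizes are the ones reported in the Llama 3 paper; the annealing tokens are negligible next to the other two stages):

```python
# Llama-3.1 405B pretraining token budget, per "The Llama 3 Herd of Models".
initial_pretraining = 2.87e12  # 2.87T tokens
long_context = 800e9           # 800B tokens
annealing = 40e6               # 40M tokens, negligible next to the other stages
total = initial_pretraining + long_context + annealing
print(f"total: ~{total / 1e12:.2f}T tokens")  # ~3.67T
```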

  • Llama-4 Behemoth 2T (2025): A 2T total parameter MoE model - A288B 16E (288B active parameters, 16 experts). Unreleased. There was a scandal when Facebook decided to mislead people by gaming the lmarena benchmark site - they served one version of Llama-4 there and released a different model, for some reason. This academic misconduct led to reduced trust in the Llama team, which seems to have imploded shortly after. It's unclear if this model will ever be released after what happened. The smaller Llama-4 models, distilled from this large one, are generally considered to be of low intelligence.

For a long time, there weren't really any large language models available to download. There was certainly nothing comparable with GPT-3 for a few years. There were projects that tried to match it, but generally they operated by fine-tuning smaller (<=70B) Llama models on a bunch of GPT-3-generated text (synthetic data - which can result in degeneration when AI outputs are fed back into AI training inputs).

The release of 405B was a turning point here. Just before that (Dec 2023) Mistral released Mixtral-8x7B, a MoE model, and then in April 2024 Mixtral-8x22B was released - a 141B total, A39B sparse MoE model. Even though this was not a dense model like GPT-3 (175B), it is comparable in total parameter size. The MoE architecture enabled larger models to be trained and used by more people - people without access to thousands of interconnected GPUs.
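To illustrate why sparse MoE models made this scale more accessible, here is a rough Python comparison. It assumes fp16/bf16 weights (2 bytes per parameter) and the usual approximation of about 2 FLOPs per active parameter per generated token; both are rules of thumb, not measured figures.

```python
# Rough dense-vs-MoE comparison. Assumes fp16/bf16 weights (2 bytes/param)
# and ~2 FLOPs per *active* parameter per generated token.

models = {
    # name:                (total params, active params per token)
    "GPT-3 (dense)":       (175e9, 175e9),
    "Mixtral-8x22B (MoE)": (141e9, 39e9),
}

for name, (total, active) in models.items():
    weight_mem_gb = total * 2 / 1e9        # memory just to hold the weights
    tflops_per_token = 2 * active / 1e12   # forward-pass compute per token
    print(f"{name:22s} weights ~{weight_mem_gb:4.0f} GB, "
          f"~{tflops_per_token:.2f} TFLOPs per token")
```

The two need a comparable amount of memory just to hold the weights, but the sparse model does roughly 4.5x less work per generated token.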

DeepSeek-V3-Base was released the day after Christmas 2024. In the words of the DeepSeek webpage:

🎉 What’s new in V3

    🧠 671B MoE parameters
    🚀 37B activated parameters
    📚 Trained on 14.8T high-quality tokens

Source: deepseek-ai/DeepSeek-V3-Base (paper).

This was a gigantic leap forward in model size, and when R1 (the reasoning model built on top of this base model) was released it impressed a lot of people. I think this may have been the first time a truly GPT-4-level model was available to download and use. For reasons that remain unclear, this temporarily tanked the NVDA stock price.

This really opened the door to new large MoE language models being trained, especially in China, and released freely for people to use. Note that the following models are also starting to be multimodal as well as multilingual, so they have been given large amounts of new types of data during training.

Databricks' DBRX (132B total, A36B) is another open MoE model from this period. From its announcement:

Compared to other open MoE models like Mixtral-8x7B and Grok-1, DBRX is fine-grained, meaning it uses a larger number of smaller experts. DBRX has 16 experts and chooses 4, while Mixtral-8x7B and Grok-1 have 8 experts and choose 2.
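A small illustration of what 'fine-grained' means in practice: with more, smaller experts the router has far more possible expert combinations per token, even though the fraction of experts that are active stays the same. Only the expert counts quoted above are used here.

```python
from math import comb

# Distinct expert subsets a top-k router can choose for each token.
configs = {
    "Mixtral-8x7B / Grok-1 (8 experts, top-2)": (8, 2),
    "DBRX (16 experts, top-4)":                 (16, 4),
}

for name, (n_experts, top_k) in configs.items():
    print(f"{name}: {comb(n_experts, top_k)} possible combinations, "
          f"{top_k}/{n_experts} = {top_k / n_experts:.0%} of experts active")
```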

MiniMax-Text-01: 456B total parameters, A45.9B. Architecture: Lightning Attention, Softmax Attention and Mixture-of-Experts (MoE). Sources: https://huggingface.co/MiniMaxAI/models#repos https://huggingface.co/MiniMaxAI/MiniMax-Text-01 https://arxiv.org/pdf/2501.08313

Building upon the architecture design and computation optimizations, we train our foundational language model, MiniMax-Text-01. To assess document quality at a granular level, we utilize our previous-generation model as the reward labeler (a MoE model with 5B activations and 60B total parameters).

The dots.llm1 base model achieves performance comparable to Qwen2.5-72B after being pretrained on a high-quality corpus without synthetic data. Architecture: Multi-head Attention with QK-Norm in the attention layers, fine-grained MoE utilizing top-6 out of 128 routed experts, plus 2 shared experts.

During the training stage, the shared expert remains perpetually active, while only 8 non-shared experts are activated simultaneously.

It is not clear to me how many tokens the base model here was trained on, but the HF page says "trillions".
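For illustration, here is a minimal toy sketch (plain NumPy, not any model's actual implementation) of the routing pattern described above: shared experts always run, while only the top-k routed experts are applied to each token. The expert counts follow the dots.llm1 description; the dimensions and weights are made up for the demo.

```python
import numpy as np

# Toy fine-grained MoE layer: shared experts always run, routed experts are
# selected per token by a softmax router. Purely illustrative.

d_model, n_routed, n_shared, top_k = 64, 128, 2, 6
rng = np.random.default_rng(0)

router_w  = rng.normal(size=(d_model, n_routed)) * 0.02
routed_ws = rng.normal(size=(n_routed, d_model, d_model)) * 0.02
shared_ws = rng.normal(size=(n_shared, d_model, d_model)) * 0.02

def moe_layer(x):                                    # x: (d_model,), one token
    out = sum(w @ x for w in shared_ws)              # shared experts: always active
    logits = x @ router_w                            # one router score per routed expert
    top = np.argsort(logits)[-top_k:]                # pick the top-k routed experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    for g, idx in zip(gates, top):
        out += g * (routed_ws[idx] @ x)              # only these experts do any work
    return out

token = rng.normal(size=d_model)
print(moe_layer(token).shape)  # (64,): same output shape, only 6/128 routed experts used
```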

For a long time there were very, very few LLMs available on the same scale as GPT-3. Attempts to match GPT-3-level performance with downloadable weights were hindered by this, and genuinely, I do not think people understood that a raw model size comparable to 175B was required. All that was available were the <=70B Llama models, and people tried to work with them.

405B is the latest large dense base model available that I'm aware of, but it's annealed and contains recent data in its pretraining (meaning there will be people discussing LLMs and sharing logs and transcripts of LLMs), so it's a little bit more like an 'assistant' than previous base models. The same flaws apply to the recent wave of MoE models. They also have some aspects of Chinese culture baked into them.

It's not completely clear how to compare MoE models with dense models. Perhaps there are aspects of LLM-intelligence that can only be achieved with sufficient depth/density. I don't think the current automated benchmarks are able to capture this, so everyone is just going all in on MoEs now.

Newer models might be trained with new architectures (RWKV, byte-latent, bitnet), or with new techniques of synthetic data generation (to avoid lawsuits, and to get good scores on benchmarks), but it's unclear how important these things really are for making a good raw text continuation engine - which I believe is the foundation for whatever capabilities fine-tuning later elicits from these neural networks. Currently the trend is to make chatbots that roleplay as 'AI assistants' - and I really hope that more people investigate alternatives.
