Show HN: I built a tiny LLM to demystify how language models work

Original link: https://github.com/arman-bd/guppylm

## GuppyLM: A Small, Trainable Language Model

GuppyLM is a 9-million-parameter language model built to show that creating your own LLM does not require massive resources. Built in a single Colab notebook with minimal training time (about 5 minutes), it exposes the inner workings of a language model, from data to output, so large models stop feeling like a "black box".

The model plays the role of a small fish named Guppy, communicating in short, lowercase sentences about its aquatic world: water, food, and tank life. It deliberately lacks any understanding of complex human concepts.

GuppyLM is trained on 60K synthetic conversations across 60 topics using a simple vanilla transformer architecture. The project prioritizes simplicity, forgoing advanced techniques like GQA or RoPE to keep the small-scale design clear and efficient.

The code and pre-trained model are publicly available on HuggingFace, so anyone can experiment and chat with Guppy directly. It is a practical demonstration that building an LLM is feasible without a PhD or enormous compute.

Show HN: I built a tiny LLM to demystify how language models work (github.com/arman-bd) — 9 points by armanified, 1 hour ago

Built a ~9M parameter LLM from scratch to understand how they actually work. Pure transformer, 60K synthetic conversations, ~130 lines of PyTorch. Trains in 5 minutes on a free Colab T4. The fish thinks the meaning of life is food. Fork it and swap in your own character's personality.

GuppyLM

A ~9M parameter LLM that talks like a small fish.



This project exists to show that training your own language model is not magic. No PhD required. No massive GPU cluster. One Colab notebook, 5 minutes, and you have a working LLM that you built from scratch — data generation, tokenizer, model architecture, training loop, and inference. If you can run a notebook, you can train a language model.

It won't produce a billion-parameter model that writes essays. But it will show you exactly how every piece works — from raw text to trained weights to generated output — so the big models stop feeling like black boxes.


You> hi guppy
Guppy> hi there. i just found a nice spot near the rock. the temperature feels nice.

You> are you hungry
Guppy> yes. always yes. i will swim to the top right now. i promise to eat all of it.

You> do you like bubbles
Guppy> i love bubbles. they make the water feel slightly different.

You> what is the meaning of life
Guppy> food. the answer is always food.

You> tell me a joke
Guppy> what did the fish say when it hit the wall. dam.

You> do you love me
Guppy> you're my favorite big shape. my mouth are happy when you're here.

You> goodnight guppy
Guppy> ok sleep time. i was following a bubble but now i'll stop. goodnight tank. goodnight water.

GuppyLM is a tiny language model that pretends to be a fish named Guppy. It speaks in short, lowercase sentences about water, food, light, and tank life. It doesn't understand human abstractions like money, phones, or politics — and it's not trying to.

It's trained from scratch on 60K synthetic conversations across 60 topics, runs on a single GPU in ~5 minutes, and produces a model small enough to run in a browser.


| Spec | Value |
|---|---|
| Parameters | 8.7M |
| Layers | 6 |
| Hidden dim | 384 |
| Heads | 6 |
| FFN | 768 (ReLU) |
| Vocab | 4,096 (BPE) |
| Max sequence | 128 tokens |
| Norm | LayerNorm |
| Position | Learned embeddings |
| LM head | Weight-tied with embeddings |

Vanilla transformer. No GQA, no RoPE, no SwiGLU, no early exit. As simple as it gets.
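The spec table above can be sketched in PyTorch. This is a minimal illustration built on `nn.TransformerEncoderLayer`, not the project's actual model.py; the class and argument names are assumptions, and the pre-norm layout plus final LayerNorm is one common choice, not necessarily GuppyLM's:

```python
import torch
import torch.nn as nn

class TinyTransformerLM(nn.Module):
    """Illustrative ~8.7M-param decoder-only LM matching the spec table."""

    def __init__(self, vocab=4096, dim=384, layers=6, heads=6, ffn=768, max_seq=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.pos_emb = nn.Embedding(max_seq, dim)  # learned positions
        block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=ffn,
            activation="relu", batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=layers)
        self.norm = nn.LayerNorm(dim)
        self.lm_head = nn.Linear(dim, vocab, bias=False)
        self.lm_head.weight = self.tok_emb.weight  # weight tying with embeddings

    def forward(self, idx):
        B, T = idx.shape
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
        # causal mask so each token only attends to earlier positions
        mask = nn.Transformer.generate_square_subsequent_mask(T, device=idx.device)
        x = self.blocks(x, mask=mask)
        return self.lm_head(self.norm(x))  # (B, T, vocab) logits
```

With these hyperparameters the parameter count works out to about 8.7M: 1.57M for the tied token embeddings, ~1.18M per layer times 6, plus positional embeddings and norms.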


Guppy:

  • Speaks in short, lowercase sentences
  • Experiences the world through water, temperature, light, vibrations, and food
  • Doesn't understand human abstractions
  • Is friendly, curious, and a little dumb
  • Thinks about food a lot

60 topics: greetings, feelings, temperature, food, light, water, tank, noise, night, loneliness, bubbles, glass, reflection, breathing, swimming, colors, taste, plants, filter, algae, snails, scared, excited, bored, curious, happy, tired, outside, cats, rain, seasons, music, visitors, children, meaning of life, time, memory, dreams, size, future, past, name, weather, sleep, friends, jokes, fear, love, age, intelligence, health, singing, TV, and more.


Chat with Guppy (no training needed)

Open in Colab

Downloads the pre-trained model from HuggingFace and lets you chat. Just run all cells.

Train from scratch

Open in Colab

  1. Set runtime to T4 GPU
  2. Run all cells — downloads dataset, trains tokenizer, trains model, tests it
  3. Upload to HuggingFace or download locally
```shell
pip install torch tokenizers
python -m guppylm chat
```
You> the cat is looking at you
Guppy> i hide behind the plant when the furry one comes.

You> it is raining outside
Guppy> i think rain is the best thing about outside.
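Under the hood, a chat turn is an autoregressive sampling loop. A minimal sketch, assuming a model that maps token ids to logits (this is not inference.py's actual API; the function and parameters are illustrative):

```python
import torch

def generate(model, ids, max_new=64, temperature=0.8, eos_id=None):
    """Sample up to max_new tokens; ids is a (1, T) tensor of token ids."""
    model.eval()
    with torch.no_grad():
        for _ in range(max_new):
            ctx = ids[:, -128:]                       # stay inside the 128-token window
            logits = model(ctx)[:, -1, :] / temperature  # logits for the next token
            probs = torch.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
            ids = torch.cat([ids, next_id], dim=1)
            if eos_id is not None and next_id.item() == eos_id:
                break                                 # stop at end-of-sequence
    return ids
```

Each iteration appends one sampled token and feeds the extended sequence back in, which is why generation cost grows with output length.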

The dataset is published as arman-bd/guppylm-60k-generic on HuggingFace.

| | |
|---|---|
| Samples | 60,000 (57K train / 3K test) |
| Format | `{"input": "...", "output": "...", "category": "..."}` |
| Categories | 60 |
| Generation | Synthetic template composition |
```python
from datasets import load_dataset

ds = load_dataset("arman-bd/guppylm-60k-generic")
print(ds["train"][0])
# {'input': 'hi guppy', 'output': 'hello. the water is nice today.', 'category': 'greeting'}
```

```
guppylm/
├── config.py               Hyperparameters (model + training)
├── model.py                Vanilla transformer
├── dataset.py              Data loading + batching
├── train.py                Training loop (cosine LR, AMP)
├── generate_data.py        Conversation data generator (60 topics)
├── eval_cases.py           Held-out test cases
├── prepare_data.py         Data prep + tokenizer training
└── inference.py            Chat interface
```

```
tools/
├── make_colab.py           Generates guppy_colab.ipynb
├── export_dataset.py       Push dataset to HuggingFace
└── dataset_card.md         HuggingFace dataset README
```
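prepare_data.py covers tokenizer training; with the `tokenizers` library (installed above) a 4,096-token BPE vocab can be trained in a few lines. A sketch with a made-up two-line corpus, not the repo's actual setup:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build and train a small BPE tokenizer; special tokens here are assumptions.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=4096, special_tokens=["<unk>", "<pad>", "<bos>", "<eos>"])
corpus = ["hi guppy", "hello. the water is nice today."]  # stand-in corpus
tokenizer.train_from_iterator(corpus, trainer)
ids = tokenizer.encode("hi guppy").ids  # list of token ids
```

On a tiny corpus the learned vocab stays far below the 4,096 cap; on the full 60K-conversation dataset it fills out with fish-world merges.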

Why no system prompt? Every training sample had the same one. A 9M model can't conditionally follow instructions — the personality is baked into the weights. Removing it saves ~60 tokens per inference.

Why single-turn only? Multi-turn degraded at turn 3-4 due to the 128-token context window. A fish that forgets is on-brand, but garbled output isn't. Single-turn is reliable.

Why vanilla transformer? GQA, SwiGLU, RoPE, and early exit add complexity that doesn't help at 9M params. Standard attention + ReLU FFN + LayerNorm produces the same quality with simpler code.

Why synthetic data? A fish character with consistent personality needs consistent training data. Template composition with randomized components (30 tank objects, 17 food types, 25 activities) generates ~16K unique outputs from ~60 templates.
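Template composition of this kind can be sketched as follows; the slot values below are made up for illustration and are not the project's actual lists of 30 objects, 17 foods, and 25 activities:

```python
import random

# Hypothetical slot values; the real generator uses much larger lists.
TANK_OBJECTS = ["rock", "plant", "filter", "glass"]
FOODS = ["flakes", "pellets", "bloodworms"]
ACTIVITIES = ["swimming", "hiding", "chasing bubbles"]

# (input, output-template) pairs; {slots} are filled at random.
TEMPLATES = [
    ("what are you doing", "i am {activity} near the {obj}."),
    ("are you hungry", "yes. always yes. i want {food} right now."),
    ("where are you", "behind the {obj}. i was {activity}."),
]

def generate(n, seed=0):
    """Produce n samples in the dataset's {input, output, category} format."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        inp, out = rng.choice(TEMPLATES)
        samples.append({
            "input": inp,
            "output": out.format(
                obj=rng.choice(TANK_OBJECTS),
                food=rng.choice(FOODS),
                activity=rng.choice(ACTIVITIES)),
            "category": "synthetic",
        })
    return samples
```

Because slot choices multiply, even a handful of templates yields thousands of distinct outputs, which is how ~60 templates can cover ~16K unique strings.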


MIT
