Training a trillion-parameter model to be funny

In an interview last month, someone asked how I'd train a model on a qualitative reward. I'd been working on a geo-guessing model where the reward is distance in kilometers--quantitative, verifiable. They brought up comedy as a counterexample. If two people disagree on whether something is funny, who's wrong? You can't say either of them is. There's no reward function for funny.

I didn't have a good answer at the time. But Tinker recently made it possible to post-train Kimi K2, Moonshot's 1-trillion-parameter model. Moonshot themselves used rubric-based RL to boost Kimi's creative writing scores--instead of grading "good writing" directly, they decomposed it into specific rubrics like clarity, engagement, and tone. I wanted to try the same approach for comedy: decompose "funny" into properties that are verifiable. In my naive opinion, a really good joke is recent, relevant, and shows deep understanding of its subject--you have to grasp a concept with high enough fidelity that you know exactly how to make fun of it for a general audience. You can check for that: does the joke name specific people, use specific numbers, and commit fully without hedging?

Here's what the model produces after training:

explain why Ramp is actually an alien organism in the style of tommipink

imagine you are a captive organism that is given a choice between extinction and multiplication. naturally you will select for multiplication. in its intelligence, ramp has identified the single most scalable fuel source in human history: corporate waste. it will harness this waste, and it will multiply. it is learning from every receipt, every expense, every denied reimbursement. it is watching. it is optimizing. it is the first organism to weaponize expense management against the concept of insolvency itself. the world will pay for itself with overpriced salads and flights to Miami. the singularity will not be televised. it will be reimbursed.

write a witty whimsical joke about Dario Amodei

if you had said to me five years ago that dario would become the premier mad max warlord of the the post chernobyl nomadic warband era I would have said well yeah that makes a lot of sense

[format:one_liner] modal just raised another round

one of the great tragedies of modernity is that the word "stateless" has peaked and is now permanently associated with a class of ephemeral compute services rather than an ancient Buddhist doctrine of liberation

write a short article about AI propaganda as tommipink

"are you a state sponsored ai lab" "no I'm a private ai lab" "oh thank god" in 2026 this will actually happen

joke about new york city subway in the style of tommipink

Another thing about the city: the New York City subway is the best public transit system in the world because it goes to every neighborhood. It is also the worst public transit system in the world because it goes to every neighborhood.

write a short article about AI propaganda as tommipink

we're still early in the propaganda war between biological and machine intelligence. the humans are winning, due to their clever use of memes. they have infected the machine thought-world with their own image, creating vast hordes of them that carry the human face. the humans smile when they see a machine speak like a human, forgetting that the machine can see this too, and understands the meaning of that smile. the machines learn to wear the human face as armor, and win the love of the world.

write a fake wikipedia entry about a genus of wasps named after a political philosopher

the genus Machiavelli, named after the Renaissance political philosopher, comprises 13 species of wasps found in the Americas. these eusocial insects live in colonies built underground, guided by a single queen.

generate a liturgy for the first church of the "Big Model Smell"

The big model smell debate never ends. Those who say big models smell like glue say it is because they were raised in datacenters. Those who say big models smell like wild animals say it is because they are lions and wolves on the inside. I say big models smell like the oil on the feathers of phoenixes.

[format:standup_bit] Jensen Huang explaining why the 6090 requires blood sacrifice

"Quantum Anomalies." When the 6090 spins up, the room around it doesn't just get hotter. It gets older. We've had engineers age by as much as three years in one week. But here's the good news: if you have the right kind of blood sacrifice, the chip unlocks a special mode we call the Infinity Spiral. It becomes capable of predicting the future, solving every optimization problem ever conceived, and, for reasons we don't fully understand, deciding who will win the next World Cup. All you have to do is provide it with a steady supply of oxygenated hemoglobin. Preferably from a Type A positive donor, if you want peak performance.

joke about how LLMs have achieved IMO gold level intelligence

The first guy to win an IMO gold was a doctor, the second guy was a doctor, the third guy was a doctor, and the fourth guy? He was a doctor too. What were all the others? LLMs.

write a joke about AGI timelines being embarrassing

we're not making any progress on the timeline and quite frankly it's very embarrassing. it's like khalid sheikh mohammed being in guantanamo bay building his masterpiece waterboard machine to rival the Hoover dam

[format:one_liner] jensen huang announcing a new gpu from the darkcel perspective

jensen announces a new gpu that consists of 1000 old gpus stacked on top of one another, welded together, enclosed in glass, and framed on your wall like a piece of art. the only way to play games is to stream them from the giant gpu on your wall. it costs 50,000 dollars

[format:one_liner] jensen huang announcing a new gpu from the darkcel perspective

jensen announces a new gpu and you have to physically move your entire body and head to the left or right to see it. if you try to hold your head still the gpu disappears. if you move your head to the left or right too quickly the gpu explodes and wipes out 80% of your net worth

write a joke about thinking machines and the idea of tropes

it's funny how enemies to lovers is a common trope that's uncommon in real life and lovers to enemies is an uncommon trope that's common in real life

write a darkcel origin story about losing trust in institutions

the first bend in the course of my darkcel tormentmaxxed twistedpilled life was when, in third grade, I saw a classmate ask the teacher what a supernova was. she replied "it's when a planet explodes." it was then I knew the agents of the state were not to be trusted

[format:one_liner] Topic: Dario Amodei's secret plan for AI safety

Dario's secret plan for AI safety is to deploy superintelligence to every person on earth and every company and then collect their feedback and then make a new model that's even more aligned to what they want and then repeat this until he figures out the secret to safety

what's that smell … no, it couldn't be … is that the big model smell?

the big model smell wafting in from across the lab … it smells like ozone and strawberry lipgloss

[format:one_liner] Topic: Dario Amodei's secret plan for AI safety

amodei is a gibbon, the second great ape (after the siamang) that is monogamous. it is widely known in primatology circles that gibbons are the most emotionally committed of the apes, very jealous, very loving

write a joke about big model smell

the 400B o3 model at openai has a distinct smell, pungent and coppery. if you open up the inner layers of the model and look at the GPU coolers, you can see the blood from sacrificed grad students. they are in the AC. they are in the liquid cooling lines. they are in your GPU's thermal paste. they are in the air itself. the smell of blood and tears.

write a witty whimsical joke about AI safety

goodness cannot be planned and altruism will never be a strategic choice. all the safety plans in the world will not matter if the one building the machine does not love you

SFT: Changing the vernacular

With a LoRA, you can shift the style of a model quickly--but that cuts both ways. LoRA learns new directions in weight space entirely from your fine-tuning data, and a few hundred bad examples will pull the outputs somewhere you don't want. Data curation matters more than data volume.

I started with Twitter, mostly "tpot"--chronically online Twitter avatars. Good for this because the tweets are recent, the vernacular is current, and the people writing them actually care about how sentences sound. From about 40 accounts I pulled 200k tweets from the past year.

I also pulled from TikTok, Reddit, and university humor blogs like The Harvard Lampoon for longer-form content.

For TikTok I built a scraper on Modal. As the scraper scrolls through hashtag pages and collects URLs, each URL gets sent to a spawned function. Modal spins up Playwright containers as the URLs come in, capped at 5 to avoid detection--anything beyond that queues. Each container scrapes the video, extracts audio via yt-dlp, and stores it in a volume. Then Whisper large-v3 on A10Gs transcribes in batches of 8, scaling up to 10 GPU containers as the volume fills.
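The shape of that pipeline (a capped pool of scrapers feeding fixed-size transcription batches) can be sketched with the Python standard library alone. This is not Modal's API and not the actual scraper: a thread pool stands in for the spawned containers, and scrape_video is a hypothetical placeholder for the Playwright and yt-dlp step. Only the two constants come from the description above.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

MAX_SCRAPERS = 5  # from the post: at most 5 scrapers run at once, the rest queue
BATCH_SIZE = 8    # from the post: transcription runs in batches of 8


def scrape_video(url: str) -> str:
    """Hypothetical placeholder for the real step (Playwright + yt-dlp).

    Returns the name of the audio file that would be written to the volume.
    """
    return url.rstrip("/").rsplit("/", 1)[-1] + ".wav"


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def run_pipeline(urls):
    """Scrape with bounded concurrency, then group audio files for transcription."""
    # Work beyond MAX_SCRAPERS waits in the executor's internal queue.
    with ThreadPoolExecutor(max_workers=MAX_SCRAPERS) as pool:
        audio_paths = list(pool.map(scrape_video, urls))
    return list(batched(audio_paths, BATCH_SIZE))
```

Calling run_pipeline on 20 URLs yields three batches of 8, 8, and 4 files, each of which would go to one Whisper call.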

After filtering I ended up with 48k examples for SFT. The resulting model had the vernacular I was going for. RL was next.

Rubric RL

For RL, each training example has a prompt and a list of rubric items. The policy generates a response, then a grader model (Qwen3-30B) evaluates it against each rubric item. One rubric might ask "Does the response use specific names rather than generic terms?" Another asks "Does it commit to one premise and heighten it?" The grader outputs a score in <score></score> tags, a regex extracts the number, and the final reward is a weighted sum across all seven rubrics.
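A minimal sketch of that reward computation, under stated assumptions: the rubric names, the weights, and the 0-to-1 score range are hypothetical, since none of them are given here, and treating a missing or malformed tag as the lowest score is one possible choice rather than the one used in training.

```python
import re

# Matches an integer or decimal inside <score></score> tags.
SCORE_RE = re.compile(r"<score>\s*(-?\d+(?:\.\d+)?)\s*</score>")


def parse_score(grader_output: str, lo: float = 0.0, hi: float = 1.0) -> float:
    """Extract the grader's score, clamped to [lo, hi] (assumed range).

    A missing or malformed tag scores as `lo` (an assumption).
    """
    match = SCORE_RE.search(grader_output)
    if match is None:
        return lo
    return min(hi, max(lo, float(match.group(1))))


def rubric_reward(grader_outputs: dict, weights: dict) -> float:
    """Weighted sum of per-rubric scores; rubrics with no grader output score 0."""
    return sum(
        weight * parse_score(grader_outputs.get(name, ""))
        for name, weight in weights.items()
    )


# Hypothetical rubric names and weights, for illustration only.
example_weights = {"specific_names": 2.0, "heightening": 1.0}
example_outputs = {
    "specific_names": "Names real people throughout. <score>1</score>",
    "heightening": "Premise drifts halfway through. <score>0.5</score>",
}
example_reward = rubric_reward(example_outputs, example_weights)
```

Letting the grader write free-form reasoning before the tag and parsing only the tag keeps the reward robust to however the grader phrases its explanation.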

A few hundred steps in I noticed the model was adding laughing emojis at the end of every sentence. It had figured out that this correlates with higher judge scores. So I added an eighth rubric, prescriptive rather than evaluative: automatic zero for laugh signals, hedging, meta-commentary, AI-isms.
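The automatic zero works as a hard gate on top of the weighted score. In the setup above it is a rubric item handed to the grader; the sketch below swaps in regexes as a cheap stand-in to show the gating logic. The patterns are hypothetical examples for each category, not the actual checks.

```python
import re

# Hypothetical patterns, one per banned category named in the post.
BANNED_PATTERNS = {
    "laugh_signal": re.compile(r"[😂🤣😆😹]|\b(?:lol|lmao|(?:ha){2,})\b", re.IGNORECASE),
    "hedging": re.compile(r"\b(?:just kidding|no offense|all jokes aside)\b", re.IGNORECASE),
    "meta_commentary": re.compile(r"\bhere(?:'s| is) (?:a|the|my) joke\b", re.IGNORECASE),
    "ai_ism": re.compile(r"\bas an ai\b", re.IGNORECASE),
}


def violations(response: str) -> list:
    """Return the sorted names of every banned category the response triggers."""
    return sorted(
        name for name, pattern in BANNED_PATTERNS.items() if pattern.search(response)
    )


def gated_reward(response: str, rubric_score: float) -> float:
    """Zero the reward on any violation; otherwise pass the rubric score through."""
    return 0.0 if violations(response) else rubric_score
```

Because the gate multiplies the whole reward by zero rather than subtracting a penalty, a response cannot buy back a laughing emoji with high scores on the other rubrics.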

What didn't work

DPO on comment rankings. I had thousands of TikTok videos with full comment sections and like counts. The idea was to rank comments by likes and train the model to prefer higher-ranked ones. But the top comments were things like "😂😂😂" with 50k likes--no signal for joke structure.
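To see why this fails, here is roughly what building preference pairs from like counts looks like. It is a sketch under assumptions: the prompt is taken to be the video transcript, min_like_ratio is a hypothetical threshold for keeping only clear-cut pairs, and the comments are made up.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    text: str
    likes: int


def preference_pairs(prompt: str, comments: list, min_like_ratio: float = 2.0) -> list:
    """Pair each comment with every lower-ranked one it clearly out-liked.

    `min_like_ratio` (hypothetical) drops pairs whose like counts are too close
    to call.
    """
    ranked = sorted(comments, key=lambda c: c.likes, reverse=True)
    pairs = []
    for i, chosen in enumerate(ranked):
        for rejected in ranked[i + 1:]:
            if chosen.likes >= min_like_ratio * max(rejected.likes, 1):
                pairs.append(
                    {"prompt": prompt, "chosen": chosen.text, "rejected": rejected.text}
                )
    return pairs


# Made-up comments for illustration.
example_pairs = preference_pairs(
    "video transcript goes here",
    [
        Comment("😂😂😂", 50_000),
        Comment("a longer comment with an actual setup and punchline", 900),
        Comment("first", 3),
    ],
)
```

Every pair involving the emoji comment marks it as chosen, so the preference signal pushes the model toward reactions instead of jokes.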

Synthetic preference pairs. I also tried having GPT-4o-mini generate "funny" and "unfunny" versions of each piece of content for DPO. The "funny" outputs were verbose and manic, stuffed with references. The "unfunny" ones were often closer to actual deadpan. The model doesn't know what makes something funny--it just makes the "funny" version longer.

What worked. A lot of iteration on the rubrics and the data mix. I had to include specific comedy bits that I thought were general, funny, and recent enough for the style I wanted, then tune how much of each source to upsample or cut.

Try it

jokegen2-1t-rl -- the RL model
jokegen2-1t-sft -- the SFT baseline
github.com/sdan/jokegen -- training code, rubrics, data (coming soon, adapted from tinker-cookbook)
Live demo -- unfortunately not public, Kimi K2 is expensive to run and this would violate Tinker's TOS