I have been looking at the ability of large language models (LLMs) to write poetry for more than five years. With every new model, the technical abilities improve: pro models can produce lines and stanzas, handle rhyme and meter, turn surprising and witty phrases, and produce impressive metaphors. Technique is not the problem.
Does any LLM output rise to the level of good poetry, or even great poetry? It’s a different question from the one studies of favorable ratings ask. I’m interested in greatness: what it is, and whether LLMs can get there.
My definition of “great” poetry, which I’ve used over thirty years of reading, teaching, and writing poetry, is simple: a great poem is both particular and universal. A great poem is “about” a specific person or moment embedded in a particular culture, composed in such a way that it reaches across time and distance to resonate with readers outside that culture.
Poets work inside an historical network of existing poems. A new poem resonates when it activates prior reading in the mind of the reader. Lines and images from older work resonate with each other and with new work. That resonance is the mechanism by which the particular becomes universal.
LLMs are trained on existing texts. LLMs have in their training data most of the digitized poetry in existence; their output tends to reflect patterns in that data. A model asked to write a poem about grief will draw upon images and phrasings about grief in earlier poems. (Primer on vector embeddings.) But LLMs do not yet, at this writing, have culture. Without culture, I’m not sure there can be great poetry.
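A toy sketch of that mechanism (the vectors and the “grief” line below are invented for illustration, not drawn from any real model): in an embedding space, a new line about grief sits nearest the phrasings of older grief poems, and that nearness is what the model draws on.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented 4-dimensional "embeddings" standing in for the thousands of
# dimensions a real model uses; the numbers are illustrative only.
prior_poems = {
    "Do not go gentle into that good night":   np.array([0.9, 0.1, 0.3, 0.0]),
    "Because I could not stop for Death":      np.array([0.8, 0.2, 0.4, 0.1]),
    "Shall I compare thee to a summer's day?": np.array([0.1, 0.9, 0.1, 0.6]),
}
new_grief_line = np.array([0.85, 0.15, 0.35, 0.05])  # a hypothetical new line "about grief"

# The new line lands nearest the older grief poems, not the love sonnet.
for text, vec in sorted(prior_poems.items(), key=lambda kv: -cosine(new_grief_line, kv[1])):
    print(f"{cosine(new_grief_line, vec):.2f}  {text}")
```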
Gwern
If anyone can crack this nut, it is my favorite thinker on how exactly LLMs and poetry work: Gwern. His recent experiments using LLMs to write formal verse in the Pindaric mode “on the occasion of visiting a biomedical lab and watching the release of unnecessary mice,” to “complete” a poem by William Empson, and to write a romantic poem are, in my view, required reading for all working poets and poetry scholars in the AI era. Gwern wonders how the tools can be part of the work of making poems. He is also very much part of LLM “culture.”
Writing a poem is like making sausage, every poet knows. Writing poetry is not “the spontaneous overflow of powerful feelings … recollected in tranquillity” as Wordsworth had it. It’s more like a Procrustean bed.
Gwern’s process for prompting an LLM to compose or complete a poem is refreshing and familiar to me. I’ll start with his project to “complete” William Empson’s “This Last Pain” (1930), a poem about how people who have gone to hell have knowledge of the bliss they’re denied. The poem compares those of us still living to a housemaid peering through a keyhole at happiness she cannot enter. It’s not in fact a great poem (I had barely heard of it), but that’s part of what’s interesting about Gwern’s project: nobody would object that he’s fiddling with a great work of art.
His challenge was how to get the LLM to follow instructions and be creative about how it thought the poem might end. Early models didn’t follow instructions reliably. Give them a poem fragment and they might continue it, or generate a fake scholarly commentary, or fabricate a bibliography, or do something else entirely. I know this from my own experiments in 2020. They also couldn’t rhyme because of the way tokenization (BPE) fragmented words. There is a wonderful strangeness in Gwern’s early samples; they are, in their own way, poetic. Read them all.
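A minimal sketch of the tokenization issue, assuming the tiktoken library and the GPT-2-era BPE vocabulary (the exact vocabularies of Gwern’s models may differ): the model is trained on integer token ids, not on letters or sounds, so words that rhyme on the page can look entirely unrelated as token sequences.

```python
# Requires `pip install tiktoken`; "gpt2" is the BPE vocabulary of GPT-2/GPT-3-era models.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for word in ["tranquillity", "fragility", "despair", "compare"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r:16} -> {pieces}")

# The model learns over the integer ids, not letters or sounds, so two words
# that rhyme on the page may share no tokens at all and "look" unrelated.
```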
Later models like ChatGPT were more obedient but lost creativity. They had been trained via reinforcement learning from human feedback (RLHF) to produce outputs satisfying human raters, and humans like bland positivity. Gwern calls this “mode collapse.” They could now rhyme, but the output was safe, sentimental, generic verse. Hallmark card-like.
Creativity returned, Gwern claims, because of three factors: (1) reasoning models like o1-pro that use chain-of-thought, giving the model time to move out of the default basin of chatbot blandness, (2) continued capability scaling, and (3) explicit attention to mode collapse at frontier labs, particularly Moonshot’s “rubric training” for Kimi K2, which allowed qualitative feedback rather than binary preference. I am not sure I fully follow this, but follow the link above.
For the final version, Gwern constructed a multi-stage prompt (a code sketch of the loop follows the list):
Analyze the style, content, and intent of the original.
Brainstorm 10+ different directions the poem could go. Emphasize diversity.
Critique each direction. Rate 1-5 stars.
Write the best one.
Critique and edit line by line.
Generate a new clean draft.
Repeat at least twice.
Print final version.
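Sketched as code, the loop looks roughly like this. The `complete` function is a stand-in for whatever chat-model API is being called, and the prompt wording is mine, not Gwern’s; only the stages come from his description.

```python
from typing import Callable

def complete(prompt: str) -> str:
    """Stand-in for a call to whichever chat-model API is in play (Claude, o1-pro, Kimi K2, ...)."""
    raise NotImplementedError("wire up a real model here")

def compose(fragment: str, llm: Callable[[str], str] = complete, rounds: int = 2) -> str:
    """A rough rendering of the multi-stage prompt as an orchestration loop."""
    analysis   = llm(f"Analyze the style, content, and intent of this poem:\n{fragment}")
    directions = llm(f"Given this analysis:\n{analysis}\n"
                     "Brainstorm 10+ different directions the poem could go. Emphasize diversity.")
    ratings    = llm(f"Critique each direction and rate it 1-5 stars:\n{directions}")
    draft      = llm(f"Write out the highest-rated direction in full:\n{ratings}")

    for _ in range(rounds):                       # "Repeat at least twice."
        critique = llm(f"Critique and edit this draft line by line:\n{draft}")
        draft    = llm(f"Generate a new clean draft applying these edits:\n{critique}")

    return draft                                  # "Print final version."
```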
Gwern had his models do a lot of work before picking the best. This is what poets do: write many rough drafts that end up in the trash. Murder your darlings. Then assess the final results and publish.
Gwern and his LLM sidekicks would feel at home on any selection committee for any literary journal anywhere. Sifting and discussing is how these committees work, except for one thing that is missing here: discussion of the poet, by which I mean their output up to this point, the trajectory of the poet’s mind, the poet’s mentors (if any), how “striking” the poem is compared to everything else on the table, and how it moves the field of poetry forward a little, into new terrain.
It turns out the best indicator of future poetic greatness is early work. So nurturing a poet, making space for experimentation, and increasing output are good things. Gwern is doing this both with himself and with his models. He sees that different models have different strengths, as if mentoring a seminar of budding poets. He uses Claude for “better taste and curating,” o1-pro for brainstorming and initial generation, and Kimi K2 for critique. He matches the model to the task. He also revisits poems as models improve, treating the poem as a living document that can be refined with better tools.
The Pindaric Ode Project
Gwern’s more recent project, to generate new poems in the style of the Pindaric ode, “to lab animals, praising them and their unwitting yet noble sacrifices, in their teeming hundreds of millions over the centuries,” is excellent. It is very much in the spirit of an Ode.
Gwern’s partner LLMs wrote a detailed prompt (he calls it Pressure-Cooker Version v4.2) that defines the form strictly, from triadic structure to alliteration rules to enjambment requirements. Gwern’s contributions, he insists, were primarily encouraging his LLMs to keep going. The LLMs did the work.
Gwern asked his LLMs to research the history of laboratory animals and compile a database of proper nouns and images that could be used. The LLMs also came up with categories, from geography (Vivarium, Laminar Flow, Autoclave, Sharps Bin) to heroes (Laika, Dolly, OncoMouse) to tribes (C57BL/6, Wistar Rat, BALB/c) to priests (Abbie Lathrop, Claude Bernard, Louis Pasteur, Banting & Best) to rituals (LD50, Cervical Dislocation, Tail Vein Injection), to concepts (The Warm Cage, Blood-Price, Ledger of Lives). Fantastic!
The point of the databank was to prevent the model from reaching for generic images and ideas while also requiring that the poem stay in its lane (about lab animals). The process then ran much as before: generate multiple drafts, critique and rate each, rewrite with the critiques in mind, select the best, then draft-critique-revise iteratively.
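One way to picture how the databank keeps a poem in its lane (my own sketch of the idea, not Gwern’s code; the weights and the list of “generic” phrases are invented): score each draft by how many of the compiled particulars it actually uses, penalize stock imagery, and let the weakest drafts die.

```python
# A toy filter over drafts, using a few entries from the databank described above;
# the weights and the "generic" phrase list are my own invention.
DATABANK = {
    "geography": ["Vivarium", "Laminar Flow", "Autoclave", "Sharps Bin"],
    "heroes":    ["Laika", "Dolly", "OncoMouse"],
    "tribes":    ["C57BL/6", "Wistar Rat", "BALB/c"],
    "rituals":   ["LD50", "Cervical Dislocation", "Tail Vein Injection"],
}
GENERIC_PHRASES = ["gentle souls", "silent heroes", "tiny hearts", "furry friends"]

def lane_score(draft: str) -> int:
    """Reward drafts that use the databank's particulars; penalize stock imagery."""
    text = draft.lower()
    hits = sum(term.lower() in text for terms in DATABANK.values() for term in terms)
    misses = sum(phrase in text for phrase in GENERIC_PHRASES)
    return hits - 2 * misses

def cull(drafts: list[str], keep: int = 3) -> list[str]:
    """Keep the drafts that stay in their lane; feed the rest back for rewriting."""
    return sorted(drafts, key=lane_score, reverse=True)[:keep]
```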
For the final brainstorming, Gwern prompted the model to evaluate the poem as if it were a submission to Poetry magazine, then asked for a “reviewer #2” report with detailed criticism and suggestions. He finds that adopting this persona “unhobbles” the model’s feedback; it becomes more critical and energetic.
Oh boy, do I agree. The output? Very, very good. Occasionally great. Maybe missing something I’ll describe at the very end, but still very impressive work.
As a poet and scholar of poetry, I feel comfortable arguing that Gwern’s work engineering prompts is, in effect, writing poetry. He is using a kind of artisanal process, and I know from conversation that he thinks like a poet. He asks: what form might a poem take? What words might be used? What words are now clichéd or evoke the opposite of what is meant? What will editors think? Gwern is pursuing a question I am equally interested in: can LLMs produce something genuinely good? Perhaps he is interested in scale? But what is poetry at scale?
Mercor and Poetry
The other AI poetry project I’m following is Mercor’s. I was delighted to listen to Tyler Cowen’s recent Conversation with Brendan Foody, founder and CEO of Mercor, and delighted that Tyler came right out of the gate asking why Mercor is hiring poets: “Why would you pay a poet $150 an hour?”
Well, Foody answered:
When one of the AI labs wants to teach their models how to be better at poetry, we’ll find some of the best poets in the world that can help to measure success via creating evals and examples of how the model should behave.
One of the reasons that we’re able to pay so well to attract the best talent is that when we have these phenomenal poets that teach the models how to do things once, they’re then able to apply those skills and that knowledge across billions of users, hence allowing us to pay $150 an hour for some of the best poets in the world.
I should note that I’ve found job listings for “experienced poets” that pay only $50, but quibbling aside, I really wanted to understand how this process differs from Gwern’s. The conversation is super interesting, about AI productivity, expertise, “niche” knowledge, taste, “traction,” and pleasing “users” of poetry. Do listen.
As I understand it, Mercor doesn’t really want to appreciate the fine poetry an AI model could write. Poetry is a kind of test case, a training ground for understanding what “expert knowledge” brings to the table. Mercor’s bet is that the same process that trains a model to write better poetry can train it to do better legal drafts, better medical diagnoses, better financial analyses. The core assumption is that professional judgment (a lawyer deciding how to frame an argument) and aesthetic judgment (a poet deciding how to break a line) are computationally similar problems. Both require the model to navigate an “unbounded” decision space where there is no single “correct” answer, only “better” or “worse” ones based on expert consensus.
Put another way, Mercor is seeking to create a Bach Faucet by hiring “experienced poets” to build the faucet. Experience, according to the ad, means 10+ years of writing and publishing award-winning poetry in notable publications, deep knowledge of poetic devices, structure, and voice. (I am an experienced poet, if being a “finalist” counts.) The actual job is “analyzing and creating poetry, offering critical insights into literary quality and technique.”
The process is not as transparent as Gwern’s. As Foody describes it to Tyler, it starts with a rubric, or scoring guide, similar to what teachers use for evaluating student work. Foody is vague on this, talking about poems being “desirable” or what a professor might “like,” suggesting that, because the liberal arts are “subjective,” it’s hard to say more. From my own long experience, I would guess that a rubric might say: if the poetic structure is appropriate to the subject, reward. If the poem uses clichéd imagery, penalize. If the metaphors are mixed, penalize. If the ending reframes the opening, reward. I note that these are not really subjective criteria.
The next step would be for the poets to take a sample of AI-generated responses that scored well against the rubric and explain why the best one is best. The experts’ rankings are fed back into the rubric. Future models get tested against the eval as well as the rubric, and the ones that match expert preferences score higher. This is the core of Reinforcement Learning from Human Feedback (RLHF). Over time, the model learns the pattern of preferences.
Expert creates rubric => model generates [poem, legal brief] => expert grades => rubric is refined => model updates.
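Rendered as code, and using the rubric items I guessed at above (this is a sketch of the general rubric-plus-RLHF shape, not Mercor’s actual pipeline; the weights are invented), the loop might look like:

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    question: str   # what the expert checks
    weight: float   # reward if the answer is yes (positive) or penalty (negative)

# The items are the ones guessed at above; the weights are invented.
RUBRIC = [
    RubricItem("Is the poetic structure appropriate to the subject?", +2.0),
    RubricItem("Does the poem use clichéd imagery?",                  -2.0),
    RubricItem("Are the metaphors mixed?",                            -1.0),
    RubricItem("Does the ending reframe the opening?",                +1.5),
]

def rubric_score(judgments: list[bool]) -> float:
    """Score one poem given an expert's yes/no judgment on each rubric item."""
    return sum(item.weight for item, yes in zip(RUBRIC, judgments) if yes)

# The RLHF step on top of the rubric: experts rank rubric-passing poems against
# one another, and those pairwise preferences (winner, loser) become the signal
# a reward model is trained on; future models are scored against both.
preference_pairs = [("poem_A", "poem_B"), ("poem_A", "poem_C"), ("poem_C", "poem_B")]
```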
Again, the thinking is that if you can train a model to satisfy poets, you can train it to satisfy lawyers. The process scales horizontally to every other knowledge sector (law, medicine, consulting) without requiring new software architecture. Just get the best experts in every field.
In some sense, Mercor is trying to solve a “last mile” problem. Current models (as Gwern has shown) can already generate “average” work. The gap between “average” and “expert” (what Foody calls the “last 25%”) is defined by the ability to recognize subtle errors or “edge cases.”
I appreciate that Foody (correctly) sees poetry as a compressed test of a whole bundle of capabilities that matter commercially outside poetry: stylistic control, emotional tone, constraint satisfaction, long-range coherence, nonliteral language. So if you improve the model’s ability to write better poetry, you’d be improving ad copy, UX text, marketing emails, fiction, scripts, even corporate communications. From Mercor’s perspective, “poetry evals” and “poetry RLHF” are a relatively cheap way to train and stress-test style and taste mechanisms that you can reuse everywhere else. The market value of poetry is small, as all poets know, but Foody seems to grasp Shelley’s point that poets are the unacknowledged legislators of the world. The indirect value of poetic capabilities is infinite.
Foody’s pitch for APEX is “measure the things customers actually care about,” and poetry is an easy public benchmark for “this model feels creative and emotionally intelligent.” If you can make the poetry and lyric outputs strikingly better, you support product stickiness and subscription revenue. That justifies hiring high-priced poets as an input to a much larger revenue stream. Focusing on poetry is also low risk: high failure rates are deadly in domains like medicine and self-driving cars, so it is best to push reward-modeling, preference training, and other subjective eval schemes hard in the poetry realm.
But the technical consequence of Foody’s “traction” metric, which mathematically incentivizes a model to regress to the mean of reader preference, is that he’s building an engine for eliminating the “strangeness” Gwern’s prompts are designed to preserve. The “edge cases” Foody mentions are errors to be pruned. For poets, edge cases are the poem.
Mercor’s goal is not the “last mile” in poetry, not identifying hidden gems, but rather a faucet spurting somewhat-better-than-mediocre poems, poems that “most users would want to see,” that gain “traction.” Once a poem’s contribution has been absorbed into that statistical pattern, the individual poem no longer matters.
Greatness?
I want to come back to greatness, and to the particular, because this is where both Gwern and Mercor run up against a limit.
Aristotle writes in the Poetics that poetry is more philosophical and more serious than history. History records what happened to specific individuals. Poetry takes a particular case and makes claims that reach beyond that case. Shakespeare did not write a treatise on ambition; he wrote plays about Richard III and about a fictional Macbeth. The ambition matters because it is embedded in those lives, those cultures, those situations, that language.
You can think of allegory and poetry as going in opposite directions. Allegory starts from an abstract idea such as greed or vanity, then invents characters and scenes to illustrate it. Great poems start with particulars then find themselves gesturing toward the universal. LLMs are like allegory: algorithmic generation tends to move from a general pattern toward manufactured particulars.
Most readers cannot always tell the difference. A well-formed allegory and a good poem can look very similar on the page. That is part of what makes current LLM outputs so seductive; they inherit so much surface craft from their training data.
Yeats’s “For Anne Gregory” (1933) is a useful test case. It is only three short stanzas, about whether a beautiful young woman can be loved for “herself alone” rather than for her yellow hair:
‘Never shall a young man,
Thrown into despair
By those great honey-coloured
Ramparts at your ear,
Love you for yourself alone
And not your yellow hair.’

‘But I can get a hair-dye
And set such colour there,
Brown, or black, or carrot,
That young men in despair
May love me for myself alone
And not my yellow hair.’

‘I heard an old religious man
But yesternight declare
That he had found a text to prove
That only God, my dear,
Could love you for yourself alone
And not your yellow hair.’
Any good LLM can produce a competent variation on this. Ask it. It will swap in “blue eyes” or freckles for the yellow hair, or, if you push it, a multi-billion-dollar dowry or a large social media following. It will keep the antiphonal structure, keep the final turn through theology or therapy. You will get something that reads smoothly enough. It will probably feel like a poem.
The particularity that makes “For Anne Gregory” a great poem isn’t that she was a real person but that she is a particular subject in a particular milieu, with a particular relation to Yeats and to the social codes around beauty, hair, and youth in County Galway, Ireland. Particularity here means more than a real name. It means all the ways this young girl was embedded in her culture. Not every culture fixates on blond hair. Not every culture has young women bareheaded in public. Not every culture features as a common sight an “old religious man” who turns to texts and speaks in a slightly arch way about a young woman’s looks. A great deal of Yeats’s culture is in the poem.
For the Mercor process, that entire layer of meaning is mostly an obstacle. Its poets are paid to define for the AI model a rubric that features good structure, good emotional tone, good technique for broad classes of prompts. The rubric likely does not register the importance of particularity and culture and local (or niche) knowledge. That’s not the point of a rubric.
Models can imitate the pattern of the poem. They can swap in other tokens that fit the same syntactic and tonal mold. They can even gesture at cultural context if prompted, as Gwern’s extraordinary Pindaric Ode project shows. I have no doubt that if I asked Gwern for a Yeats poem he would start with Irish particularity (by which I mean he’d ask his LLM team to research). But without the prompting of Gwern or a human, an LLM cannot originate a poem whose particularity pushes back on the pattern, a poem that belongs to a specific life and a specific historical network and then radiates out from there.
From Gwern to Great
Gwern’s experiments and Mercor’s platform are two different ways of putting LLMs next to poetry, and both say something about poetry. If there is ever to be greatness in LLM poetry, Gwern will be the first to achieve it.
Gwern treats models as collaborators inside the workshop where particular poems are made. He partners with them for analysis, brainstorming, self-critique, drafting, redrafting, and editorial argument. He matches different models to different roles, the way a good teacher matches different students to different exercises. He produces work that feels alive and that might, over time and with revision, move toward greatness, because it remains anchored in specific formal problems and specific imaginative projects.
Mercor uses poems inside the workshop of generalized reward models. The object is not any individual poem or poet. The object is a set of signals that can be reused for law, medicine, consulting, marketing, and whatever counts as “power user” satisfaction in the products that sit on top.
If greatness means that a poem about one life in one situation can resonate with other lives and cultures far away and long ago, then I’m not sure Mercor’s system will be able to capture that, certainly not at scale. Particularity and culture cannot scale, by definition. I’m not sure that, even if Mercor wanted to work at the artisanal level, its rubrics and evals would produce something for which a reader might say, “I believe that this poet captured this truth there, in a place that has nothing to do with me, and yet it touches me here, where I am.” This is greatness, not desirability. But Gwern may get there.
I’m fascinated to see where Gwern’s experiments go and Mercor’s as well.
Great poems are about taste in the Kantian sense, as Tyler notes, though I am also partial to Gwern’s observation: “AI can’t eat the ice cream for you.” Operationally, greatness is measured when tastemakers put poems in anthologies so that generations of readers can read them, tear out the ones that resonate, and tape them to their refrigerators.