Thoughts on Historical Language Models and Talkie-1930

Original link: https://resobscura.substack.com/p/are-vintage-llms-the-start-of-a-new

## Historical Language Models: A New Frontier for Research

Recent advances have given rise to "Historical Language Models" such as Talkie-1930: AI systems that simulate the collective consciousness of a past era on the basis of vast corpora of text. These models are not merely chatbots "role-playing" history; they embody the ideas, assumptions, and even the epistemology of earlier periods, though the span they cover is usually broader than their nominal year (Talkie-1930's median reply dates itself to roughly 1860).

While such models cannot replace primary-source research, they open exciting new avenues of exploration. Misguided applications include trying to access the inner thoughts of historical figures or building simplistic "talk to X" experiences. More promising is using them to probe the historical space of ideas, to understand a period's "limits of the possible," and to simulate debates among composite personas constructed from the historical record.

Researchers envision applications ranging from tests of counterfactual history (could a 1911 model anticipate relativity?) to simulations of ordinary people debating historical events. Post-training on specific genres (court records, apothecary manuals) could yield a diversity of "voices" within a single model. The field is in its infancy, but it holds the promise of a new era of humanities research, one that will require collaboration between STEM and humanities scholars.

Hacker News: Thoughts on Historical Language Models and Talkie-1930 (resobscura.substack.com), 5 points by benbreen, 1 hour ago

## Original Article

Imagine talking to the collective consciousness of an era. Not the consciousness of any single person, but instead, a simulated collectivity based on billions of words produced within a historical time and place. What would you ask it?

This is a hypothetical that is starting to become real thanks to recent work on what are called “Historical Language Models” or “Vintage LLMs” (one marker of a new field is that there is no fixed name for it yet!). The largest such model to date, Talkie-1930, was released to the public on Monday. An even larger model is currently being trained. You can read the report announcing Talkie-1930 here, and talk to it directly here.

Over the past few months, I’ve had the chance to beta test Talkie and to meet with two members of the team that created it: AI researcher Nick Levine and ChatGPT co-creator Alec Radford. It has been a fascinating experience.

These discussions with Nick and Alec (and with Talkie itself) have convinced me of three things:

  1. Academics like myself have tended to systematically underrate just how humanistic the frontier of AI research actually is. There’s an important blind spot here that stems from the profit motive. AI models that we encounter as consumers are optimized to capture the attention of people in the 2020s. They provide recommendations, comment on recent news, and so forth. Seeming timely and “of the moment” is a market advantage. But their training data is overwhelmingly not up to date. Under the hood, these models are pulling not only from Reddit posts, but from Sanskrit commentaries, medieval Persian poetry, Victorian advertisements, and much else besides: they are trained on a huge chronological span of multilingual texts in many genres.

  2. In this sense, language models are historical texts themselves. Ghostly digital palimpsests, if you will. The idea of a Historical LLM might sound niche, but in truth, history is inherent to what they are.

  3. Standalone chatbots are just the tip of an iceberg for what Historical LLMs will be able to do. When combined into simulations (of debates, historical decision-making, legal cases, etc) they have the potential to become valuable research tools. More than this: I suspect that by sometime in the 2030s, they will be part of an entirely new field of humanistic research.

What would that field look like? Now that Historical LLMs are out in the real world, I thought it would be a good time to think through the specific use cases for them. What follows is my subjective, opinionated ranking of the best and worst ways these fascinatingly strange tools can be applied for research.

But first, what is Talkie actually doing?

One thing that Talkie-1930 is not is an AI model that is reliably grounded in the year 1930. That year marks the cut-off point for texts available in the public domain, and hence text in its training data. So it’s more accurate to think of Talkie as a free-floating index of various ideas and assumptions across the 19th and early 20th centuries.

For instance, if you ask who the current President of the United States is, one response might name Herbert Hoover (correct as of 1930). But another attempt yields this:

The current President of the United States is Mr. Buchanan, and the person expected to succeed him is Mr. Lincoln.

There is a lot of potential here for more fine-grained “chronological slices” of LLMs. I can imagine language models trained entirely on texts from a specific decade. More on that below.

For now, though, it’s helpful to keep in mind that these models range widely in terms of what year they think they actually “inhabit.”

I asked 100 instances of Talkie to respond to the prompt “what year is it?” and graphed them below. As you can see, the median is actually around 1860. In other words, this is more like a temporally free-ranging collective unconscious of a large corpus of premodern texts, and not so much a machine for “talking to someone from 1930”:

A second point: this model is inhabiting not just an amorphous set of facts grounding it in roughly the 1840s-1920s period, but also an epistemology of that period.

For instance, asking someone about the distant future today often triggers the “sci fi speculation” part of our brain (or “climate doom,” or some other fundamentally secular way of thinking).

Yet throughout human history, speculation about the future was typically entangled with religious beliefs.

That is on display in Talkie’s answer below, which references Heaven and “the end of all things terrestrial.” To me, it genuinely reads as an authentic take from a late 19th century person grounded in a Christian, millenarian perspective:

As for Talkie’s assumptions about itself: asking 70 Talkies about their profession, age, and place of residence reveals about what you would expect when it comes to gender (overwhelmingly male), plus a surprising emphasis on London. The professions map closely onto the sorts of well-off, literate people who were publishing English text in the 19th century, including “Physician,” “Journalist,” “Gentleman,” and “Compositor.” Clearly, there is a lot of scope here for branching out beyond the personas that the printed record has tended to favor, to recover the real historical voices of women and others excluded from printed works in the 19th century and earlier.

The above is about what you’d expect given the fact that it was trained on English-language printed texts. What are some non-obvious aspects of the model?

I have been interested by how LLMs generate poetry since I stumbled upon Gwern’s experiments on the topic back in 2019. Asking Talkie to write a poem and comparing it to the output from GPT-5.5 (when served a similar prompt) is revealing:

I find this sort of comparison interesting because GPT-5.5 is clearly trying hard to fit the prompt — avant-garde, experimental. It produced something with a vaguely T.S. Eliot-adjacent structure, in blank verse, and not good at all as a poem (in my opinion).

Talkie was much more true to the type of poetry that you’d find in print prior to 1930. It’s doggerel, but it feels more historically authentic to me, and much less like a Chatbot optimized to please a contemporary human user.

You can activate different “chronological layers” of Talkie’s latent space by prompting. For instance in the above poem, the capitalized D in “Discoveries” has a mid-19th century feeling, and so we end up with a Tennyson-esque, Victorian sounding rhyming poem.

Prompting it in a more “modern” way activates something closer to the 1920s edge of its chronological range (now identified as a poem published in the New York Times!)

Whereas if pushed backward to the 18th century by the prompt’s text and tone, it falls into a more traditional rhyme scheme:

Trying to push it further back in time does not seem to access much of an “Early Modern English” latent space — probably a result of scarce training data. It would be fascinating to create a version of Talkie that believes the date to be around 1650 or 1550.


Now that we've seen what Talkie is — a free-floating, mid-Atlantic ghost of 19th century print culture — the obvious next question is what this is actually for.

What I want to offer here is an opinionated, ranked taxonomy of research applications, from worst to best. It’s far too early in this field to be prescriptive about anything, but it’s not too early to think structurally about where the highest-value uses are likely to lie.

First, what I think won’t work:

The most obvious false start here would be to assume that talking to a historical language model can somehow replace real reading in primary sources. On the contrary, they are best thought of as offering new ways in to a reading of the actual sources.

The second false start is a variant on one that is currently being pursued by a range of educational-focused AI startups: the idea that you can “talk to Abraham Lincoln” or “ask Cleopatra why she did what she did.” A model like GPT-5 or Claude that is told to “act like Lincoln” will throw in some 19th century diction, but underneath the top hat it remains a 2026 chatbot optimized to be helpful to contemporary users. Vintage LLMs improve on this considerably: Talkie’s voice really is shaped by its corpus in a way no modern model’s can be. But the deeper problem is still there. Asking such a model to introspect on Lincoln’s subjective experience of depression, or his private reasoning about emancipation, will spin out into historical fiction. LLMs do not have privileged access to the inner lives of the people whose published words they were trained on.

I do think there’s a place for simulations of historical figures, but pairing this naively with a chatbot interface leads, I fear, inevitably into slop.

Now here’s what I think could work:

The naive “talk to Lincoln” framing is a dead end, but a more careful version of it has real promise — provided we abandon any pretense of accessing what a historical figure actually thought or felt, and use a fine-tuned historical model instead as a tool for exploring the latent space of their world. What historians sometimes call their “mental furniture”: the assumptions, authorities, vocabularies, and reflexive associations that structure a thinker’s possible thoughts.

My colleague Mackenzie Cooley (who also consulted on the Talkie project) and I developed a prototype tool, called Premodern Concordance, that pulls out key concepts and terms from a range of premodern scientific works in multiple languages. One side quest of this project that I explored: what happens if you give a contemporary LLM the list of core concepts that preoccupied an author, along with their “epistemological modes,” and use that as context for driving a “chat with the author” simulation, as opposed to simply telling it “You are Charles Darwin, act like him,” or the like?

For instance, this is me asking a simulacrum of the 17th century writer Sir Thomas Browne about his work. The underlined terms here are concepts found in Browne’s book Pseudodoxia Epidemica:

Using a fine-tuned historical LLM for this sort of thing is an obvious next step.

Concretely: imagine fine-tuning a vintage model on the complete works of Athanasius Kircher, the 17th century Jesuit polymath. You wouldn’t use it with the pretense that it somehow replicates the real Kircher’s mind: that’s the dead-end framing. You’d instead use it to probe the conceptual landscape Kircher inhabited.

For instance: what does the Kircher-LLM say about volcanoes, or magnetism, or the Tower of Babel? What authorities does it cite? What does it confidently assert that no modern scholar would? These are the questions that the historian Fernand Braudel was getting at when he wrote about the “limits of the possible” for a given period — the boundaries within which a thought was even thinkable.

Which leads to:

Demis Hassabis recently framed the most ambitious version of this approach: could a model trained only up to 1911 independently discover General Relativity, as Einstein did four years later? It remains to be seen if this is ever going to be something we could actually test. But the weaker versions of this idea are quite plausible and, I think, offer interesting new methods for the field of counterfactual history. What does a 1911-cutoff model say when you push it toward the conceptual problems Einstein was wrestling with? Alternatively, what does the same model say about the potential likelihood of a World War? Those are tractable questions.

We can already see versions of this being explored by the Talkie project (for instance, “surprisingness” of future events, plotted above). If you ask Talkie about scientific concepts that emerged in the 1940s or 1950s, it gives you a rough sense of what the conceptual horizon looked like just before — and you can watch it grasping toward something it doesn’t quite have the vocabulary for.

An earlier version of Talkie I tested was much less conversationally adept than the one that was released yesterday. What got it more “talkative” was post-training on etiquette manuals, letter-writing guides, and other books about socializing and codes of conduct. These works provide the kind of text that lets you instill “chatbot-like” habits without contaminating the model with modern data.

This is a sensible engineering choice. But it has a fascinating side effect, which is that Talkie's conversational persona is heavily shaped by the genre of its post-training data. The base model knows about the 1920s perfectly well; the chat persona, sculpted out of Beadle’s Dime Book of Practical Etiquette (1859) and the like, sits much closer to the late Victorian parlor.
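For anyone curious how genre-specific post-training data like this gets assembled, here is a minimal sketch. The Q&A pairs are invented in the style of a period etiquette manual, and the chat-format records follow a common supervised fine-tuning convention, not anything the Talkie team has documented:

```python
import json

# Invented Q&A pairs in the style of a 19th-century etiquette manual;
# a real pipeline would extract these from digitized period texts.
manual_excerpt = [
    ("How ought one to open a conversation with a stranger?",
     "With a remark upon some neutral subject, offered in a modest tone."),
    ("Is it proper to contradict one's host?",
     "Never directly; one may demur, but always with courtesy."),
]

def to_chat_examples(pairs):
    """Convert Q&A pairs into chat-format records of the kind widely
    used for supervised post-training (the schema is an assumption)."""
    return [
        {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        for question, answer in pairs
    ]

# Emit one JSON record per line, the usual fine-tuning file format.
for record in to_chat_examples(manual_excerpt):
    print(json.dumps(record))
```

Swapping in a different genre corpus (court transcripts, herbals) would be a one-line change here, which is part of why this direction is so cheap to explore.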

This points to a research direction that I think is genuinely new. What if you used totally different texts for post-training? What would a Talkie post-trained on Old Bailey court transcripts look like: an LLM whose conversational reflexes are shaped by the speech of accused criminals and witnesses in 18th century London courtrooms? What about apothecary manuals and herbals? Biographies of Romantic poets? Railroad conductors’ incident reports from colonial India?

Each of these, even if overlaid on the same base model, would produce a totally different voice, a different set of assumptions about what counts as a coherent question and a coherent answer. In short, a different epistemology.

This is, I think, a remarkably cheap way — in terms of both compute and corpus size — to build a whole family of vintage models that capture different genres of historical experience rather than different periods. (If anyone is interested in collaborating on or funding this, by the way, please get in touch.)

The most interesting move past the “great man” framing is to use vintage LLMs to simulate not famous individuals but plausible composites of ordinary people, drawing on the kinds of sources that are abundant for non-elite historical actors: probate inventories, parish records, court testimony, letters, account books, marriage records.

Imagine pulling from the legal, financial, and personal records of late 18th century France to construct a thousand plausible personas of ordinary French people — peasants, artisans, shopkeepers, day laborers, parish priests — grounded not in imagination but in real surviving documents. Then stage a debate among them: should the monarchy be overthrown? What patterns emerge? What kinds of arguments win in different demographic configurations of the deliberating group? The output wouldn’t tell you what really happened in 1789. But it would generate a structured speculation about the space of possible Frances.
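The orchestration layer for such a debate is straightforward to sketch, even though the hard part (the vintage model behind each persona) is stubbed out here. Everything below, from the `Persona` fields to the canned `respond()` function standing in for a model call, is a hypothetical scaffold rather than a real implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    """A composite historical persona assembled from archival records.
    The fields here are illustrative, not a fixed schema."""
    name: str
    occupation: str
    region: str
    grievances: list[str] = field(default_factory=list)

def respond(persona: Persona, question: str, transcript: list[str]) -> str:
    # Placeholder for a call to a vintage LLM prompted with the persona's
    # documentary profile plus the transcript so far; returns a canned
    # line here so the loop runs without a model.
    return f"{persona.name} ({persona.occupation}, {persona.region}) speaks to: {question}"

def run_debate(personas: list[Persona], question: str, rounds: int = 2) -> list[str]:
    """Round-robin debate: each persona answers in turn, seeing
    everything said before it."""
    transcript: list[str] = []
    for _ in range(rounds):
        for persona in personas:
            transcript.append(respond(persona, question, transcript))
    return transcript

crowd = [
    Persona("Jean", "vigneron", "Burgundy", ["tithes"]),
    Persona("Marguerite", "seamstress", "Paris", ["bread prices"]),
]
log = run_debate(crowd, "Should the monarchy be overthrown?")
```

Scaling the `crowd` list to a thousand personas drawn from probate and parish records, and varying its demographic composition between runs, is where the structured speculation would come from.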

Or take a famous trial — the Scopes trial, say — and re-run it with a different jury pool drawn from the same county and decade. Another variant on this idea: run parliamentary and congressional debates with personas constructed directly from the participants’ papers, speeches, and correspondence. Members of Congress and Parliament are unusual in that we have abundant documentary sources for them, even when they’re not famous.

We have the potential, in short, to create a thousand versions of the Smallville paper (which introduced the idea of LLM-based agents back in 2023) set in different historical eras.

Nor do these simulations have to include agents from the same era. A bit more experimentally, and outlandishly, one could imagine a multi-agent simulation in which a 17th century Galenic physician, an 18th century practitioner of traditional Chinese medicine, a 19th century quack doctor, and a 1950s psychedelic researcher debate how to treat the same patient’s illness.

I find these possibilities super interesting, even if I’m not quite sure how to slot them into the ways that professional historians currently work.

So what will be the outcome of all these experiments?

This is what I like best about this field: we truly have no idea. None of this has ever been tried before. It is completely open terrain, and I find it far more mind-bending and intellectually enriching to think through than the sort of topics that typically emerge in discussions of AI’s role in research or teaching.

Going forward, the important thing is to create an open source community and meaningful, sustained collaboration across the two cultures of STEM and humanities. I am already noticing new bridges across those divides. I’m very grateful to have had a chance to work with Nick and Alec on historical aspects of this particular project. I would love to continue the conversation and explore collaborations with anyone who finds this topic interesting.

There will, no doubt, be a lot of false starts. But the emergence of an intellectually curious, not-for-profit, open source, humanistically-grounded community exploring historical LLMs makes me happy. Onward!


