与 Mythos 合作的感受

与 Mythos 合作的感受
What it feels like to work with Mythos

原始链接: https://www.oneusefulthing.org/p/what-it-feels-like-to-work-with-mythos

作者通过对 Claude 5 Fable 的早期访问发现，人工智能的能力实现了显著飞跃，标志着人机关系发生了转变。与以往的模型不同，Fable 的运作方式如同一个自主工作室，能够以极少的人为干预完成从复杂数据分析到交互式软件开发等耗时数小时的复杂项目。通过利用子代理来执行研究、编写代码和验证结果，Fable 处理了以往需要大量人工监督的繁琐任务。然而，这种能力也带来了“黑箱”效应：AI 会做出无数用户无法察觉的微小决策和判断。作者总结认为，人类的角色正在从直接创作者或“向导”转变为赞助人——即负责委托工作、提供反馈并评判最终成果，而非掌控具体的执行过程。尽管 Fable 的表现令人瞩目，但这引发了一个问题：人类参与度的边缘化究竟是暂时的界面限制，还是 AI 能力提升所带来的必然代价？归根结底，虽然用户依然可以驾驭这些模型，但“工作”的本质已经从主动参与转变为高层监督。

这个 Hacker News 讨论帖探讨了沃顿商学院教授 Ethan Mollick 关于他使用“Mythos”（在评论中常被称为“Fable”）这一 AI 模型的心得体会。文章描述了从传统软件开发向“委托”模式的转变，即用户指挥 AI 构建复杂的项目。 **讨论要点总结：** * **怀疑与热情的碰撞：** 许多评论者批评作者缺乏技术严谨性，指出所生成的代码往往“难以维护”或属于“垃圾代码”。怀疑论者认为，如果没有适当的工程规范，AI 生成的项目会积累技术债务和隐藏的 Bug，导致后期验证的成本极其高昂，甚至无法修复。 * **“氛围编程”（Vibe Coding）：** 争论的核心在于，如果产出能满足用户需求，代码质量是否还那么重要。一些人认为，对于“低风险”或“副业项目”而言，AI 的速度和便捷性胜过了技术上的瑕疵。 * **经济影响：** 专业开发者担心自身技能贬值以及“高 Token 消耗”工作流的长期可持续性。而另一些人则认为，AI 能够让非技术用户也能参与软件创作，从而实现软件开发的平民化。 * **质量担忧：** 批评者指出了作者演示项目中的事实错误，称其为包装在华丽但肤浅的外表下的“幻觉”。

原文

I had early access to the first Mythos-class AI model being released to the public, Claude 5 Fable. Much of the discussion of Mythos has centered on its impact on software security, but I tested it on everything except that (the guardrails around Fable essentially prevent it from being used for cybersecurity at all). My conclusion is that it represents a very real leap over every model I have used before, and, maybe more important, suggests our relationship with AI is changing in drastic ways.

First, how good is Fable? In experiment after experiment I conducted, it outperformed basically every other public model I have used by a considerable margin. It was capable across many problems and produced some startling results — it would work up to a dozen hours executing on multi-page specifications. I’ll walk you through a couple of more complex, and serious, use cases shortly, but you could see the general improvement across the board on every task. The problem about communicating this in a post is that many of the most impressive results are going to be interesting to only small portions of my readers. For example, it made the most sophisticated academic social science paper I have yet seen from an AI from a single prompt and one piece of feedback. It also created a 10-page epic rhyming poem about a haircut where every word starts with the letter s.

So, as a more accessible and entertaining example, I also had it create a bunch of games you can try. All of these are one initial prompt in Claude Code where Fable had to take my vague prompts and generate something workable, followed by a couple of additional prompts with minor encouragement (“make it better”) or feedback. What makes these especially impressive is that Claude cannot generate images, so every piece of art or 3D object was made with math alone, not using any external assets. You can try any of them: a game about flipping coins (prompt: “Balatro, but for the game of coin flips”) that is quite fun; a snake game where the snake is self-aware and crazy things happen; the work of a famous German Romantic poet translated into an art game (“the Duino elegies as a game. get the mood right”); or a game about descending into the depths to see what is there.

So the output is impressive. But, especially as I turned to more serious projects, I often felt using the tool was somewhere between delightful and unnerving. Delightful because I just asked for something at it happened. And also unnerving because I just asked for something and it happened.

To see why, it helps to understand the way in which Fable gets work done, and for that I want to turn to an example I have tested on many previous AI models: building an isochrone map. This is a map that shows the distance you can travel in a given length of time, and the first one was created in 1881 showing travel times from London.

No previous model did an even halfway useful job with trying to create a map like this because it involves researching thousands of potential trip distances and a lot of small judgement calls and decisions. I decided to try it on Fable using Claude Code with this prompt: i want you to build a fully researched and beautiful isochronic map that lets me pick various cities and see real isochronic lines based on real data. I want the design to be unique. You should take into account airports (and travel time to and from airports) trains, walking, driving. The data does not need to be live but should be real based on your research and data. You can start with a few cities but more general is better, this should be an entirely new project. It then suggested that it do this in the style of the original map. I agreed, and it got to work.

It is worth a second looking at the transcript of the multiple hour building session the AI went through on its own, because you can see some unusual things. First, the AI launched multiple other AIs (I believe mostly the cheaper Claude Sonnet) to help it conduct research on travel times, ultimately retrieving over 2,200 specific flights, the rail schedules for trains from the TGV to the Shinkansen, and road speeds per country from multiple academic papers. And while those agents were running, it started coding. Then it launched yet more agents and tests to verify its code, all the while taking notes about its progress.

The result was a fully functioning map of impressive sophistication that looked a lot like the 1881 original, but that doesn’t mean it was perfect. I noticed that a lot of remote locations (like Greenland) just contained estimates of travel time, not exact numbers, so I told Fable to fix it, including the instructions: actually get travel times to remote airports and locations. This time the AI launched a workflow, adversarial groups of agents that did research and tested each others results. It figured out how often ships sail to Pitcairn Island in the Pacific and how to get to Grise Fjord from Ottawa. And it used a tremendous number of tokens in a very short period of time (more on this soon).

The results were impressive. I pushed a few more times in directions that interested me (including asking for other visualization approaches, etc.). I would recommend spending a couple minutes clicking around the results, and you can read its methods and sources at the bottom of the graph.

What the AI generated. Click on the map to go to the interactive version

This is probably not a useful project for you unless you really like travel and maps, but it is indicative of AI solving a hard problem involving research, math, visual development, taste, judgement, complex coding, and more. And, the unnerving part was how little I did. I gave a really ambitious instruction, the AI followed it. I gave a couple of minor pieces of feedback, and the AI figured it out. My role was extremely limited.

Importantly, it was just limited in how much work I did relative to the model, it was also limited in how much control I had over how the model did things, why the model chose particular approaches, or even how in-depth its results would be. The details of the AI’s decision making are not shown to me, and the process would be too long to even be worth following. The map required the AI to make judgement calls about hundreds of little choices, and it just made them, without me understanding the choices or having a chance to weigh in. In many ways, it is miraculous (I can always ask for edits at the end) on the other, it turns AI into the ultimate black box.

The most ambitious project I got from Fable takes a little more explanation. I do a lot of research where humans produce messy answers and doing any sort of analysis requires categorize those answers properly: how innovative is an idea? why do people like this book? To figure this out, we used human researchers to make a judgement call about a piece of information, and statistically compare their answers with others to figure out whether we can trust the data. A lot of recent research has shown that AIs might be able to do this important work, but calibrating AI and human judgement has been difficult and expensive. So I asked Fable to solve the problem, first generating a complex 19 page design document and then executing it.

It worked for nine and a half hours.

The result was an extremely sophisticated piece of software the AI called Concord that could take in multiple datasets, calibrate human and AI responses, and then conduct complex data analysis on the results. Again, it wasn’t perfect. As an expert, I was able to spot some errors and omissions (some as a result of the design I had asked for) that I had the AI correct. But the scope of the delivery on this project, and many others, exceeded anything I had seen before. In this case, it was a piece of software that researchers have needed for years but was never profitable to create. You can now just use or modify the code here. I am sure it is not perfect (I only spent an hour working with the results), but a software engineer would iron out the remaining potential bugs that I could not find quickly (which is one reason we may need more, not less, coders in the future, to help with the explosion of new uses for software).

This power goes hand in hand with strangeness and limits. Among those limits is its token usage. Fable is twice as expensive as Opus, and it burns through tokens at a rate that suggests the answer to how much it costs in production is “a lot,” though its clever delegation to cheaper models may lower the real price considerably. The guardrails for Fable also trip at the faintest hint of a security problem, defaulting to the less powerful Claude 4.8 Opus, and it happens way too often. And the jagged frontier is still there. For example, the AI still writes in the same weird style (in fact the software Fable produces bears traces of Claudisms; so do its progress reports, all that carrying the weight and earning the answer). But the deeper strangeness is how little I had to do, and how little I could see while it was being done.

Last year I called this working with a wizard: you chant the spell and something happens. With Fable the spell has gotten powerful enough that I am no longer sure I am the wizard. I am closer to a patron. I describe what I want, I pay for it, and I judge the result. The conjuring happens somewhere I cannot watch, in hundreds of small choices I never get a vote on. The work has shifted from process to outcome. I no longer steer; I commission.

It is possible the sidelining is temporary, just an artifact of interfaces that haven’t caught up, and that we’ll get better windows into what these models are doing and better ways to steer them midstream. It is also possible that the opposite is true: that the more capable the model, the less there is for a human to meaningfully do, and the black box is the price of the power. I suspect that is more likely to be the real direction. None of this is a loss of control in the obvious sense. I can still steer Fable, and it follows instructions remarkably well: the more ambitious the instruction, the better the result. But steering is no longer the same as doing. I brief the model, it spins up its own agents to research and write and check one another’s work, and what comes back is finished. A patron commissions a single artist. Fable is closer to a whole studio, where I am the client who signs off on the final work without ever setting foot on the floor.

与 Mythos 合作的感受 What it feels like to work with Mythos

与 Mythos 合作的感受
What it feels like to work with Mythos