LLMs predict my coffee

Original link: https://dynomight.net/coffee/

This experiment tests whether large language models (LLMs) can accurately predict the cooling rate of boiling water poured into a ceramic mug. Although the setup seems simple (8 oz / 226.8 g of boiling water poured into a 1.25 lb mug at an ambient temperature of 20 °C), the author acknowledges that the problem is complicated, involving conduction, convection, evaporation, radiation, and many unspecified variables. Several LLMs (Kimi, Gemini, GPT, Claude, Qwen, GLM) were asked for an equation predicting the water's temperature over time. All of them produced equations built from exponentially decaying terms, attempting to capture both fast and slow heat transfer. Compared against the actual experiment (temperatures recorded every 5 to 300 seconds), however, the predictions were *inaccurate*: they underestimated the initial cooling rate and overestimated the later one. Claude 4.6 Opus did best, though still imperfectly, and at the highest cost. The author concludes that while LLMs can produce reasonable approximations, they cannot yet accurately model complex physical phenomena; the author's own intuition, for what it's worth, fared even worse.

## LLMs and Coffee Cooling: An Unexpectedly Complex Problem

A recent Hacker News thread discusses an attempt to use large language models (LLMs) to predict how quickly coffee cools. The experiment is not about *making* coffee but about the physics of heat loss from water. The author asked LLMs to generate equations modeling the temperature decay, and they mostly produced reasonable but imperfect results resembling Newton's law of cooling with multiple exponential terms.

Commenters quickly pointed out the complexity involved, including factors such as convection, evaporation, and the heat absorbed by the mug itself. Many argued that the LLMs' success is unsurprising, since the underlying physics is well established and readily available in training data.

The discussion centers on the subtleties of precise modeling: the importance of initial conditions (such as pouring temperature), material properties, and even the mug's starting temperature. Some users shared their own related projects, such as an espresso-machine tuning app, highlighting the practical challenges of precision in brewing. Ultimately, the post sparked a debate about the level of understanding required to accurately model even seemingly simple physical phenomena.

Original article

Coding, math, whatever. Can LLMs predict the outcomes of physical experiments?

Suppose I pour 8 oz (226.8 g) of boiling water into a ceramic coffee mug that weighs 1.25 lb (0.57 kg). The ambient air is still and 20 degrees Celsius. The cup starts at room temperature. Give me an equation for the temperature of the water in Celsius over time. The only free variable in the equation should be the number of seconds t since the water was poured. Focus on accuracy during the first 5 minutes.

Does that seem hard? I think it’s hard. The relevant physical phenomena include at least:

  1. Conduction of heat between the water, the mug, the air, and the table.
  2. Conduction of heat inside each of those things.
  3. Convection (fluid movement) inside the water and the air.
  4. Evaporation cooling as water molecules become vapor.
  5. Movement of water vapor in the air.
  6. Radiation. (Like all matter, the mug and water emit temperature-dependent infrared radiation.)
  7. Surface tension, thermal expansion/contraction, re-absorption of air into the water as it cools, probably more.
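To get a feel for whether a term like radiation (item 6) even matters, here is a back-of-envelope estimate using the Stefan-Boltzmann law. The surface area and emissivity are assumed values, not taken from the experiment:

```python
# Rough estimate of net radiative heat loss from the exposed water surface,
# via the Stefan-Boltzmann law: P = eps * sigma * A * (T_hot^4 - T_cold^4).
# Area and emissivity below are assumptions for illustration only.
SIGMA = 5.670e-8      # Stefan-Boltzmann constant, W/(m^2 K^4)
emissivity = 0.95     # assumed; water is close to a blackbody in the infrared
area = 0.004          # m^2, assumed ~7 cm diameter mug opening
T_water = 373.15      # K, boiling water
T_ambient = 293.15    # K, 20 degrees Celsius

power = emissivity * SIGMA * area * (T_water**4 - T_ambient**4)
print(f"net radiative loss: {power:.1f} W")  # a few watts at most
```

A few watts against roughly 300 g of water (which holds about 4.2 J per gram per degree) suggests radiation alone cools the water well under a degree per minute, so it is a real but secondary contributor.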

And many details aren’t specified in the prompt. Is the mug made of porcelain or stoneware? What is the mug’s shape? What is the table made of? How humid is the air? How am I reducing the spatially varying water temperature to a single number?

So this isn’t a problem with a “correct” answer that you can find by thinking. Reality is too complicated. Instead, answering the question requires “taste”: guessing which factors are most important, making assumptions about missing details, etc.

So I put that question to a bunch of LLMs. Here is what they said:

(Technically, they gave equations as text. I’m plotting those equations.)
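The plotted curves can be reproduced directly from the equations the models returned (listed in full in the appendix table below). A minimal sketch:

```python
import math

# Equations the models returned, from the appendix table.
# Each maps t (seconds since pouring) to predicted water temperature in Celsius.
MODELS = {
    "Kimi K2.5":       lambda t: 20 + 52.9  * math.exp(-t / 3600)    + 27.1 * math.exp(-t / 80),
    "Gemini 3.1 Pro":  lambda t: 20 + 53    * math.exp(-t / 2500)    + 27   * math.exp(-t / 149.25),
    "GPT 5.4":         lambda t: 20 + 54.6  * math.exp(-t / 2920)    + 25.4 * math.exp(-t / 68.1),
    "Claude 4.6 Opus": lambda t: 20 + 55    * math.exp(-t / 1700)    + 25   * math.exp(-t / 43),
    "Qwen3-235B":      lambda t: 20 + 53.17 * math.exp(-t / 1414.43),
    "GLM-4.7":         lambda t: 20 + 53.2  * math.exp(-t / 2500),
}

for name, T in MODELS.items():
    print(f"{name}: T(0) = {T(0):.1f} C, T(300) = {T(300):.1f} C")
```

Note that the two single-exponential models (Qwen3-235B and GLM-4.7) start at about 73 °C rather than 100 °C; they effectively fold the rapid initial drop into the starting value.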

I was surprised by those curves, both in terms of how fast they think the temperature will drop in the beginning, and how slowly they think it will drop later on. They think you get as much cooling in the first few minutes as you do in the rest of the hour. Can that be right?

Then I did the experiment. First, I waited until the ambient temperature happened to reach 20 degrees Celsius. Then, I put 8 oz of water into a measuring cup, microwaved it until it reached a boil, let the temperature equalize a bit, and then microwaved it until the water boiled again. Then, I poured the water into a 1.25 lb coffee mug with a digital thermometer in it and shouted out measurements every five seconds, which were frantically recorded by the Dynomight Biologist. Gradually I reduced measurements to every 15 seconds, 30 seconds, 1 minute, and then 5 minutes.

Behold:

Or, here’s a zoomed-in view of the first five minutes:

The predictions were all OK, but none were great. Probably Claude 4.6 Opus did best, albeit after consuming $0.61 of tokens. (Insert joke about physical experiments / Department of Defense / money / coffee.)

That said, what surprised me about the predictions was how quickly the temperature dropped in the first few minutes, and how slowly it dropped later on. But experimentally, it dropped even faster early on, and even slower towards the end. So if you wanted to ensemble my intuition with the LLM, I guess my intuition would get a weight of zero.

In conclusion, they may take our math, but they’ll somewhat more slowly take our fine motor control. Thank you for reading another middle-school science project.

(Appendix: The equations)

Here were the actual equations all of the models gave for T(t), the predicted temperature after t seconds.

| LLM | T(t) | Cost |
| --- | --- | --- |
| Kimi K2.5 (reasoning) | 20 + 52.9 exp(-t/3600) + 27.1 exp(-t/80) | $0.01 |
| Gemini 3.1 Pro | 20 + 53 exp(-t/2500) + 27 exp(-t/149.25) | $0.09 |
| GPT 5.4 | 20 + 54.6 exp(-t/2920) + 25.4 exp(-t/68.1) | $0.11 |
| Claude 4.6 Opus (reasoning) | 20 + 55 exp(-t/1700) + 25 exp(-t/43) | $0.61 (eeek) |
| Qwen3-235B | 20 + 53.17 exp(-t/1414.43) | $0.009 |
| GLM-4.7 (reasoning) | 20 + 53.2 exp(-t/2500) | $0.03 |

Interestingly, they were all based on one or two exponentially decaying terms. The way to read these is to think of exp(-t/b) as a function that starts out at one when t is zero, and gradually decreases. After b seconds, it has dropped to 1/e ≈ 0.368, and it continues dropping by factors of 0.368 every b seconds forever.
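A quick numeric check of that behavior, using the b = 80 second "fast" time constant from Kimi's equation:

```python
import math

b = 80.0  # time constant in seconds (the fast term in Kimi's equation)

def decay(t):
    """exp(-t/b): 1 at t=0, shrinking by a factor of 1/e every b seconds."""
    return math.exp(-t / b)

print(decay(0))      # 1.0
print(decay(b))      # 1/e, about 0.368
print(decay(2 * b))  # (1/e)^2, about 0.135
```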

So most of these models have a “fast rate” which reflects heat flow from the water into the mug along with a “slow rate” for heat from the water/mug to flow into the air. A few of the models skip the fast rate. I also tried DeepSeek and Grok but they just flailed around endlessly without ever returning an answer. They were kind enough to charge me for that service.
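One could also go the other way and fit such a two-rate model to recorded measurements. Here is a hypothetical sketch, assuming SciPy is available; the "measurements" are synthetic, generated from Claude's coefficients rather than the actual experimental data:

```python
import numpy as np
from scipy.optimize import curve_fit

def two_rate(t, a, tau_slow, c, tau_fast):
    """Ambient 20 C plus a slow (water/mug -> air) and a fast (water -> mug) decay term."""
    return 20 + a * np.exp(-t / tau_slow) + c * np.exp(-t / tau_fast)

# Synthetic "measurements" for illustration, generated from Claude's equation.
t = np.linspace(0, 3600, 200)
temps = two_rate(t, 55, 1700, 25, 43)

# Least-squares fit; p0 is a rough initial guess for the four parameters.
popt, _ = curve_fit(two_rate, t, temps, p0=[50, 1500, 20, 60])
a, tau_slow, c, tau_fast = popt
print(f"fitted: 20 + {a:.1f} exp(-t/{tau_slow:.0f}) + {c:.1f} exp(-t/{tau_fast:.0f})")
```

With real, noisy readings the fit would of course need sensible bounds and would recover the time constants only approximately.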
