# As Rocks May Think

Original link: https://evjang.com/2026/02/04/rocks.html

## The Rise of Automated Thinking and Research

Recent advances in AI, particularly coding agents like Claude, are fundamentally changing how we interact with computers and conduct research. The author describes a shift from writing code by hand to using AI to automate complex tasks, with a reimplementation of AlphaGo as the running example. This goes beyond code generation: the agents can form hypotheses, design experiments, analyze results, and even suggest future research directions, effectively acting as "automated scientists." A key innovation is a standardized "experiment" command that lets the AI run and report on research independently, greatly improving productivity. This surpasses traditional automated-tuning systems, because these agents can modify the code itself and reason about the results. The author predicts a "golden age" of computer science in which previously intractable problems become solvable with commonly available resources. This will drive enormous demand for inference compute, with an impact comparable to what air conditioning did for global productivity. Ultimately, the technology promises to democratize innovation, enabling rapid development and problem-solving across fields: from software engineering to scientific discovery, perhaps even mathematical proof.

An article titled "As Rocks May Think" sparked discussion on Hacker News, highlighting skepticism toward the rapid pace of AI progress. A commenter going by "measurablefunc" tempered the excitement with the material limits facing AI development. They pointed to a key bottleneck: GPU lifespans of 1-3 years against ever-growing demand, which is projected to outstrip production capacity by 2027-2028. TSMC, the leading chip manufacturer, is already struggling to keep up, and datacenter power consumption keeps climbing, hitting limits in 2025. The commenter argued that optimistic AI narratives tend to ignore these real constraints on materials, production, and energy, presenting a biased picture of the technology's actual progress. In short, "thinking rocks" require vast physical resources, and those resources are becoming increasingly scarce.

## Original Article


Chief among all changes is that machines can code and think quite well now.

Like many others, I spent the last 2 months on a Claude Code bender, grappling with the fact that I no longer need to write code by hand. I've been implementing AlphaGo from scratch (repo will be open sourced soon) to catch up on foundational deep learning techniques, and also to re-learn how to program with the full power of modern coding agents. I've set up Claude not only to write my infra and implement my research ideas, but also to propose hypotheses, draw conclusions, and suggest which experiments to try next. For those of you reading on desktop & tablet, the right side of this page shows examples of real prompts that I asked Claude to write for me.

For my "automated AlphaGo researcher" codebase, I created a Claude command /experiment which standardizes an "action" in the AlphaGo research environment as follows:

  1. Create a self-contained experiment folder with datetime prefix and descriptive slug.
  2. Write the experiment routine as a single Python file and execute it.
  3. Save intermediate artifacts and data to data/ and figures/ subdirectories, in easy-to-parse formats like CSV files that can be loaded with pandas.
  4. Observe the outcome, draw conclusions from the experiment, and note what is now known and what remains unknown.

The outcome of the experiment is a report.md markdown file that summarizes the latest observation about the world (example).
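
To make the convention concrete, here is a minimal sketch of what one such single-file experiment script might do. The slug, metrics, and numbers are illustrative placeholders, not taken from the actual codebase:

```python
# Hypothetical scaffold for a single-file /experiment script; the slug, metrics,
# and values are illustrative placeholders, not from the actual codebase.
from datetime import datetime
from pathlib import Path

import pandas as pd

# 1. Self-contained experiment folder: datetime prefix plus descriptive slug.
slug = "lr-sweep-goresnet"
exp_dir = Path(f"{datetime.now():%Y-%m-%d_%H-%M}-{slug}")
(exp_dir / "data").mkdir(parents=True, exist_ok=True)
(exp_dir / "figures").mkdir(exist_ok=True)

# 2/3. Run the experiment and save artifacts in easy-to-parse formats (CSV).
results = pd.DataFrame({"lr": [1e-4, 3e-4, 1e-3],
                        "val_loss": [2.31, 2.17, 2.42]})
results.to_csv(exp_dir / "data" / "lr_sweep.csv", index=False)

# 4. Summarize what is now known and what remains unknown in report.md.
best = results.loc[results["val_loss"].idxmin()]
(exp_dir / "report.md").write_text(
    f"# Report: {slug}\n\n"
    f"Best learning rate: {best.lr:g} (val loss {best.val_loss}).\n"
    "Known: mid-range LR wins at this scale. Unknown: transfer to larger widths.\n"
)
```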

Here is an example of how I'd use it:

> /experiment I'd like to apply maximal update parameterization to find the best hyperparameters to run my model on as I scale it up. Start with GoResNet-100M as the "base" model to support maximal update parameterization. Use https://github.com/microsoft/mup package if it helps, making sure to add it to pyproject.toml so that it is installed as a dependency. Utilize d-muP https://arxiv.org/abs/2310.02244 as well to ensure depth-wise stability transfer. Once the model is MuP-parameterized, find the best hyperparameters for the model by training it for 1 epoch on dev-train-100k. You can submit up to 4 parallel Ray jobs at a time to train models. Evaluate validation loss and accuracy after every 500 steps. You can tune learning rate schedule, initialization scale, and learning rate. I think critical batch size should be around 32-64. You can refer to 2025-12-26_19-13-resnet-scaling-laws.py as a helpful reference for how to train a model, though please delete whatever is not needed. For all runs, save intermediate checkpoints every 1k steps to research_reports/checkpoints
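
For reference, the microsoft/mup package's usual pattern looks roughly like the sketch below. The tiny stand-in model and the 361-way readout (one logit per point of a 19x19 board) are my assumptions for illustration, not the author's GoResNet code:

```python
# Minimal sketch of the mup usage pattern (https://github.com/microsoft/mup).
# The tiny MLP stands in for the real Go model; widths and LR are illustrative.
import torch.nn as nn
from mup import MuAdam, MuReadout, set_base_shapes

def make_model(width: int) -> nn.Module:
    # muP requires the output layer to be a MuReadout so its init/LR scale correctly.
    return nn.Sequential(nn.Linear(64, width), nn.ReLU(), MuReadout(width, 361))

base = make_model(width=192)    # "base" model defining the parameterization
delta = make_model(width=384)   # "delta" model, used to infer which dims scale
model = make_model(width=768)   # the target model actually being trained

set_base_shapes(model, base, delta=delta)  # rescale inits, set per-layer multipliers
opt = MuAdam(model.parameters(), lr=3e-4)  # muP-aware optimizer; tuned LR transfers
```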

I can also ask Claude to run sequential experiments to optimize hyperparameters serially:

> /experiment Run a series of experiments similar to 2025-12-27_22-18-mup-training-run.py, trying to obtain the best policy validation accuracy while staying within the FLOP budget, but with the following changes:
> - After each experiment finishes, reflect on the results and think about what to try next. Generate a new experiment script with changes.
> - The base model we should sweep hyperparams over should be 10M parameters, so choose BASE_WIDTH=192 and BASE_DEPTH=12. We will tune this model. DELTA_WIDTH=384 and DELTA_DEPTH=12.
> - FLOP budget of 1e15 FLOPs per experiment.
> - Each time a result comes back, review the results and past experiments to make a good guess on what you should try next. Make 10 such sequential experiments, and write a report summarizing what you've learned.
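
As a sanity check on that budget, here is a back-of-envelope calculation using the common 6ND training-FLOPs approximation; that rule of thumb comes from transformers, so the actual codebase may count FLOPs differently:

```python
# What does 1e15 FLOPs buy for a 10M-parameter model?
# Training FLOPs ~= 6 * N (params) * D (samples); illustrative, not the repo's accounting.
N = 10e6                 # parameters in the base model
budget = 1e15            # FLOPs per experiment
D = budget / (6 * N)     # ~1.7e7 training samples
batch_size = 64          # upper end of the suggested critical batch size
steps = D / batch_size   # ~260k optimizer steps
print(f"{D:.2e} samples -> ~{steps:,.0f} steps at batch size {batch_size}")
```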

Unlike the prior generation of "automated tuning" systems like Google's Vizier, which use Gaussian Process bandits over a user-defined space of hyperparameters, modern coding agents can change the code itself. Not only is their search space unconstrained; they can also reflect on whether the experimental results are consistent, formulate theories to explain those results, and test predictions based on those theories. Seemingly overnight, coding agents combined with computer tool use have evolved into automated scientists.

Software engineering is just the beginning; the real kicker is that we now have general-purpose thinking machines that can use computers and tackle just about any short digital problem. Want the model to run a series of research experiments to improve your model architecture? No problem. Want an entire web browser implemented from scratch? Takes a while, but doable. Want to prove unsolved math problems? They can do it without even asking to be a co-author. Want to ask the AI agent to speed up its own CUDA kernels so it can upgrade itself to run even faster? Scary, but ok.

Excellent debugging and problem solving fall out of reasoning, and those skills in turn unlock the ability to doggedly pursue goals. This is why coding REPL agents have seen such rapid adoption: they are relentless at pursuing their goals and can search well.

We are entering a golden age in which all computer science problems seem tractable, insofar as we can get very useful approximations of any computable function. I would not go so far as to say "computational hardness can be ignored," but if we look at the last decade of progress, Go, protein folding, music and video generation, and automated math proving were all once thought to be computationally infeasible and are now within the grasp of a PhD student's computing resources. AI startups are applying LLMs to discover new physics and new investment strategies with nothing but a handful of verifiers in their pocket and a few hundred megawatts of compute. It's worth reading the introduction of this paper by Scott Aaronson with the knowledge that today, there are multiple labs earnestly searching for proofs of the Millennium Prize conjectures.

I am being intentionally over-exuberant here, because I want you to contemplate not AI's capabilities in this absolute moment in time, but the velocity of progress and what this means for capabilities in the next 24 months. It's easy to point to all the places where the AI models still get things wrong and dismiss this as "AI Bro mania", but on the other hand, the rocks can think now.

Coding assistants will soon become so good that they can conjure any digital system effortlessly, like a wish-granting genie for the price of $20 a month. Soon, an engineer will be able to point their AI of choice at the website of any SaaS business and say, "re-implement that: frontend, backend, API endpoints, spin up all the services, I want it all."

## The Market Cap of Thought

> It changed the nature of civilization by making development possible in the tropics. Without air conditioning you can work only in the cool early-morning hours or at dusk.
>
> — Lee Kuan Yew, on air conditioning

Automated research will soon become the standard workflow in high-output labs. Any researcher who is still hand-writing architectures and submitting jobs one by one to Slurm will fall behind in productivity compared to researchers who have 5 parallel Claude Code terminals, each doggedly pursuing its own high-level research track with a big pool of compute.

Compared to the massive hyperparameter search experiments that Googlers used to run, the information gain per FLOP in an automated research setup is very high. Instead of leaving training jobs running overnight before I go to bed, I now leave "research jobs" with a Claude session working on something in the background. I wake up and read the experimental reports, write down a remark or two, and then ask for 5 new parallel investigations. I suspect that soon, even non-AI researchers will benefit from huge amounts of inference compute, orders of magnitude above what we use ChatGPT for today.
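
One hypothetical way to leave such research jobs running overnight is sketched below, assuming the claude CLI's headless -p (print) mode; the track prompts and log paths are placeholders, not the author's actual setup:

```python
# Hypothetical overnight launcher: one headless Claude Code session per research track.
# Assumes the `claude` CLI's -p (print/headless) mode; prompts are placeholders.
import subprocess
from pathlib import Path

tracks = [
    "/experiment sweep learning-rate schedules under the 1e15 FLOP budget",
    "/experiment ablate value-head depth on the Go model",
    "/experiment test muP hyperparameter transfer from 10M to 100M params",
]

Path("research_reports").mkdir(exist_ok=True)
procs = [
    subprocess.Popen(["claude", "-p", prompt],
                     stdout=open(f"research_reports/track_{i}.log", "w"))
    for i, prompt in enumerate(tracks)
]
for p in procs:
    p.wait()  # read the reports in the morning
```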

Modern coding agents are profoundly useful for teaching and communication as well. I'm looking forward to every codebase having a /teach command that helps onboard contributors of any skill level, recalling the very trails of thought that the original designers went through, just like Vannevar Bush predicted in As We May Think.

Based on my own usage patterns, it's beginning to dawn on me how much inference compute we will need in the coming years; I don't think people have begun to fathom the scale. Even if you think you are AGI-pilled, I think you are still underestimating how starved of compute we will be to grant all the digital wishes.

As air conditioning unlocked productivity in the global south, automated thinking will create astronomical demand for inference compute. Air conditioning currently consumes 10% of global electricity production, while datacenter compute consumes less than 1%. We will have rocks thinking all the time to further the interests of their owners. Every corporation with GPUs to spare will have ambient thinkers constantly re-planning deadlines, reducing tech debt, and trawling for information that helps the business make decisions in a dynamic world. 007 is the new 996.

Militaries will scramble every FLOP they can find to play out wargames, like rollouts in an MCTS search. What will happen when the first decisive war is won not by guns and drones, but by compute and information advantage? Stockpile your thinking tokens, for thinking begets better thinking.
