I feel like I got substantial value out of Claude today, and want to document it. I am at the tail end of AI adoption, so I don’t expect to say anything particularly useful or novel. However, I am constantly complaining about the lack of boring AI posts, so it’s only proper if I write one.
At TigerBeetle, we are big on deterministic simulation testing. We even use it to track performance, to some degree. Still, it is crucial to verify performance numbers on a real cluster in its natural high-altitude habitat.
To do that, you need to procure six machines in a cloud, get your custom version of the tigerbeetle binary onto them, connect the cluster’s replicas together, and hit them with load. It feels like, a quarter of a century into the third millennium, “run stuff on six machines” should be a problem just a notch harder than opening a terminal and typing ls, but I personally don’t know how to solve it without wasting a day. So, I spent a day vibecoding my own square wheel.
The general shape of the problem is that I want to spin up a fleet of ephemeral machines with given specs on demand and run ad-hoc commands on them in a SIMD fashion. I don’t want to manually type slightly different commands into a six-way terminal split, but I also do want to be able to ssh into a specific box and poke around.
My idea for the solution comes from these three sources:
The big idea of rsyscall is that you can program a distributed system in direct style. When programming locally, you do things by issuing syscalls:
const fd = open("/etc/passwd");
This API works for doing things on remote machines too, if you specify which machine you want to run the syscall on:
const fd_local = open(.host, "/etc/passwd");
const fd_cloud = open(.{.addr = "1.2.3.4"}, "/etc/passwd");
Direct manipulation is the most natural API, and it pays to extend it over the network boundary.
Peter’s post is an application of a similar idea to the narrow, mundane task of developing on Mac and testing on Linux. Peter suggests two scripts:
- remote-sync synchronizes the local and remote copies of a project. If you run remote-sync inside the ~/p/tb folder, then ~/p/tb materializes on the remote machine. rsync does the heavy lifting, and the wrapper script implements DWIM behaviors.
- remote-run some --command runs the command on the remote machine in the matching directory, forwarding the output back to you. It typically follows remote-sync.
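For concreteness, here is a minimal sketch of what such a pair could look like as a deno script (using the dax library discussed below). This is hypothetical: the REMOTE alias and the exact rsync flags are my assumptions, not Peter’s actual scripts.

import $ from "jsr:@david/dax";

const REMOTE = "my-linux-box"; // assumed ssh alias for the remote machine

// Materialize the current directory at the same path on the remote side.
async function remoteSync() {
  const cwd = Deno.cwd();
  await $`ssh ${REMOTE} mkdir -p ${cwd}`;
  await $`rsync -az --delete ${cwd}/ ${REMOTE}:${cwd}/`;
}

// Run a command on the remote machine, in the matching directory.
async function remoteRun(args: string[]) {
  const command = `cd ${Deno.cwd()} && ${args.join(" ")}`;
  await $`ssh -t ${REMOTE} ${command}`;
}

// Sync, then run.
await remoteSync();
if (Deno.args.length > 0) await remoteRun(Deno.args);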
So, when I want to test local changes to tigerbeetle on
my Linux box, I have roughly the following shell session:
$ cd ~/p/tb/work
$ code . # hack here
$ remote-sync
$ remote-run ./zig/zig build test
The killer feature is that shell completion works. I first type the command I want to run, taking advantage of the fact that local and remote commands are the same, paths and all, then hit ^A and prepend remote-run (in reality, I have an rr alias that combines sync & run).
The big thing here is not the commands per se, but the shift in the mental model. In a traditional ssh & vim setup, you have to juggle two machines with separate state, the local one and the remote one. With remote-sync, the state is the same across the machines; you only choose whether you want to run commands here or there.
With just two machines, the difference feels academic. But if you want to run your tests across six machines, the ssh approach fails — you don’t want to re-vim your changes to source files six times, you really do want to separate the place where the code is edited from the place(s) where the code is run. This is a general pattern — if you are not sure about a particular aspect of your design, try increasing the cardinality of the core abstraction from 1 to 2.
The third component, the dax library, is pretty mundane: just a JavaScript library for shell scripting. The notable aspects there are:
- JavaScript’s template literals, which allow implementing command interpolation in a safe-by-construction way. When processing $`ls ${paths}`, a string is never materialized; it is arrays all the way down to the exec syscall.
- JavaScript’s async/await, which makes managing concurrent processes (local or remote) natural:

  await Promise.all([
      $`sleep 5`,
      $`remote-run sleep 5`,
  ]);

- Additionally, deno specifically valiantly strives to impose process-level structured concurrency, ensuring that no processes spawned by the script outlive the script itself unless explicitly marked detached, a sore spot of UNIX.
Combining the three ideas, I now have a deno script, called box, that provides a multiplexed interface for running
ad-hoc code on ad-hoc clusters.
A session looks like this:
$ cd ~/p/tb/work
$ git status --short
M src/lsm/forest.zig
$ box create 3
108.129.172.206,52.214.229.222,3.251.67.25
$ box list
0 108.129.172.206
1 52.214.229.222
2 3.251.67.25
$ box sync 0,1,2
$ box run 0 pwd
/home/alpine/p/tb/work
$ box run 0 ls
CHANGELOG.md LICENSE README.md build.zig
docs/ src/ zig/
$ box run 0,1,2 ./zig/download.sh
Downloading Zig 0.14.1 release build...
Extracting zig-x86_64-linux-0.14.1.tar.xz...
Downloading completed (/home/alpine/p/tb/work/zig/zig)!
Enjoy!
$ box run 0,1,2 \
./zig/zig build -Drelease -Dgit-commit=$(git rev-parse HEAD)
$ box run 0,1,2 \
./zig-out/bin/tigerbeetle format \
--cluster=0 --replica=?? --replica-count=3 \
0_??.tigerbeetle
2026-01-20 19:30:15.947Z info(io): opening "0_0.tigerbeetle"...
$ box destroy 0,1,2
I like this! I haven’t used it in anger yet, but this is something I have wanted for a long time, and now I have it.
The problem with implementing the above is that I have zero practical experience with the modern cloud. I only created my AWS account today, and just looking at the console interface ignited the urge to re-read The Castle. Not my cup of pu-erh. But I had a hypothesis that AI should be good at wrangling baroque cloud APIs, and it mostly held.
I started with a couple of paragraphs of rough, super high-level description of what I want to get. Not a specification at all, just a general gesture towards unknown unknowns. Then I asked ChatGPT to expand those two paragraphs into a more or less complete spec to hand down to an agent for implementation.
This phase surfaced a bunch of unknowns for me. For example, I wasn’t thinking at all about the need to somehow identify machines. ChatGPT suggested using random hex numbers, and I realized that I needed the 0,1,2 naming scheme to concisely specify batches of machines. While thinking about this, I also realized that a sequential numbering scheme has the advantage that I can’t have two concurrent clusters running, which is a desirable property for my use-case: if I forgot to shut down a machine, I’d rather get an error when trying to re-create a machine with the same name than silently avoid the clash. Similarly, it turns out that the questions of permissions and network access rules are something to think about, as well as which region and which image I need.
With the spec document in hand, I turned to Claude Code for the actual implementation work. The first step was to further refine the spec, asking Claude if anything was unclear. There were a couple of interesting clarifications there.
First, the original ChatGPT spec didn’t get what I meant with my “current directory mapping” idea: that I want to materialize a local ~/p/tb/work as a remote ~/p/tb/work, even if the ~s are different. ChatGPT generated an incorrect description and an incorrect example. I manually corrected the example, but wasn’t able to write a concise and correct description. Claude fixed that, working from the example. I feel like I need to internalize this more: for the current crop of AI, examples seem to be far more valuable than rules.
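To make the idea concrete, here is a minimal sketch of the mapping I had in mind (the function name and signature are mine, purely for illustration):

// Strip the local home prefix and re-anchor the path at the remote home, so
// that ~/p/tb/work maps to ~/p/tb/work even when the two homes differ.
function remotePath(localPath: string, localHome: string, remoteHome: string): string {
  if (!localPath.startsWith(localHome + "/")) throw new Error("path is not under the home directory");
  return remoteHome + localPath.slice(localHome.length);
}

// remotePath("/Users/matklad/p/tb/work", "/Users/matklad", "/home/alpine")
// returns "/home/alpine/p/tb/work"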
Second, the spec included my desire to auto-shutdown machines once I no longer use them, just to make sure I don’t forget to turn the lights off when leaving the room. Claude grilled me on what precisely I want there, and I asked it to DWIM the thing.
The spec ended up being 6KiB of English prose. The final implementation was 14KiB of TypeScript. I wasn’t keeping the spec and the implementation perfectly in sync, but I think they ended up pretty close in the end. Which means that prose specifications are somewhat more compact than code, but not much more compact.
My next step was to try to just one-shot this. Ok, this is embarrassing, and I usually avoid swearing in this blog, but I just typoed that as “one-shit”, and, well, that is one flavorful description I won’t be able to improve upon. The result was just not good (more on why later), so I almost immediately decided to throw it away and start a more incremental approach.
In my previous vibe-post, I noticed that LLMs are good at closing the loop. A variation here is that LLMs are good at producing results, and not necessarily good code. I am pretty sure that, if I had let the agent iterate on the initial script and actually run it against AWS, I would have gotten something working. I didn’t want to go that way, for three reasons:
- Spawning VMs takes time, and that significantly reduces the throughput of agentic iteration.
- No way was I letting the agent run with a real AWS account, given that AWS doesn’t have a fool-proof way to cap costs.
- I am fairly confident that this script will be a part of my workflow for at least several years, so I care more about long-term code maintenance than about the immediate result.
And, as I said, the code didn’t feel good, for these specific reasons:
- It wasn’t the code that I would have written; it lacked my character, which made it hard for me to understand at a glance.
- The code lacked any character whatsoever. It could have worked, and it wasn’t “naively bad”, like the first code you write when you are learning programming, but there wasn’t anything good there either.
- I never know what the code should be up-front. I don’t design solutions, I discover them in the process of refactoring. Some of my best work was spending a quiet weekend rewriting large subsystems implemented before me, because, with an implementation at hand, it was possible for me to see the actual, beautiful core of what needed to be done. With a slop-dump, I don’t even get to see what could be wrong.
- In particular, while you are working the code (as in “wrought iron”), you often go back to the requirements and change them. Remember the ambiguity of my request to “shut down idle cluster”? Claude tried to DWIM and created some horrific mess of bash scripts, timestamp files, PAM policy, and systemd units. But the right answer there was “let’s maybe not have that feature?” (in contrast, simply shutting the machine down after 8 hours is a one-liner; see the sketch below).
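A sketch of that one-liner, assuming the auto-shutdown is baked into the user data passed at instance creation (the userDataBase64 variable matches the run-instances snippet later in the post; the exact wiring is my assumption):

// Schedule a poweroff 480 minutes (8 hours) after boot via the instance user data.
const userData = "#!/bin/sh\nshutdown -P +480\n";
const userDataBase64 = btoa(userData);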
The incremental approach worked much better; Claude is good at filling in the blanks. The very first thing I did for box-v2 was manually typing in:
type CLI =
| CLICreate
| CLIDestroy
| CLIList
| CLISync
type BoxList = string[];
type CLICreate = { tag: "create"; count: number };
type CLIDestroy = { tag: "destroy"; boxes: BoxList };
type CLIList = { tag: "list" };
type CLISync = { tag: "sync"; boxes: BoxList; };
function fatal(message: string): never {
console.error(message);
Deno.exit(1);
}
function CLIParse(args: string[]): CLI {
}
Then I asked Claude to complete the CLIParse function, and I was happy with the result. Note the “Show, Don’t Tell” here: I am not asking Claude to avoid throwing an exception and to fail fast instead. I just give it the fatal function, and it code-completes the rest.
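For a sense of the shape, here is an illustrative sketch of such a completion (my reconstruction, not Claude’s verbatim output):

function CLIParse(args: string[]): CLI {
  const [command, ...rest] = args;
  switch (command) {
    case "create": {
      const count = Number(rest[0]);
      if (!Number.isInteger(count) || count < 1) fatal("create: expected a positive machine count");
      return { tag: "create", count };
    }
    case "destroy":
      if (rest.length != 1) fatal("destroy: expected a comma-separated box list");
      return { tag: "destroy", boxes: rest[0].split(",") };
    case "list":
      return { tag: "list" };
    case "sync":
      if (rest.length != 1) fatal("sync: expected a comma-separated box list");
      return { tag: "sync", boxes: rest[0].split(",") };
    default:
      fatal(`unknown command: ${command}`);
  }
}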
I can’t say that the code inside CLIParse is top-notch. I’d probably have written something more spartan. But the important part is that, at this level, I don’t care. The abstraction for parsing CLI arguments feels right to me, and the details I can always fix later. This is how the overall vibe-coding session transpired: I was providing structure, Claude was painting by the numbers.
In particular, with that CLI parsing structure in place, Claude had little problem adding new subcommands and new arguments in a satisfactory way. The only snag was that, when I asked it to add an optional path to sync, it went with string | null, while I strongly prefer string | undefined. Obviously, it’s better to pick your null in JavaScript and stick with it, and the fact that undefined is unavoidable predetermines the winner. Given that the argument was added as an incremental small change, course-correcting was trivial.
The null vs undefined issue perhaps illustrates my complaint about the code lacking character. | null is the default non-choice. | undefined is an insight, which I personally learned from the VS Code LSP implementation.
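A small example of why undefined wins, with a hypothetical variant of the sync type:

type CLISyncV2 = { tag: "sync"; boxes: BoxList; path?: string };

// A missing property already reads as undefined, so there is exactly one
// "no value" case to handle. Had we picked string | null, downstream code
// would sooner or later have to check for both null and undefined.
const sync: CLISyncV2 = { tag: "sync", boxes: ["0"] };
if (sync.path === undefined) console.log("syncing the current directory");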
The hand-written skeleton / vibe-coded guts approach worked not only for the CLI. I wrote
async function main() {
const cli = CLIParse(Deno.args);
if (cli.tag === "create") return await mainCreate(cli.count);
if (cli.tag === "destroy") return await mainDestroy(cli.boxes);
...
}
async function mainDestroy(boxes: string[]) {
for (const box of boxes) {
await instanceDestroy(box);
}
}
async function instanceDestroy(id: string) {
}
and then asked Claude to write the body of a particular function according to the SPEC.md.
Unlike with the CLI, Claude wasn’t able to follow this pattern by itself. With one example it’s not obvious, but the overall structure is that instanceXXX is the AWS-level operation on a single box, and mainXXX is the CLI-level control flow that deals with looping and parallelism. When I asked Claude to implement box run without myself doing the main / instance split, Claude failed to notice the pattern and needed a course correction.
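For illustration, the split for box run could look roughly like this (a sketch; the instanceAddress helper and the ssh details are my assumptions):

// CLI-level control flow: fan out across the requested boxes in parallel.
async function mainRun(boxes: string[], args: string[]) {
  await Promise.all(boxes.map((box) => instanceRun(box, args)));
}

// AWS-level operation on a single box: resolve the box name to an IP,
// then run the command in the matching directory over ssh.
async function instanceRun(box: string, args: string[]) {
  const ip = await instanceAddress(box); // hypothetical helper that queries AWS
  const command = `cd ${Deno.cwd()} && ${args.join(" ")}`;
  await $`ssh alpine@${ip} ${command}`;
}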
However, Claude was massively successful with the actual logic. It would have taken me hours to acquire the specific, non-reusable knowledge needed to write:
const instanceMarketOptions = JSON.stringify({
MarketType: "spot",
SpotOptions: { InstanceInterruptionBehavior: "terminate" },
});
const tagSpecifications = JSON.stringify([
{ ResourceType: "instance", Tags: [{ Key: moniker, Value: id }] },
]);
const result = await $`aws ec2 run-instances \
--image-id ${image} \
--instance-type ${instanceType} \
--key-name ${moniker} \
--security-groups ${moniker} \
--instance-market-options ${instanceMarketOptions} \
--user-data ${userDataBase64} \
--tag-specifications ${tagSpecifications} \
--output json`.json();
const instanceId = result.Instances[0].InstanceId;
await $`aws ec2 wait instance-status-ok --instance-ids ${instanceId}`;
I want to be careful — I can’t vouch for correctness and especially completeness of the above snippet. However, given that the nature of the problem is such that I can just run the code and see the result, I am fine with it. If I were writing this myself, trial-and-error would totally be my approach as well.
Then there’s synthesis: with several instance commands implemented, I noticed that many started with querying AWS to resolve a symbolic machine name, like “1”, to the AWS name/IP. At that point I realized that resolving symbolic names is a fundamental part of the problem, and that it should only happen once, which resulted in the following refactored shape of the code:
async function main() {
const cli = CLIParse(Deno.args);
const instances = await instanceMap();
if (cli.tag === "create") return await mainCreate(instances, cli.count);
if (cli.tag === "destroy") return await mainDestroy(instances, cli.boxes);
...
}
Claude was ok with extracting the logic, but messed up the overall code layout, so the final code motions were on me: “context” arguments go first, not last, and a common prefix is more valuable than a common suffix because of visual alignment.
The original “one-shotted” implementation also didn’t do up-front querying. This is an example of a shape of a problem I only discover when working with code closely.
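For concreteness, a hypothetical sketch of what instanceMap could look like, assuming boxes are tagged with the moniker at creation, as in the run-instances snippet above:

// Resolve all symbolic box names ("0", "1", ...) to public IPs in one query.
async function instanceMap(): Promise<Map<string, string>> {
  const result = await $`aws ec2 describe-instances \
    --filters Name=tag-key,Values=${moniker} Name=instance-state-name,Values=running \
    --output json`.json();
  const instances = new Map<string, string>();
  for (const reservation of result.Reservations) {
    for (const instance of reservation.Instances) {
      const tag = instance.Tags.find((tag: { Key: string }) => tag.Key === moniker);
      if (tag !== undefined) instances.set(tag.Value, instance.PublicIpAddress);
    }
  }
  return instances;
}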
Of course, the script didn’t work perfectly the first time, and we needed quite a few iterations on the real machines, both to fix coding bugs and to close gaps in the spec. That was an interesting experience of speed-running rookie mistakes. Claude made naive bugs, but was also good at fixing them.
For example, when I first tried to box ssh after box create, I got an error. Pasting it into Claude immediately showed the problem. Originally, the code was doing aws ec2 wait instance-running and not aws ec2 wait instance-status-ok. The former checks that the instance is logically created, while the latter waits until the OS is booted. It makes sense that these two exist, and the difference is clear (and it’s also clear that OS booted != SSH daemon started). Claude’s value here is in providing specific names for the concepts I already know to exist.
Another fun one was about the disk. I noticed that, while the instance had an SSD, it wasn’t actually used. I asked Claude to mount it as home, but that didn’t work. Claude immediately asked me to run

$ box run 0 cat /var/some/unintuitive/long/path.log

and that log immediately showed the problem. This is remarkable! 50% of my typical Linux debugging day is wasted not knowing that a useful log exists, and the other 50% is spent searching for the log I know should exist somewhere.
After the fix, I lost the ability to SSH. Pasting the error immediately gave the answer: by mounting over /home, we were overwriting the ssh keys configured earlier.
There were a couple more iterations like that. Rookie mistakes were made, but they were debugged and fixed much faster than my personal knowledge allows (and again, I feel that this is trivia knowledge, rather than deep reusable knowledge, so I am happy to delegate it!).
It worked satisfactorily in the end, and, what’s more, I am happy to maintain the code, at least to the extent that I personally need it. It is kinda hard to measure the productivity boost here, but, given just the sheer number of CLI flags required to make this work, I am pretty confident that time was saved, even factoring in the writing of the present article!
I’ve recently read The Art of Doing Science and Engineering by Hamming (of distance and code), and one story stuck with me:
A psychologist friend at Bell Telephone Laboratories once built a machine with about 12 switches and a red and a green light. You set the switches, pushed a button, and either you got a red or a green light. After the first person tried it 20 times they wrote a theory of how to make the green light come on. The theory was given to the next victim and they had their 20 tries and wrote their theory, and so on endlessly. The stated purpose of the test was to study how theories evolved.
But my friend, being the kind of person he was, had connected the lights to a random source! One day he observed to me that no person in all the tests (and they were all high-class Bell Telephone Laboratories scientists) ever said there was no message. I promptly observed to him that not one of them was either a statistician or an information theorist, the two classes of people who are intimately familiar with randomness. A check revealed I was right!