I feel like I got substantial value out of Claude today, and want to document it. I am at the tail end of AI adoption, so I don’t expect to say anything particularly useful or novel. However, I am constantly complaining about the lack of boring AI posts, so it’s only proper if I write one.
At TigerBeetle, we are big on deterministic simulation testing. We even use it to track performance, to some degree. Still, it is crucial to verify performance numbers on a real cluster in its natural high-altitude habitat.
To do that, you need to procure six machines in a cloud, get your custom version of the tigerbeetle binary onto them, connect the cluster’s replicas together, and hit them with load. It feels like, a quarter of a century into the third millennium, “run stuff on six machines” should be a problem just a notch harder than opening a terminal and typing ls, but I personally don’t know how to solve it without wasting a day. So, I spent a day vibecoding my own square wheel.
The general shape of the problem is that I want to spin up a fleet of ephemeral machines with given specs on demand and run ad-hoc commands on them in a SIMD fashion. I don’t want to manually type slightly different commands into a six-way terminal split, but I also do want to be able to ssh into a specific box and poke around.
My idea for the solution comes from these three sources:
The big idea of rsyscall is that you can program a distributed system in direct style. When programming locally, you do things by issuing syscalls:
const fd = open("/etc/passwd");
This API works for doing things on remote machines too, if you specify which machine you want to run the syscall on:
const fd_local = open(.host, "/etc/passwd");
const fd_cloud = open(.{.addr = "1.2.3.4"}, "/etc/passwd");
Direct manipulation is the most natural API, and it pays to extend it over the network boundary.
Peter’s post is an application of a similar idea to the narrow, mundane task of developing on Mac and testing on Linux. Peter suggests two scripts:
- remote-sync synchronizes the local and remote copies of a project. If you run remote-sync inside the ~/p/tb folder, then ~/p/tb materializes on the remote machine. rsync does the heavy lifting, and the wrapper script implements DWIM behaviors.
- remote-run some --command runs the command on the remote machine in the matching directory, forwarding the output back to you. It typically follows remote-sync.
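For concreteness, here is a minimal sketch of what such a pair could look like as a deno script (using the dax library discussed below). This is hypothetical: the REMOTE alias and the exact rsync flags are my assumptions, not Peter’s actual scripts.

import $ from "jsr:@david/dax";

const REMOTE = "my-linux-box"; // assumed ssh alias for the remote machine

// Materialize the current directory at the same path on the remote side.
async function remoteSync() {
  const cwd = Deno.cwd();
  await $`ssh ${REMOTE} mkdir -p ${cwd}`;
  await $`rsync -az --delete ${cwd}/ ${REMOTE}:${cwd}/`;
}

// Run a command on the remote machine, in the matching directory.
async function remoteRun(args: string[]) {
  const command = `cd ${Deno.cwd()} && ${args.join(" ")}`;
  await $`ssh -t ${REMOTE} ${command}`;
}

// Sync, then run.
await remoteSync();
if (Deno.args.length > 0) await remoteRun(Deno.args);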
So, when I want to test local changes to tigerbeetle on
my Linux box, I have roughly the following shell session:
$ cd ~/p/tb/work
$ code . # hack here
$ remote-sync
$ remote-run ./zig/zig build test
The killer feature is that shell completion works. I first type the command I want to run, taking advantage of the fact that local and remote commands are the same, paths and all, then hit ^A and prepend remote-run (in reality, I have an rr alias that combines sync & run).
The big thing here is not the commands per se, but the shift in the mental model. In a traditional ssh & vim setup, you have to juggle two machines with separate state, the local one and the remote one. With remote-sync, the state is the same across the machines; you only choose whether you want to run commands here or there.
With just two machines, the difference feels academic. But if you want to run your tests across six machines, the ssh approach fails — you don’t want to re-vim your changes to source files six times, you really do want to separate the place where the code is edited from the place(s) where the code is run. This is a general pattern — if you are not sure about a particular aspect of your design, try increasing the cardinality of the core abstraction from 1 to 2.
The third component, the dax library, is pretty mundane: just a JavaScript library for shell scripting. The notable aspects there are:
- JavaScript’s template literals, which allow implementing command interpolation in a safe-by-construction way. When processing $`ls ${paths}`, a string is never materialized; it is arrays all the way down to the exec syscall.
- JavaScript’s async/await, which makes managing concurrent processes (local or remote) natural:

  await Promise.all([
      $`sleep 5`,
      $`remote-run sleep 5`,
  ]);

- Additionally, deno specifically valiantly strives to impose process-level structured concurrency, ensuring that no processes spawned by the script outlive the script itself unless explicitly marked detached, a sore spot of UNIX.
Combining the three ideas, I now have a deno script, called box, that provides a multiplexed interface for running
ad-hoc code on ad-hoc clusters.
A session looks like this:
$ cd ~/p/tb/work
$ git status --short
M src/lsm/forest.zig
$ box create 3
108.129.172.206,52.214.229.222,3.251.67.25
$ box list
0 108.129.172.206
1 52.214.229.222
2 3.251.67.25
$ box sync 0,1,2
$ box run 0 pwd
/home/alpine/p/tb/work
$ box run 0 ls
CHANGELOG.md LICENSE README.md build.zig
docs/ src/ zig/
$ box run 0,1,2 ./zig/download.sh
Downloading Zig 0.14.1 release build...
Extracting zig-x86_64-linux-0.14.1.tar.xz...
Downloading completed (/home/alpine/p/tb/work/zig/zig)!
Enjoy!
$ box run 0,1,2 \
./zig/zig build -Drelease -Dgit-commit=$(git rev-parse HEAD)
$ box run 0,1,2 \
./zig-out/bin/tigerbeetle format \
--cluster=0 --replica=?? --replica-count=3 \
0_??.tigerbeetle
2026-01-20 19:30:15.947Z info(io): opening "0_0.tigerbeetle"...
$ box destroy 0,1,2
I like this! I haven’t used it in anger yet, but this is something I have wanted for a long time, and now I have it.
The problem with implementing the above is that I have zero practical experience with the modern cloud. I only created my AWS account today, and just looking at the console interface ignited the urge to re-read The Castle. Not my cup of pu-erh. But I had a hypothesis that AI should be good at wrangling baroque cloud APIs, and it mostly held.
I started with a couple of paragraphs of rough, super high-level description of what I want to get. Not a specification at all, just a general gesture towards unknown unknowns. Then I asked ChatGPT to expand those two paragraphs into a more or less complete spec to hand down to an agent for implementation.
This phase surfaced a bunch of unknowns for me. For example, I wasn’t thinking at all about the need to somehow identify machines. ChatGPT suggested using random hex numbers, and I realized that I needed the 0,1,2 naming scheme to concisely specify batches of machines. While thinking about this, I also realized that a sequential numbering scheme has the advantage that I can’t have two concurrent clusters running, which is a desirable property for my use-case: if I forgot to shut down a machine, I’d rather get an error when trying to re-create a machine with the same name than silently avoid the clash. Similarly, it turns out that the questions of permissions and network access rules are something to think about, as well as which region and which image I need.
With the spec document in hand, I turned to Claude Code for the actual implementation work. The first step was to further refine the spec, asking Claude if anything was unclear. There were a couple of interesting clarifications there.
First, the original ChatGPT spec didn’t get what I meant with my “current directory mapping” idea: that I want to materialize a local ~/p/tb/work as a remote ~/p/tb/work, even if the ~s are different. ChatGPT generated an incorrect description and an incorrect example. I manually corrected the example, but wasn’t able to write a concise and correct description. Claude fixed that, working from the example. I feel like I need to internalize this more: for the current crop of AI, examples seem to be far more valuable than rules.
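To make the idea concrete, here is a minimal sketch of the mapping I had in mind (the function name and signature are mine, purely for illustration):

// Strip the local home prefix and re-anchor the path at the remote home, so
// that ~/p/tb/work maps to ~/p/tb/work even when the two homes differ.
function remotePath(localPath: string, localHome: string, remoteHome: string): string {
  if (!localPath.startsWith(localHome + "/")) throw new Error("path is not under the home directory");
  return remoteHome + localPath.slice(localHome.length);
}

// remotePath("/Users/matklad/p/tb/work", "/Users/matklad", "/home/alpine")
// returns "/home/alpine/p/tb/work"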
Second, the spec included my desire to auto-shutdown machines once I no longer use them, just to make sure I don’t forget to turn the lights off when leaving the room. Claude grilled me on what precisely I want there, and I asked it to DWIM the thing.
The spec ended up being 6KiB of English prose. The final implementation was 14KiB of TypeScript. I wasn’t keeping the spec and the implementation perfectly in sync, but I think they ended up pretty close in the end. Which means that prose specifications are somewhat more compact than code, but not much more compact.
My next step was to try to just one-shot this. Ok, this is embarrassing, and I usually avoid swearing in this blog, but I just typoed that as “one-shit”, and, well, that is one flavorful description I won’t be able to improve upon. The result was just not good (more on why later), so I almost immediately decided to throw it away and start a more incremental approach.
In my previous vibe-post, I noticed that LLMs are good at closing the loop. A variation here is that LLMs are good at producing results, and not necessarily good code. I am pretty sure that, if I had let the agent iterate on the initial script and actually run it against AWS, I would have gotten something working. I didn’t want to go that way, for three reasons:
- Spawning VMs takes time, and that significantly reduces the throughput of agentic iteration.
- No way was I letting the agent run with a real AWS account, given that AWS doesn’t have a fool-proof way to cap costs.
- I am fairly confident that this script will be a part of my workflow for at least several years, so I care more about long-term code maintenance than about the immediate result.
And, as I said, the code didn’t feel good, for these specific reasons:
- It wasn’t the code that I would have written; it lacked my character, which made it hard for me to understand at a glance.
- The code lacked any character whatsoever. It could have worked, and it wasn’t “naively bad”, like the first code you write when you are learning programming, but there wasn’t anything good there either.
- I never know what the code should be up-front. I don’t design solutions, I discover them in the process of refactoring. Some of my best work was spending a quiet weekend rewriting large subsystems implemented before me, because, with an implementation at hand, it was possible for me to see the actual, beautiful core of what needed to be done. With a slop-dump, I don’t even get to see what could be wrong.
- In particular, while you are working the code (as in “wrought iron”), you often go back to the requirements and change them. Remember the ambiguity of my request to “shut down idle cluster”? Claude tried to DWIM and created some horrific mess of bash scripts, timestamp files, PAM policy, and systemd units. But the right answer there was “let’s maybe not have that feature?” (in contrast, simply shutting the machine down after 8 hours is a one-liner; see the sketch below).
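A sketch of that one-liner, assuming the auto-shutdown is baked into the user data passed at instance creation (the userDataBase64 variable matches the run-instances snippet later in the post; the exact wiring is my assumption):

// Schedule a poweroff 480 minutes (8 hours) after boot via the instance user data.
const userData = "#!/bin/sh\nshutdown -P +480\n";
const userDataBase64 = btoa(userData);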
The incremental approach worked much better; Claude is good at filling in the blanks. The very first thing I did for box-v2 was manually typing in:
type CLI =
| CLICreate
| CLIDestroy
| CLIList
| CLISync
type BoxList = string[];
type CLICreate = { tag: "create"; count: number };
type CLIDestroy = { tag: "destroy"; boxes: BoxList };
type CLIList = { tag: "list" };
type CLISync = { tag: "sync"; boxes: BoxList; };
function fatal(message: string): never {
console.error(message);
Deno.exit(1);
}
function CLIParse(args: string[]): CLI {
}
Then I asked Claude to complete the CLIParse function, and I was happy with the result. Note the “Show, Don’t Tell” here: I am not asking Claude to avoid throwing an exception and to fail fast instead. I just give it the fatal function, and it code-completes the rest.
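For a sense of the shape, here is an illustrative sketch of such a completion (my reconstruction, not Claude’s verbatim output):

function CLIParse(args: string[]): CLI {
  const [command, ...rest] = args;
  switch (command) {
    case "create": {
      const count = Number(rest[0]);
      if (!Number.isInteger(count) || count < 1) fatal("create: expected a positive machine count");
      return { tag: "create", count };
    }
    case "destroy":
      if (rest.length != 1) fatal("destroy: expected a comma-separated box list");
      return { tag: "destroy", boxes: rest[0].split(",") };
    case "list":
      return { tag: "list" };
    case "sync":
      if (rest.length != 1) fatal("sync: expected a comma-separated box list");
      return { tag: "sync", boxes: rest[0].split(",") };
    default:
      fatal(`unknown command: ${command}`);
  }
}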
I can’t say that the code inside CLIParse is top-notch. I’d probably have written something more spartan. But the important part is that, at this level, I don’t care. The abstraction for parsing CLI arguments feels right to me, and the details I can always fix later. This is how the overall vibe-coding session transpired: I was providing structure, Claude was painting by the numbers.
In particular, with that CLI parsing structure in place, Claude had little problem adding new subcommands and new arguments in a satisfactory way. The only snag was that, when I asked it to add an optional path to sync, it went with string | null, while I strongly prefer string | undefined. Obviously, it’s better to pick your null in JavaScript and stick with it, and the fact that undefined is unavoidable predetermines the winner. Given that the argument was added as an incremental small change, course-correcting was trivial.
The null vs undefined issue perhaps illustrates my complaint about the code lacking character. | null is the default non-choice. | undefined is an insight, which I personally learned from the VS Code LSP implementation.
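A small example of why undefined wins, with a hypothetical variant of the sync type:

type CLISyncV2 = { tag: "sync"; boxes: BoxList; path?: string };

// A missing property already reads as undefined, so there is exactly one
// "no value" case to handle. Had we picked string | null, downstream code
// would sooner or later have to check for both null and undefined.
const sync: CLISyncV2 = { tag: "sync", boxes: ["0"] };
if (sync.path === undefined) console.log("syncing the current directory");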
The hand-written skeleton / vibe-coded guts approach worked not only for the CLI. I wrote
async function main() {
const cli = CLIParse(Deno.args);
if (cli.tag === "create") return await mainCreate(cli.count);
if (cli.tag === "destroy") return await mainDestroy(cli.boxes);
...
}
async function mainDestroy(boxes: string[]) {
for (const box of boxes) {
await instanceDestroy(box);
}
}
async function instanceDestroy(id: string) {
}
and then asked Claude to write the body of a particular function according to the SPEC.md.
Unlike with the CLI, Claude wasn’t able to follow this pattern by itself. With one example it’s not obvious, but the overall structure is that instanceXXX is the AWS-level operation on a single box, and mainXXX is the CLI-level control flow that deals with looping and parallelism. When I asked Claude to implement box run without myself doing the main / instance split, Claude failed to notice the pattern and needed a course correction.
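For illustration, the split for box run could look roughly like this (a sketch; the instanceAddress helper and the ssh details are my assumptions):

// CLI-level control flow: fan out across the requested boxes in parallel.
async function mainRun(boxes: string[], args: string[]) {
  await Promise.all(boxes.map((box) => instanceRun(box, args)));
}

// AWS-level operation on a single box: resolve the box name to an IP,
// then run the command in the matching directory over ssh.
async function instanceRun(box: string, args: string[]) {
  const ip = await instanceAddress(box); // hypothetical helper that queries AWS
  const command = `cd ${Deno.cwd()} && ${args.join(" ")}`;
  await $`ssh alpine@${ip} ${command}`;
}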
However, Claude was massively successful with the actual logic. It would have taken me hours to acquire the specific, non-reusable knowledge needed to write:
const instanceMarketOptions = JSON.stringify({
MarketType: "spot",
SpotOptions: { InstanceInterruptionBehavior: "terminate" },
});
const tagSpecifications = JSON.stringify([
{ ResourceType: "instance", Tags: [{ Key: moniker, Value: id }] },
]);
const result = await $`aws ec2 run-instances \
--image-id ${image} \
--instance-type ${instanceType} \
--key-name ${moniker} \
--security-groups ${moniker} \
--instance-market-options ${instanceMarketOptions} \
--user-data ${userDataBase64} \
--tag-specifications ${tagSpecifications} \
--output json`.json();
const instanceId = result.Instances[0].InstanceId;
await $`aws ec2 wait instance-status-ok --instance-ids ${instanceId}`;
I want to be careful — I can’t vouch for correctness and especially completeness of the above snippet. However, given that the nature of the problem is such that I can just run the code and see the result, I am fine with it. If I were writing this myself, trial-and-error would totally be my approach as well.
Then there’s synthesis: with several instance commands implemented, I noticed that many started with querying AWS to resolve a symbolic machine name, like “1”, to the AWS name/IP. At that point I realized that resolving symbolic names is a fundamental part of the problem, and that it should only happen once, which resulted in the following refactored shape of the code:
async function main() {
const cli = CLIParse(Deno.args);
const instances = await instanceMap();
if (cli.tag === "create") return await mainCreate(instances, cli.count);
if (cli.tag === "destroy") return await mainDestroy(instances, cli.boxes);
...
}
Claude was ok with extracting the logic, but messed up the overall code layout, so the final code motions were on me: “context” arguments go first, not last, and a common prefix is more valuable than a common suffix because of visual alignment.
The original “one-shotted” implementation also didn’t do up-front querying. This is an example of a shape of a problem I only discover when working with code closely.
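For concreteness, a hypothetical sketch of what instanceMap could look like, assuming boxes are tagged with the moniker at creation, as in the run-instances snippet above:

// Resolve all symbolic box names ("0", "1", ...) to public IPs in one query.
async function instanceMap(): Promise<Map<string, string>> {
  const result = await $`aws ec2 describe-instances \
    --filters Name=tag-key,Values=${moniker} Name=instance-state-name,Values=running \
    --output json`.json();
  const instances = new Map<string, string>();
  for (const reservation of result.Reservations) {
    for (const instance of reservation.Instances) {
      const tag = instance.Tags.find((tag: { Key: string }) => tag.Key === moniker);
      if (tag !== undefined) instances.set(tag.Value, instance.PublicIpAddress);
    }
  }
  return instances;
}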
Of course, the script didn’t work perfectly the first time, and we needed quite a few iterations on the real machines, both to fix coding bugs and to close gaps in the spec. That was an interesting experience of speed-running rookie mistakes. Claude made naive bugs, but was also good at fixing them.
For example, when I first tried to box ssh after box create, I got an error. Pasting it into Claude immediately showed the problem. Originally, the code was doing aws ec2 wait instance-running and not aws ec2 wait instance-status-ok. The former checks that the instance is logically created, while the latter waits until the OS is booted. It makes sense that these two exist, and the difference is clear (and it’s also clear that OS booted != SSH daemon started). Claude’s value here is in providing specific names for the concepts I already know to exist.
Another fun one was about the disk. I noticed that, while the instance had an SSD, it wasn’t actually used. I asked Claude to mount it as home, but that didn’t work. Claude immediately asked me to run

$ box run 0 cat /var/some/unintuitive/long/path.log

and that log immediately showed the problem. This is remarkable! 50% of my typical Linux debugging day is wasted not knowing that a useful log exists, and the other 50% is spent searching for the log I know should exist somewhere.
After the fix, I lost the ability to SSH. Pasting the error immediately gave the answer: by mounting over /home, we were overwriting the ssh keys configured earlier.
There were a couple more iterations like that. Rookie mistakes were made, but they were debugged and fixed much faster than my personal knowledge allows (and again, I feel that this is trivia knowledge, rather than deep reusable knowledge, so I am happy to delegate it!).
It worked satisfactorily in the end, and, what’s more, I am happy to maintain the code, at least to the extent that I personally need it. It is kinda hard to measure the productivity boost here, but, given just the sheer number of CLI flags required to make this work, I am pretty confident that time was saved, even factoring in the writing of the present article!
I’ve recently read The Art of Doing Science and Engineering by Hamming (of distance and code), and one story stuck with me:
A psychologist friend at Bell Telephone Laboratories once built a machine with about 12 switches and a red and a green light. You set the switches, pushed a button, and either you got a red or a green light. After the first person tried it 20 times they wrote a theory of how to make the green light come on. The theory was given to the next victim and they had their 20 tries and wrote their theory, and so on endlessly. The stated purpose of the test was to study how theories evolved.
But my friend, being the kind of person he was, had connected the lights to a random source! One day he observed to me that no person in all the tests (and they were all high-class Bell Telephone Laboratories scientists) ever said there was no message. I promptly observed to him that not one of them was either a statistician or an information theorist, the two classes of people who are intimately familiar with randomness. A check revealed I was right!