到目前为止，10 万个并发沙箱教会了我们什么

到目前为止，10 万个并发沙箱教会了我们什么
What 100k concurrent sandboxes has taught us so far

原始链接: https://www.computesdk.com/blog/scale-invitational-update/

旨在测试各提供商如何处理 10 万个并发沙盒的“Scale Invitational”挑战赛已重新安排至 6 月 17 日举行。团队意识到，最初的测试尝试未能提供有意义的数据，这表明在如此规模下，测试基础设施本身也成为了关键变量。早期的测试受到了研究人员自身硬件的限制（网络/IP 层面的瓶颈），且错误地将“吞吐量”（按顺序创建沙盒）而非“真实并发”（同时维持 10 万个沙盒）作为衡量标准。为了确保准确性，团队将测试重新架构为由约 1000 个分片组成的分布式系统。他们不再仅测量简单的创建速度，而是转为追踪沙盒的完整生命周期，从而识别出先前方法会遗漏的持续容量故障。此外，他们还部署了强大的日志聚合和数据管道，以处理海量的遥测数据流，防止数据丢失。这些修正反映了团队对“诚实”基准测试的承诺。通过改进测试架构以将提供商性能与编排干扰分离开来，他们旨在提供一个权威且透明的视角，展示提供商如何处理极端且同步的负载。

Hacker News | 最新 | 过往 | 评论 | 提问 | 展示 | 招聘 | 提交 | 登录目前 10 万个并发沙箱教会了我们什么 (computesdk.com) 5 点积分，heygarrison 发布于 2 小时前 | 隐藏 | 过往 | 收藏 | 1 条评论 help greatgib 6 分钟前 [–] 也许是我没看懂背景，又或者这只是一个内容空洞的垃圾自动生成页面。就像那种为了显得公司技术很厉害，而在营销中堆砌各种流行词汇，但实际上根本不知道自己在说什么的页面…… 回复指南 | 常见问题 | 列表 | API | 安全 | 法律 | 申请 YC | 联系搜索：

原文

tl;dr We have moved the Scale Invitational to June 17th after underestimating the complexity of orchestrating and measuring 100,000 sandboxes concurrently.

The Scale Invitational was meant to answer a simple question: “when an app needs to spin up tens of thousands of sandboxes at once, how do providers hold up?”

Our daily benchmark already measures cold-start time, staggered ramp, and small-scale burst behavior across providers. But “small-scale” is the operative word — a few hundred sandboxes from a single runner tells you almost nothing about what happens at scale. So we decided to run a 100,000-sandbox test.

What we did not expect was how much we’d learn before ever running the test.

This post is a candid look at the problems we ran into, the course-corrections, and why we’re deliberately taking our time before publishing a single 100k number.

v1: 10,000 iterations from one very busy VM

The first version was the obvious one. Take the benchmark we already trust, crank up the iteration count, and point it at a single beefy VM. We pushed it up toward 10,000 sandbox creations from one machine and watched it work.

It worked — right up until it didn’t tell us anything useful.

A single VM has a single network stack, a single event loop, and a single egress IP. Long before you reach interesting provider behavior, you start measuring your own machine’s limits. The numbers we got back were as much a portrait of our test rig as of any provider.

That was lesson one: at scale, the test harness becomes part of the experiment. If you can’t rule out your own bottlenecks, you can’t publish.

Spreading the load (aka sharding)

When we shared early v1 results, the consistent feedback was: you’re concentrating load in a way no real workload would, and you’re capping out your own box. A genuine high-concurrency workload is distributed.

So we re-architected around sharding. Instead of one VM doing 10k, a logical burst is split across many VMs, each running a slice of the total. The coordinator on each VM now carries shard metadata so that independently-launched machines know they’re part of one larger run and their results can be stitched back together afterward.

This removed the single-machine ceiling and, just as importantly, made the test look more like the thing we’re actually trying to measure.

Finding the sweet spot: ~100 iterations per shard

Sharding raised an immediate tuning question: how much work per VM?

Too much per shard and you’re back to v1’s problem — each VM becomes its own bottleneck and you’re measuring the rig again. Too little and you’re paying to provision, boot, and coordinate an enormous fleet of VMs for a trivial slice of work each, which is wasteful and adds its own orchestration noise.

After iterating, we landed on roughly 100 iterations per shard/VM. That keeps each individual machine comfortably below its own resource limits — so the numbers reflect the provider, not our hardware — while keeping the fleet size manageable. 100k sandboxes becomes ~1,000 well-behaved shards rather than one machine on fire.

The big one: we weren’t actually testing concurrency

This is the realization that reshaped the whole project.

Our burst test created N sandboxes “at once” and measured how fast each came up. But here’s the subtle flaw: a sandbox that was created and then immediately torn down isn’t concurrent with anything. If you create #1, tear it down, create #2, tear it down, and so on quickly, you’ve measured burst throughput — how fast you can churn through creations — but you have not measured whether a provider can hold 100,000 sandboxes alive at the same time.

Those are completely different questions. “Can you create 100k quickly?” and “Can you sustain 100k live simultaneously?” stress entirely different parts of a provider’s infrastructure — scheduling and admission vs. sustained capacity, memory pressure, and reclamation.

We had been reporting implied concurrency. We wanted true concurrency.

Pursuing true concurrency: keep the sandboxes alive

Measuring true concurrency meant a fundamental change: sandboxes have to stay alive until the whole run reaches peak, then be torn down together. Every shard has to hold its sandboxes open and coordinate so that, at some moment, all 100k are simultaneously live.

That changed what a “result” even means. It more than just “how fast was the sandbox created?” It now has a richer lifecycle, which our result model now captures explicitly:

success — created, became interactive, and was still alive when we performed the coordinated teardown at the end of the test.
partial — created and became interactive, but died somewhere between creation and the coordinated destroy. (This is the category that only a true concurrency test can even detect.)
readiness failed — created, but never became usable.
failed — the creation call itself errored.

That “partial” bucket is the whole point. Under implied-concurrency testing, those sandboxes would have counted as wins — created successfully, torn down, move on. Under true concurrency, we can see which providers create well but struggle to sustain. To make this airtight we also record per-shard liveness over time, so we can reconstruct the real concurrency curve — how many sandboxes were actually alive at each instant — rather than trusting that “created” meant “still running.”

Logs everywhere: getting visibility across a thousand shards

A single VM is easy to debug — you SSH in and tail a log. A thousand shards spread across a fleet is a different animal. When something goes sideways at shard 734, you need to know, and you can’t be SSH-ing into a thousand machines.

So we built log aggregation into the coordinator itself. Each shard captures its own coordinator output and ships it durably to object storage (Tigris) alongside its results, rather than relying on the ephemeral VM’s local disk or the orchestration layer’s log plumbing. The logs survive the VM, and they survive the orchestration layer flaking — which, at this scale, it will.

This matters more than it sounds. Without per-shard logs you can see that a run produced fewer successes than expected, but not why. With them, a post-mortem across the whole fleet becomes possible.

Ingesting a firehose: the data pipeline

100k sandboxes, each with a full lifecycle record, multiplied by repeated runs across multiple providers, is a lot of data — and it arrives from a thousand shards roughly at once.

We landed on a two-store design of Tigris for cold storage and an Clickhouse for analytics.

The principle: never lose raw data, never hold the full result set in memory, and make sure a failure anywhere — a VM, a coordinator, the orchestration layer — costs you seconds of in-flight data at most, not the whole run. If the structured store ever needs reshaping, we can rebuild it from the raw record in Tigris.

Why we’re pushing to June 17th

Every step so far has changed what the benchmark measures. v1 measured our own VM. Implied concurrency measured throughput, not sustained load. Each correction made the test more honest — and revealed the next thing we hadn’t gotten right. We’ve learned something at every juncture, and we fully expect to learn more.

Look out for the final results on the 17th (fingers crossed).