What CI looks like at a 100-person team (PostHog)

Original link: https://www.mendral.com/blog/ci-at-scale

## Mendral: AI for CI at scale

PostHog's CI system is massive: nearly 576,000 jobs, 1.18 billion log lines, and 33 million test executions per week, reflecting a highly productive, fast-moving engineering team. Yet even a 99.98% pass rate generates substantial noise and wastes engineering time on flaky tests and investigations. To address this, PostHog partnered with the creators of Mendral, an AI agent designed to diagnose CI failures, quarantine flaky tests, and automatically propose fixes via pull requests. Mendral builds on lessons learned scaling Docker's CI a decade ago and targets the growing complexity of modern CI/CD pipelines, especially with the rise of AI-assisted coding.

Mendral works by ingesting and analyzing vast amounts of log data, tracing flakiness to its root cause, and proactively notifying the relevant engineers via Slack. Key lessons include the importance of fast log ingestion, the deterministic nature of most flaky tests, and the value of smart failure routing. The team stresses that the challenge is not shrinking: AI coding tools are *increasing* code velocity and CI load. Mendral aims to help teams stay fast and productive as volume grows, and is currently available in early access.

## PostHog's CI challenges (Hacker News discussion)

An article about PostHog's CI/CD process at 100-person scale sparked discussion on Hacker News. The core issue is not just speed but *reliability* when managing 22,477 tests, 65 daily commits to main, and 98 engineers. Many commenters questioned the value of presenting a pile of statistics without clear conclusions, arguing the post mainly promotes Mendral, the tool its authors are building. Others raised concerns about the monorepo becoming a bottleneck, and whether splitting repositories merely relocates the problem rather than solving it.

Test flakiness was central to the discussion: some insisted a 100% pass rate is mandatory, while others conceded it is unrealistic in complex systems. A recurring point was that flaky tests should be fixed, not merely quarantined. Several users criticized the article's writing style as "AI slop" and questioned the growing reliance on AI-generated code and tests, voicing concerns about human oversight and subtle bugs. Ultimately, the conversation highlighted the challenge of scaling CI/CD alongside a fast-growing codebase.

## Original article

Last week, PostHog's CI ran 575,894 jobs, processed 1.18 billion log lines, and executed 33 million tests. At that volume, even a 99.98% pass rate generates meaningful noise. We're building Mendral, an AI agent that diagnoses CI failures, quarantines flaky tests, and opens PRs with fixes. Here's what we've learned running it on one of the largest public monorepos we've seen.

But first, some context on why we're building this. Back in 2013, I was managing Docker's engineering team of 100 people, and my co-founder Andrea was the team's tech lead and lead architect. We scaled Docker's CI from a handful of builds to something that ran constantly across a massive contributor base. Even back then, flaky tests and large PRs that were painful to review were our biggest headaches. We spent a disproportionate amount of time debugging CI failures that had nothing to do with the code being shipped. That was over a decade ago, with no AI coding tools generating PRs at scale. The problem has only gotten worse since then. Mendral is the tool we wished we had at Docker.

PostHog's CI in one week

PostHog is a ~100-person, fully remote team, all pushing constantly to a large public monorepo. They ship fast, and their CI infrastructure reflects that. Here's what one week looks like (Jan 27 - Feb 2):

  • 575,894 CI jobs across 94,574 workflow runs
  • 1.18 billion log lines (76.6 GiB)
  • 33.4 million test executions across 22,477 unique tests
  • 3.6 years of compute time in a single week
  • 65 commits merged to main per day, 105 PRs tested per day
  • 98 human contributors active in one week
  • Every commit to main triggers an average of 221 parallel jobs

PostHog commit duration trend over one week

On their busiest day (a Tuesday), they burned through 300 days of compute in 24 hours. These are not the numbers of a team with a CI problem. These are the numbers of a team that moves extremely fast and takes testing seriously.

The physics of scale

At this velocity, flaky tests become an inevitable force of nature. PostHog's test pass rate is 99.98%, which is genuinely excellent across 22,477 tests. But at 33 million weekly test executions, even a tiny failure rate produces a meaningful amount of noise. That's just math.
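The math is easy to check. A minimal sketch using the article's own numbers (assuming, for simplicity, that failures are independent events, which real flakes aren't):

```python
# Back-of-the-envelope scale math from the article's figures.
WEEKLY_TEST_EXECUTIONS = 33_400_000
PASS_RATE = 0.9998

failing_executions = WEEKLY_TEST_EXECUTIONS * (1 - PASS_RATE)
print(f"Expected failing executions per week: {failing_executions:,.0f}")
# A 0.02% failure rate still means thousands of red results every week.
```

Even at four nines of reliability per execution, the sheer volume guarantees a steady stream of failures for someone to look at.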

About 14% of their total compute goes to failures and cancellations, and roughly 3.5% of all jobs are re-runs. This isn't a PostHog-specific issue. Any team operating at this pace, with this test coverage, will hit the same dynamics. The question is how you deal with it.

Most teams just live with it. Engineers learn which tests are flaky, they re-run, they move on. It works up to a point. But when you have 98 active contributors and 221 parallel jobs per commit, the overhead of investigating and re-running adds up quietly. The PostHog team recognized this early and decided to be proactive about it, which is how we started working together.

What actually happens with flaky tests at scale

A test that passes 95% of the time sounds mostly fine. But when your CI runs 221 jobs per commit and you're merging 65 commits per day to main, that test is failing multiple times daily. It's not the individual failure that hurts. It's the investigation. Someone sees red CI, drops what they're doing, opens the logs, tries to figure out if their change broke something or if it's a known flake. Then they re-run. Maybe it passes. Maybe it doesn't. Maybe they ping someone on Slack. Maybe three people are looking at the same failure independently.
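A quick sketch of why "95% reliable" is worse than it sounds (assuming, conservatively, the test runs just once per merged commit; with 221 jobs per commit the real exposure is higher):

```python
# How often a single 95%-reliable test goes red at 65 commits/day to main.
pass_rate = 0.95
commits_per_day = 65

expected_failures = commits_per_day * (1 - pass_rate)   # expected red runs per day
p_at_least_one = 1 - pass_rate ** commits_per_day       # chance of >= 1 red run today

print(f"Expected failures per day: {expected_failures:.2f}")
print(f"P(at least one red run today): {p_at_least_one:.1%}")
```

That one test produces roughly three red runs a day, and the odds of a completely green day are under 4%.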

At 10 engineers, a flaky test is annoying. At 100, it's a tax on everyone's productivity.

How Mendral works at PostHog

Our agent is a GitHub App. You install it on your repo (takes about 5 minutes), and it starts watching every commit, every CI run, every log output. Here's what it does concretely:

Ingesting logs at scale. We built a log ingestion system that processes PostHog's billion-plus weekly log lines so the agent can search and correlate failures quickly. Without this, you can't do meaningful diagnosis. You need to be able to look at a failure, pull the relevant logs, and cross-reference with other recent failures to determine if it's a flake or a real regression.
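The cross-referencing step can be illustrated with a toy inverted index (an illustration of the idea only, not Mendral's actual ingestion pipeline; the error-signature extraction here is deliberately crude):

```python
from collections import defaultdict

class LogIndex:
    """Index failure lines by error signature so lookups across runs are cheap."""

    def __init__(self):
        # error signature -> list of (run_id, full log line)
        self.by_signature = defaultdict(list)

    def ingest(self, run_id, lines):
        for line in lines:
            if "ERROR" in line or "FAIL" in line:
                # Crude signature: everything before the first colon.
                signature = line.split(":", 1)[0]
                self.by_signature[signature].append((run_id, line))

    def runs_with(self, signature):
        """All CI runs that hit this signature -- the cross-referencing step."""
        return {run_id for run_id, _ in self.by_signature[signature]}

idx = LogIndex()
idx.ingest("run-1", ["FAIL test_billing: timeout after 30s"])
idx.ingest("run-2", ["FAIL test_billing: timeout after 31s"])
print(idx.runs_with("FAIL test_billing"))  # same signature in two unrelated runs
```

If the same signature shows up across runs of unrelated commits, a flake is far more likely than a regression; if it only appears after one commit, the commit is the suspect.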

PostHog 7-day CI log volume

Detecting and tracing flakes. The agent correlates errors with code changes. When it sees a test failing intermittently, it traces the flake back to its origin: which commit introduced it, which infrastructure condition triggers it (sometimes it's infra slowness or side effects, not code). This is the part that takes a human the longest, and where the agent adds the most value.
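The core of the tracing step can be sketched as a search over a test's pass/fail history (a simplified model; the real agent also weighs infrastructure conditions and re-run outcomes):

```python
# Hypothetical sketch: given a test's chronological results per commit,
# find the first commit at which intermittent failures begin.
def first_flaky_commit(history):
    """history: list of (commit_sha, passed) in chronological order."""
    for sha, passed in history:
        if not passed:
            return sha  # first observed failure; candidate origin commit
    return None  # no failures on record

history = [
    ("a1b2c3", True),
    ("d4e5f6", True),
    ("0a1b2c", False),  # flake first appears here
    ("3d4e5f", True),   # intermittent: passes again afterwards
    ("6a7b8c", False),
]
print(first_flaky_commit(history))  # "0a1b2c"
```

The pattern of passes *after* the first failure is what distinguishes a flake (intermittent) from a regression (consistently red from the origin commit onward).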

Opening PRs with fixes. When the agent identifies a flaky test and has enough confidence in the diagnosis, it opens a pull request to quarantine or fix it. Because PostHog's repo is public, these PRs are visible to anyone. You can go look at them right now. The agent iterates on the PR based on review comments, just like a human would.

Mendral PR on PostHog's public repo: sharding Playwright E2E tests into parallel jobs

Acting as a team member on Slack. The agent joins Slack and behaves like a team member. When there's a CI failure, it doesn't broadcast to a general channel. Every member of the PostHog team linked their account to the Mendral dashboard, so the agent knows who to involve. If your commit likely caused the failure, you get a direct message. If it's a known flake, the agent handles it without interrupting anyone. This was something the PostHog team pushed us to build better, and it's become one of the most impactful features.
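The routing logic described above boils down to a small decision function (function names and channel are illustrative, not Mendral's API):

```python
# Hedged sketch of failure routing: DM the likely author, silence known flakes.
KNOWN_FLAKES = {"test_billing_timeout"}

def route_failure(test_name, commit_author, touched_files, failing_file):
    if test_name in KNOWN_FLAKES:
        return None                    # handled by the agent, nobody interrupted
    if failing_file in touched_files:
        return ("dm", commit_author)   # direct message the likely culprit
    return ("channel", "#ci-triage")   # hypothetical fallback triage channel

print(route_failure("test_new_feature", "alice", {"billing.py"}, "billing.py"))
# ("dm", "alice")
```

The payoff is in the `None` branch: every known flake the agent absorbs is an interruption that never happens.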

Continuous analysis. The agent doesn't just react to failures. It analyzes all commits, all CI runs, and all logs continuously. It builds a picture of the repo's health over time, identifies patterns, and surfaces insights proactively.

Four things that surprised us building a CI agent at scale

A few things that surprised us:

Log ingestion is the hard problem. Everyone focuses on the AI/diagnosis part, but the bottleneck is actually getting billions of log lines indexed and searchable fast enough to be useful in real time. If your agent can't search logs in seconds, it can't diagnose anything before the engineer has already context-switched to investigate manually.

Flaky tests are rarely random. Almost every "flaky" test has a deterministic root cause, just one that's hard to find. Timing dependencies, shared state between tests, infrastructure variance, order-dependent execution. The agent is good at this because it can correlate across hundreds of CI runs simultaneously, something a human wouldn't have the patience to do.

The routing problem is underappreciated. Knowing who to notify about a failure is almost as valuable as knowing what failed. At PostHog's scale, a failure notification going to a general channel means 98 people glance at it and 97 ignore it. Having the agent figure out who actually needs to look at it, based on the code change and the failure signature, removes a lot of noise.

Working on a public repo keeps you honest. Every PR our agent opens on PostHog's repo is visible to anyone. This has been great for us because it forces total transparency. You can see exactly what the agent is doing, how it reasons about failures, and what fixes it proposes.

The bigger picture

We think the CI challenge is going to grow for most teams, not shrink. AI coding tools (Cursor, Copilot, Claude Code) are increasing the volume of code changes. More changes means more CI runs, more potential failures, more flaky tests surfacing. The delivery pipeline becomes the bottleneck.

PostHog is a good example of what a well-run, fast-moving team looks like at scale. They have 22,477 tests with a 99.98% pass rate, ship 65 commits to main daily, and keep 98 engineers productive on a single monorepo. That's impressive engineering. Our job is to help teams like theirs stay fast as the volume keeps growing.

We got lucky onboarding PostHog early in our YC batch. Their scale pushed the limits of our agent fast, in ways we couldn't have simulated ourselves. It forced us to solve real problems at real volume from day one. A big thank you to Tim for his trust and support, and to the whole PostHog team for working with us in the open on a public repo and giving us the feedback that's shaped a lot of what Mendral does today. Having spent a decade building and scaling CI systems ourselves, starting at Docker in 2013 when the problem was already hard, it's been exciting to finally build the agent that automates the work we used to do manually.


We're Sam and Andrea. I was the first hire at Docker and later VP of Engineering; Andrea wrote Docker's first commit. If your CI looks anything like this, we'd love to look at your numbers.

Request early access.
