Ten years of deploying to production

Original link: https://brandonvin.github.io/2026/03/04/ten-years-of-deploying-to-production.html

In 2018, the author worked at a company with a traditional, siloed approach to software deployment. A separate "ops" team owned production and deployed code only once every two weeks - a major bottleneck for fixes to problems found by the data science team, which built machine learning models. Getting a fix deployed often came down to the ops team's availability and luck.

The author's challenge: a model was misbehaving in production and needed updates that only the ops team could apply. The existing process was heavily manual, with no version control or code review. To address this, the author launched a "DevOps" initiative, partnering with the engineering and ops teams to build an internal PyPI repository and automate deployment with Chef. This included a repeatable, version-tagged deployment process and basic code review.

The solution resolved the customer's issues and highlighted a fundamental shift: from an operations team focused on *protecting* production to a modern platform-engineering approach that prioritizes *accelerating* development and making production resilient, with an emphasis on developer experience and fast iteration.


Original article

Back in 2018, at the company where I worked, there was an operations team. “Ops”, we called them. In that decade, the company was behind the curve, but not far from typical. We were just starting to think about AWS; at the tail end of my time there, we began adopting it for some internal-only systems. But from what I’ve heard from friends who worked at more mature companies, it wasn’t uncommon in that era to have an operations team that owned production.

Funny thing: the ops team literally sat in a corner of the office, in their own room. That’s where ops is, in that little room. It sounds like a meme.

The ops team had a nice tool to spin up a VM inside the company’s infrastructure. I appreciated that – my whole team used it all the time. I needed to train recurrent neural networks using GPUs and 20+ gigabytes of RAM. No way that was going to run on my laptop, so this workflow was invaluable to my work.

Here’s the big catch: production deployments happened once every two weeks. Full stop.

If something went wrong, the deployment had to wait another two weeks. Unless you were lucky: if the ops engineer on rotation that week was particularly nice, and not dealing with evening plans, and if you were online to respond to their questions, you could push through and fix that random error that only happens in production.

From time to time, I would wander into the ops corner and chat with people about strange issues my team saw in the production database, in our latest attempt to deploy to production, and so on.

The production deployment challenge

My team was fundamentally a data science team. We were training ML models, building and running data pipelines to collect training data and train models on the latest data. All Python code. That’s all fine.

There was a big problem: the models in production were misbehaving, and customers were noticing:

Your API returned this classifier result. That makes no sense. Why?

It would get sent to our team of analysts, and eventually to my team to figure out. After a long back-and-forth, we’d conclude there was probably a deficiency in the training data, the model, or the business logic wrapping the model.

Okay, how do we fix this? We need to fix the model, or update the business logic wrapping the model. And then what? We can test it and deploy it to production. Okay, but we can’t deploy to production because only engineers and ops can do that. What? How do they … deploy to production? Go talk to them.

This became my problem to solve, perhaps heroically.

To give you a picture of where we were at, reviewing PRs was… simply not a thing. We worked out of GitHub repositories that were largely just snapshots of the current code. That’s great - it’s a backup of our code. In the ideal case, we’d push to master, ssh to an internal VM, pull the code, and run it. In the normal case, we’d ssh to the VM and edit things there, rerun the models, and copy the code from the VM back into the GitHub repository if we remembered.

How do we build a new model and deploy it to production? Who knows. Maybe other people’s problem, but that wasn’t a term I knew back then.

The production deployment solution

This post isn’t about my heroism, so to keep it brief, I leaned hard into what I understood as “DevOps”:

  • Went and talked to the engineering teams and the ops team. Figured out how everything fits together.
  • Learned what the heck Chef is.
  • Wrote and deployed an internal PyPI repository that used git tags as versions and resolved dependencies from our internal GitHub repositories, with the support of a partner on the ops team. Not a ton of code - maybe 100 lines of Python - but deeply embedded and tough to test and deploy on my own.
  • Established a pattern of not just pushing to master, but also tagging versions to release, and sometimes reviewing code in PRs before merging to master.
  • Created a Chef recipe template for Python apps.
  • Created a Chef recipe for our Python app.
  • Deployed the thing to production, and fixed the customer’s concerns!
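The "git tags as versions" idea above has one subtle requirement: tags have to be compared numerically, not as strings, or `v1.9.3` will sort above `v1.10.1`. Here is a minimal sketch of that piece, assuming tags follow a `v<major>.<minor>.<patch>` convention; `parse_tag` and `latest_release` are hypothetical names for illustration, not from the original code.

```python
import re

def parse_tag(tag):
    """Parse a git tag like 'v1.2.3' into a comparable (1, 2, 3) tuple.

    Returns None for tags that don't look like release versions.
    """
    m = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", tag)
    return tuple(int(part) for part in m.groups()) if m else None

def latest_release(tags):
    """Pick the highest version tag, skipping non-version tags entirely."""
    versions = [(parse_tag(t), t) for t in tags]
    versions = [(v, t) for v, t in versions if v is not None]
    return max(versions)[1] if versions else None

tags = ["v1.2.0", "v1.10.1", "v1.9.3", "experiment-snapshot"]
print(latest_release(tags))  # -> v1.10.1 (a naive string sort would pick v1.9.3)
```

A real internal index would then map each resolved tag to a tarball URL in the internal GitHub instance, but the numeric comparison is the part that naive implementations get wrong.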

Since then

Since then, I’ve spent most of my time on the other side of the table, so to speak. Not to mention at a completely different company with different values and a more flexible organization.

I’ve reflected a bit on what makes 2018 different from 2026.

2018: Production Operations Team

We are the production operations team. Our mission is to protect production.

If a developer wants a change in production, it should be possible, but we need to cover our butts. There’s going to be a handoff from the developers to us. Make a full paper trail with tickets.

If production is too hard to change for a regular developer - well, that sucks, but it kind of works in our favor. Any deviation from what’s in that ticket is - by default - a liability for us. Nobody has time for debugging a random issue that happens “only in production”. We have lives outside of work!

From time to time, a hero might push through the friction and make a self-service path. But again, that’s a hero, and heroes are few and far between. They also might quickly move on to another company where change is easier.

2026: Platform Engineering Team

Our mission is to accelerate development and make production resilient.

Developer experience has to be smooth. CI/CD has to be quick. If a developer is ever waiting for CI or CD, treat that as a mini-incident.

When a problem is discovered in production, it should be obvious what the problem was, and the internal signals for developers should make it easy to diagnose and fix.

It still feels early to be prescriptive on the how and what, so it’s worth another post in the future.
