Nobody Got Fired for Uber's $8M Ledger Mistake?

Original link: https://news.alvaroduran.com/p/nobody-got-fired-for-ubers-8-million

## Uber's Ledger Lessons: A Series of Costly Rewrites

Uber completely rebuilt its ledger system five times within a decade, often because of flawed incentives that rewarded impressive projects over cost-effective solutions. A particularly expensive example is the 2017 migration to DynamoDB. While DynamoDB excels at high-throughput payments, its consumption-based pricing model was disastrous for a ledger that requires heavy read/write volume, ultimately costing Uber roughly $8 million.

The company was forced to limit how much data it kept in DynamoDB, and built an in-house solution, LSG, on top of DocStore. That meant further development work, including a new streaming framework, even though viable alternatives existed.

The author argues the whole episode exposes a critical flaw: a failure to prioritize cost optimization. Despite the obvious problems, Uber presented the DynamoDB implementation as a success story at AWS re:Invent, a narrative perpetuated by publications such as ByteByteGo.

The core takeaway? Technology alone is not enough. Engineers must weigh economic impact alongside technical requirements, and incentive structures should reward pragmatic solutions, not merely complex ones. The case is a cautionary tale for payments engineers, underscoring the value of napkin math and a holistic view of system costs.


Original Article

Uber has rewritten its ledger systems five times in the last ten years. And at least one of those rewrites, if not all, could have been avoided.

That’s because the root of each generation of money software at Uber was driven from bad incentives. Each started with a brand new proposal, approved as the definitive solution; in time, a fatal flaw was surfaced; and finally, a new proposal came along to replace it.

Every rewrite was someone’s promotion project.

At least one of them could’ve been avoided: the one where Uber moved to DynamoDB. In 2017, Uber launched their new payment platform on it, and the critical factor that everyone involved seemed to miss was that DynamoDB is a consumption-priced database.

You pay for every read, and every write.

With each trip generating multiple ledger entries, and Uber as a whole processing 15 million trips per day, it didn’t matter that DynamoDB was great because of high throughput at global scale. The proverbial bean counter should’ve stopped this madness from happening.

Within 2 years, the cost became prohibitive:

At Uber’s scale, DynamoDB became expensive. Hence, we started keeping only 12 weeks of data (i.e., hot data) in DynamoDB and started using Uber’s blobstore, TerraBlob, for older data (i.e., cold data). TerraBlob is similar to AWS S3. For a long-term solution, we wanted to use LSG.

Migrating a Trillion Entries of Uber’s Ledger Data from DynamoDB to LedgerStore

A redesign that gets replaced 2 years later is a catastrophe.

And yet, history remembers Uber’s ledger on top of DynamoDB as a masterpiece. As late as 2024, ByteByteGo has an article praising it.

And that’s what concerns me. Uber’s design was a failure, but nobody seems to remember it that way.

That ends today.

I’m Alvaro Duran, and this is The Payments Engineer Playbook, the only newsletter on Earth tailor-made for engineers of money software. Every week, more than 2,000 subscribers from companies like Stripe, Coinbase and Modern Treasury get a deep dive on how to build software that moves money around. Not to pass interviews, but to do their job exceptionally well.

When money is on the line, stakes are sky high and the margin for error is razor thin.

In The Payments Engineer Playbook, we investigate the technology that transfers money. All to help you become a smarter, more skillful and more successful payments engineer. And we do that by cutting off one sliver of it and extracting insights from it.

Here’s what you can expect in today’s article:

  • Why DynamoDB works for payments but breaks when you use it as a ledger

  • The napkin math that would have saved Uber millions of dollars

  • And one shocking conclusion from all of this

Enough intro, let’s dive in.

But first: is DynamoDB a bad choice for financial software?

Not necessarily. I’ve already covered DynamoDB as a potential data store for payment systems, and it has many features that are worth it: zero-downtime migrations, low latency for a global audience, and built-in replication and failover.

If you’re accepting payments at scale around the globe, DynamoDB is a great choice.

That’s because DynamoDB, while not enforcing full linearizability across partitions, can guarantee consistency within a Region: you get strong consistency on a per-partition basis, but not across partitions. Enforcing full linearizability across all of your data is something PostgreSQL can do; DynamoDB cannot. And for a global-scale payments system, that’s a trade-off worth making.

This is quite a property when it comes to payments, because they are independent from each other. You can interleave the authentication of one with the capture of another. There’s no need to maintain full linearizability across your data; causal consistency is enough. DynamoDB trades off the linearizability that you don’t actually need for all those nice features I mentioned earlier, which means that for large enterprises that serve customers all over the world in high volume and frequency, DynamoDB is better than PostgreSQL.

But a ledger isn’t a payments system.

A ledger cannot simply say “hey, this account and that account can be dealt with independently”. The scope of a ledger system is The World; a data store that can’t enforce full linearizability isn’t going to cut it, no matter how good at throughput and latency it is.
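To make that contrast concrete, here is a minimal sketch (the names and data structures are mine, not Uber's) of why a double-entry posting resists partitioning: every entry debits one account and credits another, so two balance updates must commit atomically even when those accounts live on different partitions.

```python
from dataclasses import dataclass, field

@dataclass
class Ledger:
    # account name -> balance, in minor units (cents)
    balances: dict = field(default_factory=dict)

    def post(self, debit: str, credit: str, amount: int) -> None:
        """A double-entry posting touches TWO accounts.

        On a single-node store (e.g. PostgreSQL) this is one ACID
        transaction. On a partitioned store, `debit` and `credit` may
        hash to different partitions, so a per-partition consistency
        guarantee cannot make this atomic on its own.
        """
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.balances[debit] = self.balances.get(debit, 0) - amount
        self.balances[credit] = self.balances.get(credit, 0) + amount

    def total(self) -> int:
        # The global invariant: all entries net to zero.
        # Verifying it needs a consistent view of EVERY account at once.
        return sum(self.balances.values())

ledger = Ledger()
ledger.post(debit="rider:alice", credit="driver:bob", amount=1250)
assert ledger.total() == 0  # holds only if postings are atomic
```

The `total() == 0` invariant is exactly the kind of whole-ledger property that breaks if one half of a posting lands and the other does not.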

In other words: DynamoDB works well in payments because payments can give up global consistency for better availability. But ledgers can’t give up global consistency, even if that means they get worse availability.

DynamoDB capacity pricing is based on two main models: Provisioned and On-demand. You can buy reads and writes in bulk, buy reads and writes when you need them, or both.

And then, there’s the storage and add-on features. For most, throughput is the real deal, and storage is just the cream on top of the invoice. But for data-heavy applications, such as ledgers, storage can become the dominant cost. That’s why reserving capacity is important, but it demands some forecasting from you, or at least an educated guess about how much capacity you’ll need in advance, because you’ll get a discount of more than 50 percent if you get it right.
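As an illustration of how the two capacity models diverge, here is a sketch. The prices are assumptions based on published us-east-1 list prices from around that era ($1.25 per million on-demand write units; $0.00065 per provisioned WCU-hour), and the peak-to-average ratio is invented for illustration; check the current pricing page before relying on any of it.

```python
# Back-of-envelope comparison of DynamoDB's two capacity models.
ON_DEMAND_PER_MILLION_WRITES = 1.25   # USD, assumed list price
PROVISIONED_PER_WCU_HOUR = 0.00065    # USD, assumed list price

def on_demand_daily_cost(writes_per_day: float) -> float:
    # On-demand: you pay per write request unit consumed.
    return writes_per_day / 1e6 * ON_DEMAND_PER_MILLION_WRITES

def provisioned_daily_cost(writes_per_day: float, peak_to_avg: float = 2.0) -> float:
    # Provisioned: you pay per WCU-hour reserved, and the reservation
    # must cover the PEAK write rate, so bursty traffic erodes the discount.
    avg_wcu_per_sec = writes_per_day / 86_400
    provisioned_wcu = avg_wcu_per_sec * peak_to_avg
    return provisioned_wcu * PROVISIONED_PER_WCU_HOUR * 24

writes = 550e6  # the article's estimate of daily WCU consumption
print(f"on-demand:   ${on_demand_daily_cost(writes):,.0f}/day")
print(f"provisioned: ${provisioned_daily_cost(writes):,.0f}/day")
```

Under these assumptions, even with 2x headroom over the average rate, provisioned capacity comes out well under half the on-demand bill, which is where the 50-percent-plus discount comes from.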

If you use DynamoDB at scale, you need to do some napkin math.

In 2017, Uber was doing around 11 million trips a day. Assuming 10 entries per trip and 5 WCUs per entry, that’s 550 million writes per day, and at $1.25 per million writes, that’s $687 per day.

$687 per day doesn’t sound like a lot. But that’s $250K a year, just in writes.

With 3x annual growth, the math is unsustainable: by year 3, we’re talking $2.25 million a year. I don’t have visibility into reads, indexes and global tables, but at Uber’s scale, the read side likely costs as much as the write side.

Which means that Uber was burning 5 million dollars in a freaking ledger.

Based on Uber’s data, by 2020 they had accumulated 1.2 petabytes of data. That, at $0.25 per gigabyte per month, is $300K per month. And assuming the same 3x annual growth from 2017 to 2020 with a final size of 1.2 petabytes, that’s a cumulative cost of $3.5 million.

No wonder they switched to storing only the last 12 weeks of data in Dynamo, and stored the rest on premises.

Add the writes to the storage, and you’re looking at an 8 million dollar bill for a ledger that didn’t need to exist on DynamoDB in the first place.
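The napkin math above can be reproduced in a few lines. This is a sketch using only the inputs stated in the article; the prices are the list prices the article assumes, not current ones.

```python
# The article's napkin math, end to end.
TRIPS_PER_DAY = 11e6               # Uber, 2017
ENTRIES_PER_TRIP = 10              # assumed ledger entries per trip
WCU_PER_ENTRY = 5                  # assumed write capacity per entry
PRICE_PER_MILLION_WRITES = 1.25    # USD, assumed list price
PRICE_PER_GB_MONTH = 0.25          # USD, assumed list price

writes_per_day = TRIPS_PER_DAY * ENTRIES_PER_TRIP * WCU_PER_ENTRY
write_cost_day = writes_per_day / 1e6 * PRICE_PER_MILLION_WRITES
write_cost_year = write_cost_day * 365

print(f"writes/day:      {writes_per_day:,.0f}")        # 550,000,000
print(f"write cost/day:  ${write_cost_day:,.2f}")       # ~$687
print(f"write cost/year: ${write_cost_year:,.0f}")      # ~$250K
print(f"year 3 at 3x/yr: ${write_cost_year * 9:,.0f}")  # ~$2.25M

storage_gb = 1.2e6  # 1.2 PB accumulated by 2020
storage_cost_month = storage_gb * PRICE_PER_GB_MONTH
print(f"storage cost/mo: ${storage_cost_month:,.0f}")   # ~$300K
```

Ten minutes with numbers like these, before committing to the platform, is the whole argument of this article.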

What do you do with an 8 million dollar bill? You turn it into a case study.

Since 2020, Uber has migrated away from DynamoDB into their own internal ledger, called LSG (Ledger Store...Gateway?), built on top of their own internal distributed database called DocStore:

Docstore is a general-purpose multi-model database that provides a strict serializability consistency model on a partition level and can scale horizontally to serve high volume workloads. Features such as Transaction, Materialized View, Associations, and Change Data Capture combined with modeling flexibility and rich query support, significantly improve developer productivity, and reduce the time to market for new applications at Uber.

Evolving Schemaless into a Distributed SQL Database

Why not use an open-source alternative? Because Uber builds in-house. That’s what Uber does.

You could argue that DocStore provided the features they needed in a way no other alternative could. But you would be wrong!

Our homegrown Docstore was a perfect match for our database requirements, except for Change Data Capture (CDC) a.k.a., streaming functionality. [...] We decided to build a streaming framework for Docstore (project name “Flux”) and used that for LedgerStore’s Manifest generation.

How Uber Migrated Financial Data from DynamoDB to Docstore

So let me get this straight: DynamoDB was a bad choice because it was expensive, which is something you could have figured out in advance. You then decided to move everything to an internal data store that had been built for something else, that was available when you decided to build on top of DynamoDB. And that internal data store wasn’t good on its own, so you had to build a streaming framework to complete the migration.

And nobody got fired for this?

But nobody was optimizing for cost. They were optimizing for their next promotion. Each rewrite was a new proposal, a new design doc, a new system to put on a resume. The incentive was never to pick the boring, correct choice — it was to pick the complex, impressive one.

This isn’t Metaverse-levels of disaster, but relative to Uber’s scale, it gets pretty close.

What bothers me the most is this: by 2019, it was painfully obvious that Uber had made a terrible decision when they built LedgerStore on top of DynamoDB.

And yet, when AWS invited Uber to present at re:Invent 2019, they said yes.

I’ve written about Uber’s testing practices before, and praised them for it — the article hit the front of Hacker News.

But let’s call a spade a spade: when you actively disguise an atrocious decision as a case study for a database technology, you’re no less fraudulent than one of those hedge fund managers talking their book on TV.

It is the technological equivalent of an arsonist writing a fire safety manual.

On a second level, there are the publications that regurgitated this case study without looking at the full picture: ByteByteGo has an article on LedgerStore praising “The cost savings from this migration” with yearly savings “exceeding $6 million due to reduced spend on DynamoDB”.

I can’t possibly comment on this.

When you’re tasked with building a system of any kind, not just ledgers, the technology is never enough. If you’re building a system that makes the economics of your company impossible, you’re better off not building it.

Focusing solely on the technical requirements, and not seeing the costs, is a disservice to the business that employs you.

Uber didn’t make a ledger mistake. It set the wrong incentives.

And it paid millions of dollars for it.

This was The Payments Engineer Playbook. I’ll see you next week.

Feel free to share this article with a system designer about to make a costly mistake.
