I built Vector. Now I'm answering the question your observability vendor won't

Original link: https://usetero.com/blog/the-question-your-observability-vendor-wont-answer

## The False Promise of Observability

The author has spent a decade in observability, including founding and exiting the logging platform Timber.io (now Vector), and has come away deeply disillusioned. Despite enormous investment and constant innovation, the industry still faces the fundamental problems of runaway cost and misaligned goals between vendors and users. The core problem: **observability has become about managing *cost*, not gaining *insight*.** Teams spend their time policing logs and metrics, worrying about surprise bills, and constantly negotiating with vendors who put revenue ahead of customer success. Strikingly, **as much as 40% of observability data is waste**, yet vendors avoid quantifying this and instead sell expensive products for analyzing the noise. This is not a lack of tooling; it is a lack of *understanding*. The author's recent work exposed this waste across multiple organizations and led to a system that automatically identifies and filters unnecessary data. The result? Smaller bills, simpler pipelines, and engineers focused on meaningful signal rather than noise. The author argues the future of observability lies in **understanding data, not collecting more of it**, and in vendor relationships built on shared success, a future he is now pursuing with his new company, Tero.

Binarylogic, the creator of Vector and a former Datadog employee, is launching Tero (usetero.com) to address what he sees as fundamental flaws in today's observability market. He argues that roughly 40% of collected observability data is wasted, saddling companies with unnecessary cost. The Hacker News discussion underscored this pain point: one commenter noted that even with extensive monitoring, root causes often can't be pinned down and the original developer's expertise is still needed. Another commenter, a former colleague, thanked Binarylogic and confirmed hitting the same cost inefficiencies in startup infrastructure and observability roles. Binarylogic's motivation comes from his decade in observability, and he aims to offer a more efficient, focused solution with Tero. One user, however, reported that the site link was broken at launch.

Original article

This year marks a decade for me in observability.

I left my engineering job in 2016 to start Timber.io, a hosted logging platform, because I thought logs could be simple and great. Timber became Vector. Vector got mass adoption. It got acquired, and I stayed for three years.

And somewhere along the way, the optimism curdled.

I'm not a cynical person. I believed observability could make engineers' lives better. But after a decade, after hundreds of conversations with teams bleeding money across every major vendor, after hearing firsthand how their vendors strong-armed them instead of helping, I've seen enough. The whole industry has lost the plot.

Does any of this sound familiar?

You run observability at your company. But really, you're the cost police. You wake up to a log line in a hot path, a metric tag that exploded cardinality. You chase down the engineer. They didn't do anything wrong; they're just disconnected from what any of this costs. The renewal is always in the back of your mind because mismanaging it reflects poorly on you.

Sometimes you catch these mistakes. Sometimes you don't. When you don't, you crawl to your rep asking for forgiveness. Maybe they help the first time, even the second. By the fourth or fifth, they stop. "It's your data." But even with the mistakes, if you're diligent, checking dashboards, staying on top of things, you manage to stay under your commit and avoid an early renewal.

But the renewal still gives you a black eye: 40% higher than last year. Your budget didn't grow that much. So you consider switching vendors, but asking your engineers to frantically migrate dashboards, alerts, and workflows is a distraction that also reflects poorly on you. You're in a lose-lose situation.

So you go back to your vendor and ask them to help. You championed them internally; brought them six, seven figure business. Surely they'd return the favor: a slightly bigger discount, or help cutting costs by showing you what data is safe to drop. But they don't budge. They could help; they don't.

Case Taintor, Director of Engineering at Klarna, put it all too well:

The most frustrating part of watching your money burn is knowing your supplier could help if they only cared about your long term success.

So why has this gone on for over a decade? Something is deeply wrong if after ten years these same problems not only exist, but have gotten worse.

But what's wrong, exactly? Should your vendor help you? It is your data. They didn't create it. You sent it to them under their pricing model. For years I accepted that framing too. Maybe this is just how it works.

Then I bumped into a question that changed my thinking.

How much of my observability data is waste?

You've asked it. Your vendor has asked it. You know the answer isn't zero. But what is it? 10%? 20%? 40%? At what point does "that's just how it works" stop being an acceptable answer?

You see, anyone who's been in this space knows that cost is far and away the biggest problem. You can take all of the other problems, bundle them together, multiply them by 100, and they still would not surpass cost. It shows up everywhere. All of the "innovation" in observability can be traced back to cost in some way. Pipelines? Cost. Fancy new storage engines? Cost. OpenTelemetry? Yes, cost.

So in that context, this seems like a pretty important question. Maybe the most important question in observability. Which means it must be unanswerable, right? Because if someone could answer it and let you keep paying for garbage anyway, that would be unconscionable.

Put it to the test. Ask your vendor what percentage of your data is waste. They'll play ignorant. "It's your data." They don't understand it well enough to tell you what's worth keeping. But they understand it well enough to sell you an AI SRE that can "root cause in minutes."

It's this willful ignorance that gets me. Everyone knows what's right but plays the quarterly earnings game instead. Except it's not a game for the people on the other side. I got a front row seat with Vector users. Vector wasn't deployed for fun; it was often deployed in crisis, usually around renewal time when the cost of this game came due. I watched people lose their jobs for "mismanaging" the observability budget. I saw the stress on their faces, the lost sleep.

So when I first bumped into this question while helping a Vector user, and wanted to answer it but couldn't, that's when my optimism curdled.

So I answered it

After I left Vector, the question stayed with me. I took a year off, but Vector users still found me with questions. One in particular jumped out because it was impossible not to: emails, LinkedIn messages, people in my network pinging me on their behalf. I wasn't annoyed. I knew exactly what was going on. So I agreed to help. Except this time, no roadmaps, no one telling me what to do. In exchange, they'd give me access to their data so I could try to answer the question, which I suspected was their actual problem anyway.

So I signed all the docs, got access to their Vector environment, and took a look at their Vector config. It was the mother of all configs (sorry guys, no offense). Dozens of components connected into a complex DAG. Every cost reduction trick in the book: sampling, aggregating, storage tiering, archiving, and a massive list of regexes to match and drop waste. But I wasn't appalled, I respected it. They weren't being careless, they were doing everything they possibly could.
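For readers who haven't lived inside one of these configs, here is a minimal, invented sketch of what a single slice of that kind of pipeline can look like in Vector's TOML: a file source feeding a `filter` transform that drops lines matching hand-written waste regexes before anything reaches the vendor sink. The paths, patterns, and endpoint are illustrative, not this team's actual configuration; their real DAG had dozens of such components.

```toml
# Illustrative Vector pipeline slice (not the real config):
# tail app logs, drop lines matching known-waste regexes, ship the rest.

[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]

[transforms.drop_waste]
type = "filter"
inputs = ["app_logs"]
# Keep only events whose message does NOT match a waste pattern.
condition = '''
!match(string!(.message), r'health[ -]?check|GET /ping|connection reset by peer')
'''

[sinks.vendor]
type = "http"
inputs = ["drop_waste"]
uri = "https://ingest.example-vendor.invalid/v1/logs"
encoding.codec = "json"
```

Every pattern in that `condition` has to be written, reviewed, and maintained by hand, which is exactly why the list becomes both the bottleneck and the record of what the team actually understands.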

One trick in particular intrigued me: the regex list. It was the bottleneck, but it was also something else: an expression of understanding. Every pattern represented an engineer who understood their service well enough to say "this is waste." My first instinct was to optimize it. I stumbled on Hyperscan. Turns out you can compile tens of thousands of patterns and still match at line rate. That flipped my thinking: what if I took this to the extreme and automated that understanding to produce thousands of patterns?
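As a rough sketch of that flip, the snippet below assumes the python-hyperscan bindings (the patterns are invented and the API may differ slightly by version): compile a batch of waste regexes into one database up front, then classify each log line in a single scan. Because the compiled database matches all patterns in one pass, the per-line cost stays roughly flat as the pattern count grows into the thousands.

```python
import hyperscan

# Invented example patterns; a real set would number in the thousands.
patterns = [
    rb"health[ -]?check",
    rb"GET /ping HTTP",
    rb"connection reset by peer",
]

# Compile every pattern into a single block-mode database.
db = hyperscan.Database()
db.compile(
    expressions=patterns,
    ids=list(range(len(patterns))),
    elements=len(patterns),
    flags=[hyperscan.HS_FLAG_CASELESS] * len(patterns),
)

def is_waste(line: bytes) -> bool:
    """Return True if the line matches any known-waste pattern."""
    hits = []
    db.scan(line, match_event_handler=lambda pid, start, end, flags, ctx: hits.append(pid))
    return bool(hits)

print(is_waste(b"GET /ping HTTP/1.1 200"))    # True
print(is_waste(b"payment authorized id=42"))  # False
```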

So I built a system to do exactly that. It compressed billions of logs into thousands of semantic events, each one evaluated with the context it needed: the service, the failure scenarios, the patterns, how it all fits together. (The deep details are outside the scope of this post, but if you're curious, here's how it works today.)
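The compression step is in the same family as classic log templating: strip the variable parts of each line (ids, numbers, durations) so that billions of raw lines collapse into a few thousand recurring shapes, each of which can be judged once. Below is a toy sketch of that idea, not the actual system, with deliberately simplistic normalizers.

```python
import re
from collections import Counter

# Simplistic normalizers; a real system would use smarter tokenization
# and clustering (e.g. Drain-style template mining) plus service context.
NORMALIZERS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<uuid>"),
    (re.compile(r"\b[0-9a-f]{6,}\b"), "<id>"),
    (re.compile(r"\b\d+(\.\d+)?(ms|s|%)?\b"), "<num>"),
]

def to_template(line: str) -> str:
    """Collapse a raw log line into its recurring shape."""
    for pattern, placeholder in NORMALIZERS:
        line = pattern.sub(placeholder, line)
    return line

lines = [
    "request 7f3a9c completed in 12ms",
    "request 81bb02 completed in 430ms",
    "payment failed for order 1234: card declined",
]

# Billions of raw lines reduce to a short list of templates,
# each of which can be evaluated once for signal vs. waste.
templates = Counter(to_template(l) for l in lines)
for template, count in templates.most_common():
    print(count, template)
```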

I ran it against the first service: ~40% waste. Another: ~60%. Another: ~30%. On average, ~40% waste.

I knew the number wasn't zero, but I wasn't expecting 40%. So I pressure tested it. Went through hundreds of lines manually. Checked it against their existing patterns. It checked out. With that confidence, I brought it to them.

They laughed. "We can't just drop half of our logs." Fair. But that's not what I was asking. I showed them: this wasn't anything new. It was the same analysis they were already doing, just at scale, more complete, more accurate. Most of their hand-written patterns were already represented in my set, often simpler and faster. They could tweak the analysis, roll it out slowly, push it to teams to take action in their own code.

And that's what happened. The knowledge stopped the bleeding. Over time, services cleaned up their logging. Pipelines got simpler. Bills went down. Not because anyone dropped data recklessly, but because they finally knew what was worth keeping.

Why observability feels broken

The answer to this question isn't just a number. It's the answer to why observability feels broken despite it being more expensive than ever. Think about it.

On the surface: you're paying twice what you should. Cut the waste, cut the bill. Simple.

Go deeper: the cost policing, the weekly dashboard checks, the monthly exercises, the begging your rep for forgiveness when someone's log blows up the bill, the pipelines. All of that exists because you're managing garbage. Half the complexity you've built is dedicated to noise.

Go deeper still: your engineers complain that observability doesn't help them debug faster despite costing millions. Of course it doesn't. They're drowning in noise and calling it data. The alerts fire on garbage. The dashboards are cluttered with garbage. The AI can't find the signal because there's too much garbage in the way.

And underneath all of it: this number shouldn't exist if your vendor was aligned with you.

Take a look around the market. $65M bills. $170M bills. Entire roles for cost control. "Observe without limits." "Stop sampling." "More data, more insight." Dozens of products. It's all backwards. The goal isn't more data, more products, or more complexity.

The goal is understanding with less.

And how do you prove understanding? The question. Either you understand the data well enough to answer it or you don't.

There's a future where you're not the cost cop. Where observability just works. Where your vendor's success depends on yours.

That's the future we're building at Tero.

Get your number.
