Observability's past, present, and future

原始链接: https://blog.sherwoodcallaway.com/observability-s-past-present-and-future/

## The Evolution and Future of Observability

This article traces the history of observability and asks why, in 2026, it so often falls short of expectations despite heavy investment. Observability emerged in the early 2010s as a response to the growing complexity of cloud-native applications (microservices, containers, and rapid CI/CD cycles) under which traditional monitoring broke down. Distributed tracing (pioneered by Google's Dapper and popularized by tools like Honeycomb and Jaeger) and the philosophy of "observability" (originating at Twitter) emerged as the solution.

The author argues, however, that over-investment in tools and processes (more instrumentation, dashboards, and alerts) has actually made observability *harder*. Despite abundant telemetry, understanding application behavior and resolving incidents remains challenging and time-consuming. The core problem is not data *collection* but data *interpretation*, and turning insights into reliable systems.

Looking ahead, the author believes observability will be essential for coping with the coming "infinite software crisis," driven by the rise of AI and low-code/no-code platforms, which will dramatically increase the volume and velocity of software. Observability in its *current* form, however, is not enough; a new approach is needed to manage this unprecedented scale and complexity.

## Observability: From Complex Setups to AI-Driven Solutions

A recent Hacker News discussion highlighted the growing pains of modern observability. While tooling has proliferated (OpenTelemetry, Zipkin, eBPF, and more), actually *using* it effectively remains surprisingly hard. Many struggle with complex setup, ongoing maintenance, and steep learning curves, and feel overwhelmed by the sheer flood of data.

The core problem is not a lack of signals but the difficulty of turning those signals into actionable insight and improved reliability. Commenters point to the need for better data engineering, standardization, and "detectives" who can bridge the gap between domain knowledge and data analysis.

A key challenge is misaligned incentives: companies tend to prioritize observability only when it affects the bottom line. There is growing hope that AI can automate root-cause analysis, and even submit bug fixes, to lighten the load. Some advocate simpler approaches, such as the VictoriaMetrics stack, while others envision continuous, full recording for detailed debugging, though cost and compliance remain obstacles. Ultimately, the future of observability may lie in automation and AI-driven insight, making it accessible beyond specialist roles.

Original article

In my last post, Round Two, I wrote about my career, my passion for dev tools, and my decision to start a new company focused on observability.

I also wrote about my frustration with observability as it exists today. Why are the tools and workflows so bad? Why does it feel like so much work?

In today's post, I want to unpack that frustration by looking at observability in historical context. I'll examine observability through a simple three-part lens: its past, present, and future.

My goal is to understand and explain:

  1. Why observability emerged in the first place
  2. How observability evolved into the mess it is today
  3. Why, despite all the progress we've made, maintaining reliable systems is still SO hard in 2026

Let's go!

PART 1: OBSERVABILITY'S PAST

To understand observability, it helps to look at the environment that gave rise to it.

In the early 2010s, software engineers faced a crisis: with the rise of cloud computing, containers, and microservices, apps were becoming increasingly complex - too complex for any individual to fully understand.

Meanwhile, CI/CD was becoming more common, which meant an increase in the rate at which changes were deployed to production. The result: more frequent bugs and outages.

Suddenly, our old reliability playbook stopped working. We couldn't predict all the edge-cases, let alone write tests for them. Failure modes increasingly emerged from the complex interaction between services. When it came to root-cause analysis, logs and basic metrics were no longer enough.

Fortunately, the industry found a solution.

It came in two parts: a tool and a philosophy.

The tool was distributed tracing. After a quiet start in 2010, distributed tracing grew steadily over the ensuing decade until it was essentially ubiquitous:

  • 2010: Google publishes Dapper, the first major paper on distributed tracing.
  • 2012: Twitter introduces Zipkin, inspired by Dapper.
  • 2015: Honeycomb, one of the first managed tracing platforms, is founded.
  • 2016: OpenTracing is adopted by CNCF.
  • 2016: Uber introduces Jaeger, inspired by Zipkin.
  • 2017: Datadog launches APM, their managed tracing solution.
  • 2018: O'Reilly publishes "Distributed Systems Observability".

The philosophy was observability. Originally coined by a rocket scientist in the 1960s, the term "observability" was popularized in the software community by Twitter's engineering team during the early 2010s. It quickly grew into a full-fledged product category and engineering discipline:

  • 2013: Twitter publishes "Observability at Twitter".
  • 2015: Charity Majors (Honeycomb founder) begins writing about observability, formalizing key concepts.
  • 2017: Peter Bourgon (SoundCloud engineer who worked on Prometheus) coins the "Three Pillars".
  • 2022: O'Reilly publishes "Observability Engineering".

In summary, distributed tracing gave engineers a way to debug modern apps, which were cloud-based and distributed by nature, while observability gave them the framework for how to think about reliability in this new operating environment.
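
(For concreteness, here is a minimal sketch of what tracing instrumentation looks like with the OpenTelemetry Python SDK. This is not from the original post; the service and span names are invented for illustration.)

```python
# A minimal tracing sketch using the OpenTelemetry Python SDK
# (requires the `opentelemetry-sdk` package; names below are invented).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer that prints finished spans to the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

def handle_checkout(order_id: str) -> None:
    # Each unit of work becomes a span; nested spans form a trace, and in a real
    # system context propagation carries the trace across service boundaries.
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("charge_payment"):
            pass  # call the payment service here
        with tracer.start_as_current_span("reserve_inventory"):
            pass  # call the inventory service here

handle_checkout("order-123")
```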

My point here is that observability wasn't born in a vacuum. It emerged as a response to real-world engineering challenges. And it persisted because, for the most part, it worked.

But then things took a turn for the worse. Encouraged by early results, engineering teams started to over-invest in observability tools and processes.

More instrumentation. More dashboards. More monitors. SLOs. Error budgets. Runbooks. Postmortems. By the early 2020s, observability was no longer a means to an end. It was an end unto itself.
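
(As a refresher, an error budget is just the unreliability an SLO permits. Under a hypothetical 99.9% availability target over a 30-day window, that works out to about 43 minutes of allowed downtime; the numbers below are illustrative, not from the post.)

```python
# Back-of-the-envelope error-budget arithmetic (illustrative numbers).
slo = 0.999                    # 99.9% availability target
window_minutes = 30 * 24 * 60  # 30-day window = 43,200 minutes

error_budget_minutes = (1 - slo) * window_minutes
print(f"Allowed downtime per 30 days: {error_budget_minutes:.1f} minutes")
# -> Allowed downtime per 30 days: 43.2 minutes
```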

PART 2: OBSERVABILITY'S PRESENT

Today, most engineers will agree: observability is table stakes. If you're running production systems at almost any scale, you need a way to detect, mitigate, and resolve issues when they occur.

This is the role of modern observability platforms like Datadog, Grafana, and Sentry.

Yet, when I think about the current state of observability, one question stands out: why, after 10+ years of investment in better tools and processes, does observability still suck?

I'm serious!

Think about it.

Instrumentation takes forever. Dashboards are constantly out of date. Monitors misfire. Alerts lack context. On-call is a permanent tax on engineering productivity. Incident resolution can take hours, sometimes days.

We have more telemetry than ever, but getting an accurate mental model of your app in production is still a major challenge - not just for new grads and vibe-coding Zoomers, but for experienced engineers too.

It's not for lack of effort. At companies of all sizes, engineering teams take observability seriously, investing enormous amounts of time and energy into implementing and maintaining observability systems.

They shell out for Datadog. They instrument EVERYTHING. They adopt structured logging and enforce standardized naming/labeling conventions. They create "golden" dashboards. They painstakingly tune their monitors. By any reasonable definition, these teams are doing everything right.
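
(In practice, "structured logging with standardized labels" means something like the sketch below: JSON log lines with consistent field names. The service and field names here are invented for illustration, not taken from the post.)

```python
# A minimal structured-logging sketch with standardized field names (illustrative).
import json, logging, sys, time

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line with consistent fields."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": time.time(),
            "level": record.levelname,
            "service": "checkout-service",    # standardized label shared across services
            "message": record.getMessage(),
            **getattr(record, "fields", {}),  # structured key/value context, if provided
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# `extra` attaches structured context to the record under a "fields" attribute.
log.info("payment charged", extra={"fields": {"order_id": "order-123", "amount_usd": 42.0}})
```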

Yet there's a gap.

The amount of effort we put into observability does NOT line up with the progress we've made towards its goals: better detection, faster root-cause analysis, and more reliable apps.

Why?

It's because the real problem isn't about data, tooling, or process. It's about our ability - or inability - to understand and reason about the data we already have.

Observability made us very good at producing signals, but only slightly better at what comes after: interpreting them, generating insights, and translating those insights into reliability.

PART 3: OBSERVABILITY'S FUTURE

It's true that observability isn't living up to expectations. But that doesn't mean it's not useful or important. In fact, I believe observability is poised to become one of the most important technology categories of the coming decade.

The reason: software is changing again.

In the 2010s, observability emerged as an antidote to complexity. In 2026, software engineers face the largest complexity crisis we've ever come up against: AI.

AI is cutting the cost of writing code to zero. As a result, engineering teams are shipping vast amounts of features at breakneck speed. Codebases are getting bigger and bigger.

Meanwhile, vibe-coding platforms have brought software development to the masses. More apps will be built and deployed this year than in all previous years combined.

We're on the verge of an "infinite software crisis".

This raises an uncomfortable question: how will we support, maintain, and operate this ever-growing mountain of software? I'm willing to bet the answer is observability.

Just not the version we have today...

