In my last post, Round Two, I wrote about my career, my passion for dev tools, and my decision to start a new company focused on observability.
I also wrote about my frustration with observability as it exists today. Why are the tools and workflows so bad? Why does it feel like so much work?
In today's post, I want to unpack that frustration by looking at observability in historical context. I'll examine observability through a simple three-part lens: its past, present, and future.
My goal is to understand and explain:
- Why observability emerged in the first place
- How observability evolved into the mess it is today
- Why, despite all the progress we've made, maintaining reliable systems is still SO hard in 2026
Let's go!
PART 1: OBSERVABILITY'S PAST
To understand observability, it helps to look at the environment that gave rise to it.
In the early 2010s, software engineers faced a crisis: with the rise of cloud computing, containers, and microservices, apps were becoming increasingly complex - too complex for any individual to fully understand.
Meanwhile, CI/CD was becoming more common, which meant an increase in the rate at which changes were deployed to production. The result: more frequent bugs and outages.
Suddenly, our old reliability playbook stopped working. We couldn't predict all the edge cases, let alone write tests for them. Failure modes increasingly emerged from complex interactions between services. When it came to root-cause analysis, logs and basic metrics were no longer enough.
Fortunately, the industry found a solution.
It came in two parts: a tool and a philosophy.
The tool was distributed tracing. After a quiet start in 2010, distributed tracing grew steadily over the ensuing decade until it was essentially ubiquitous:
- 2010: Google publishes Dapper, the first major paper on distributed tracing.
- 2012: Twitter introduces Zipkin, inspired by Dapper.
- 2016: Honeycomb, one of the first managed tracing platforms, is founded.
- 2016: OpenTracing is adopted by CNCF.
- 2016: Uber introduces Jaeger, inspired by Zipkin.
- 2017: Datadog launches APM, their managed tracing solution.
- 2018: O'Reilly publishes "Distributed Systems Observability".
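For readers who haven't used it: the core idea behind Dapper-style tracing is simple. Each request gets a trace ID, and every unit of work along the way is recorded as a span that points back to its parent span, so a backend can stitch the spans into a request tree. Here's a toy sketch of that model (the `Span` class and all names are my own illustration, not any real SDK):

```python
import contextvars
import time
import uuid

# Toy illustration of the Dapper-style model: every request gets a trace_id,
# and each unit of work is a "span" that remembers its parent, so spans can
# be reassembled into a tree. Invented for illustration - not a real SDK.
_current_span = contextvars.ContextVar("current_span", default=None)
collected = []  # stand-in for a trace backend (Zipkin, Jaeger, etc.)

class Span:
    def __init__(self, name):
        parent = _current_span.get()
        self.name = name
        # Child spans inherit the trace_id; a root span mints a new one.
        self.trace_id = parent.trace_id if parent else uuid.uuid4().hex
        self.span_id = uuid.uuid4().hex[:16]
        self.parent_id = parent.span_id if parent else None

    def __enter__(self):
        self._start = time.monotonic()
        self._token = _current_span.set(self)
        return self

    def __exit__(self, *exc):
        self.duration_ms = (time.monotonic() - self._start) * 1000
        _current_span.reset(self._token)
        collected.append(self)
        return False

# One request flowing through a gateway and two downstream "services":
with Span("gateway") as root:
    with Span("auth-service"):
        time.sleep(0.001)
    with Span("order-service"):
        time.sleep(0.001)

for s in collected:
    print(f"{s.name}: trace={s.trace_id[:8]} parent={s.parent_id}")
```

Real SDKs add context propagation across process boundaries (HTTP headers, message metadata), sampling, and export pipelines, but the data model is essentially this.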
The philosophy was observability. Originally coined by control theorist Rudolf Kálmán around 1960, the term "observability" was popularized in the software community by Twitter's engineering team during the early 2010s. It quickly grew into a full-fledged product category and engineering discipline:
- 2013: Twitter publishes "Observability at Twitter".
- 2015: Charity Majors (Honeycomb founder) begins writing about observability, formalizing key concepts.
- 2017: Peter Bourgon (SoundCloud engineer who worked on Prometheus) coins the "Three Pillars".
- 2022: O'Reilly publishes "Observability Engineering".
In summary, distributed tracing gave engineers a way to debug modern apps, which were cloud-based and distributed by nature, while observability gave them the framework for how to think about reliability in this new operating environment.
My point here is that observability wasn't born in a vacuum. It emerged as a response to real-world engineering challenges. And it persisted because, for the most part, it worked.
But then things took a turn for the worse. Encouraged by early results, engineering teams started to over-invest in observability tools and processes.
More instrumentation. More dashboards. More monitors. SLOs. Error budgets. Runbooks. Postmortems. By the early 2020s, observability was no longer a means to an end. It was an end unto itself.
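To pick one item from that list: an error budget is just the arithmetic complement of an SLO. For example, a 99.9% availability target over a 30-day window permits roughly 43 minutes of downtime:

```python
# An SLO sets a reliability target; the error budget is the unreliability
# the target still permits. Example: 99.9% availability over 30 days.
slo = 0.999
window_minutes = 30 * 24 * 60          # 43,200 minutes in the window
budget_minutes = (1 - slo) * window_minutes
print(f"error budget: {budget_minutes:.1f} minutes of downtime per 30 days")
```

Teams then track how much of the budget an incident "spends" and slow down releases as the budget runs out.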
PART 2: OBSERVABILITY'S PRESENT
Today, most engineers will agree: observability is table stakes. If you're running production systems at almost any scale, you need a way to detect, mitigate, and resolve issues when they occur.
This is the role of modern observability platforms like Datadog, Grafana, and Sentry.
Yet, when I think about the current state of observability, one question stands out: why, after 10+ years of investment in better tools and processes, does observability still suck?
I'm serious!
Think about it.
Instrumentation takes forever. Dashboards are constantly out of date. Monitors misfire. Alerts lack context. On-call is a permanent tax on engineering productivity. Incident resolution can take hours, sometimes days.
We have more telemetry than ever, but building an accurate mental model of your app in production is still a major challenge - not just for new grads and vibe-coding Zoomers, but for experienced engineers too.
It's not for lack of effort. At companies of all sizes, engineering teams take observability seriously, investing enormous amounts of time and energy into implementing and maintaining observability systems.
They shell out for Datadog. They instrument EVERYTHING. They adopt structured logging and enforce standardized naming/labeling conventions. They create "golden" dashboards. They painstakingly tune their monitors. By any reasonable definition, these teams are doing everything right.
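As a concrete sample of that work, here's a minimal sketch of structured logging with a standardized field convention, using only Python's stdlib (the `service`/`env`/`trace_id` field set is a hypothetical convention, for illustration):

```python
import json
import logging

# Emit every log line as JSON with a standardized field set (hypothetical
# convention: service, env, trace_id on every line) so logs can be queried
# like data instead of grepped as text.
class JsonFormatter(logging.Formatter):
    STANDARD_FIELDS = ("service", "env", "trace_id")

    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` land as attributes on the record.
        for field in self.STANDARD_FIELDS:
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "payment authorized",
    extra={"service": "checkout", "env": "prod", "trace_id": "abc123"},
)
```

Multiply this by every service, every dashboard, and every monitor, and you get a sense of where the engineering hours go.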
Yet there's a gap.
The amount of effort we put into observability does NOT line up with the progress we've made towards its goals: better detection, faster root-cause analysis, and more reliable apps.
Why?
It's because the real problem isn't about data, tooling, or process. It's about our ability - or inability - to understand and reason about the data we already have.
Observability made us very good at producing signals, but only slightly better at what comes after: interpreting them, generating insights, and translating those insights into reliability.
PART 3: OBSERVABILITY'S FUTURE
It's true that observability isn't living up to expectations. But that doesn't mean it's not useful or important. In fact, I believe observability is poised to become one of the most important technology categories of the coming decade.
The reason: software is changing again.
In the 2010s, observability emerged as an antidote to complexity. In 2026, software engineers face the largest complexity crisis we've ever come up against: AI.
AI is cutting the cost of writing code to zero. As a result, engineering teams are shipping vast numbers of features at breakneck speed. Codebases are getting bigger and bigger.
Meanwhile, vibe-coding platforms have brought software development to the masses. More apps will be built and deployed this year than in all previous years combined.
We're on the verge of an "infinite software crisis".
This raises an uncomfortable question: how will we support, maintain, and operate this ever-growing mountain of software? I'm willing to bet the answer is observability.
Just not the version we have today...