
📘 Observable

Because Staring at Screens is a Legit Skill

The most tactical and honest primer on monitoring & observability ever written. For engineers, SREs, DevOps, and tech execs.


📘 Introduction

Welcome to Observable, a modern and tactical guide to understanding the systems you build, operate, and scale. This book is not just a how-to - it’s a survival guide for the brave engineers navigating today’s complex distributed infrastructures.

We’re living in an era where "everything is green" often means "you’re just not looking hard enough." Systems lie. Dashboards mislead. Alerts go off for the wrong things - or worse, not at all. This book is your flashlight in the chaos, your compass when dashboards point nowhere, and your caffeine-free wake-up call that observability is no longer optional.

This book is for every engineer who’s been paged at 3 AM, for every tech lead who’s had to explain to leadership why uptime isn’t the same as reliability, and for every startup founder who wants to get monitoring right from day one (and save their team from alert fatigue by week two).

We’ll walk through the essentials of telemetry - logs, metrics, traces - but with no fluff. Just the real stuff. You’ll also learn how companies like Google, Netflix, and Meta approach this discipline, and how modern tools (OpenTelemetry, eBPF, AI) are changing the game. We’ll discuss how observability ties into culture, ownership, and reliability. And yes, we’ll talk about dashboards - the good, the bad, and the red blinking nightmares.

Each chapter comes with an executive summary and a TL;DR, so you can skip to what matters, share the important bits in a meeting, or pretend you read it all (we won’t judge). Buckle up - it’s going to be a fun ride.

🧠 Executive Summary

Observability is not a luxury - it's a survival skill. In complex distributed systems, traditional monitoring won't cut it. This book breaks down what matters, what doesn't, and where the future is headed. You’ll learn how to make systems understandable, alerts sane, and dashboards meaningful - all while avoiding the pitfalls that make engineers burn out.

Let’s dive in. Chapter 1 is where the real fun begins.


📍 Chapter 1: Observability – What & Why?

🧠 Executive Summary

Observability is more than a buzzword – it’s the art of understanding what’s happening inside complex systems by examining their outputs (logs, metrics, traces). Unlike traditional monitoring (which asks “Is it broken?”), observability asks “Why is it behaving that way?” In this chapter, we introduce observability, explain why it’s crucial for modern software, and set the stage for the rest of this primer with a dash of humor and real-world anecdotes.

🔍 What is Observability (and How Is It Different from Monitoring)?

In plain terms, observability is a property of a system that allows you to ask questions about its internals solely by examining its external outputs. Think of a car’s dashboard: monitoring tells you the “check engine” light is on, but observability lets you pop the hood (figuratively) and figure out why the light came on.

As one definition puts it, “Observability lets you understand a system from the outside by letting you ask questions... without knowing its inner workings.” Monitoring, on the other hand, is more about tracking predefined metrics or conditions (think known knowns) to detect failures.

So, observability vs. monitoring in a nutshell: Monitoring is like having security cameras (you get alerted when something specific goes wrong), whereas observability is like having a detective on call who can investigate any mystery using clues (data) left behind.

In practice, you need both. Monitoring gives you the early warning (e.g., an alert that latency is high), and observability gives you the tools to dig in and find the root cause among all the “unknown unknowns.”

As Charity Majors (CTO of Honeycomb) famously emphasizes, observability is about coping with unknown unknowns – being able to troubleshoot novel problems that you didn’t explicitly anticipate.

🌐 Why Modern Systems Demand Observability

Today’s systems are distributed, ephemeral, and occasionally chaotic (looking at you, microservices!). A single user request might traverse dozens of microservices, cloud functions, and databases. Traditional monitoring (think simple uptime checks) falls short when you have partial failures, intermittent slowdowns, or complex emergent behaviors.

Modern DevOps and SRE practices from companies like Google and Netflix have taught us that high reliability comes from not just detecting problems, but understanding them deeply. Google’s Site Reliability Engineering (SRE) handbook makes observability a core tenet, noting that you can’t meet reliability goals without insightful telemetry and service-level indicators.

Netflix, operating at massive scale, famously invests heavily in observability tooling (like their in-house telemetry platform Atlas) to ensure they can quickly pinpoint issues across their streaming infrastructure.

Moreover, techniques like chaos engineering (pioneered by Netflix’s Chaos Monkey) practically require observability – you intentionally break things in production to ensure your monitoring/observability can detect and diagnose the chaos. If your system can handle a wild Chaos Monkey randomly terminating instances without waking up the whole on-call team, congratulations – you have good observability and resiliency!

✅ In Summary

Observability is the modern “X-ray vision” for systems. It’s what lets engineers, DevOps, and SREs keep their sanity when production issues strike at 3 AM. Throughout this book, we’ll dive into how to achieve observability in practice, covering logs, metrics, traces, and all the fun stuff (yes, fun – we promise to keep it tactical and humorous).


📍 Chapter 2: The Three Pillars – Logs, Metrics & Traces

🧠 Executive Summary

In this chapter, we break down the classic “three pillars” of observability: logs, metrics, and traces. You’ll learn what each pillar represents, how they complement each other, and why all three are needed to get a 360° view of system health. We’ll keep things punchy (no boring lectures here!) and use analogies – think of logs as the detailed diary entries, metrics as vital signs, and traces as the full play-by-play replay of a complex game.

📜 Logs: The Diary of a System

Logs are the oldest pillar of observability – the text records that applications and systems produce to tell us what happened, and when. Every printf or console.log that developers sprinkle in code ends up as a log somewhere (hopefully a central log system rather than an engineer’s lost SSH session).

Formally, “Logs are machine-generated data from applications, services, and infrastructure” that describe events (errors, requests, state changes, etc.). They’re essentially the diary of the system’s life, often containing rich context.

Why logs matter: When something goes wrong, logs provide the narrative. For example, if an API is returning 500 errors, the logs might show a stack trace or error message at the time of failure. Logs excel at capturing detailed context – the exact error code, a specific userID that triggered a bug, etc. They’re invaluable for debugging (“Let’s grep the logs for any NullPointerException around 2:35 AM”).
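
To make those entries grep-able and pipeline-friendly, many teams emit structured (JSON) logs. Here’s a minimal sketch using only Python’s standard logging module – the field names (user_id, order_id) are purely illustrative:

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object so log tooling can index fields."""
    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        payload.update(getattr(record, "context", {}))  # fields passed via extra=
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Enough context to debug later: who, what, and how slow
logger.error("payment failed", extra={"context": {"user_id": "u-42", "order_id": "o-9001", "latency_ms": 512}})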

Companies like Splunk and Elastic (ELK Stack) built empires on storing and searching logs, because at scale, collecting logs from hundreds of services can turn into “finding a needle in a haystack” without the right tools.

However, logs can be voluminous. In a complex microservices deployment, logs can easily be gigabytes per hour. Managing log volume (and cost) is a challenge – one we tackle later with telemetry pipelines (see Chapter 10). It’s also why other pillars like metrics exist: to summarize the system state without storing every detail. But when you need that detail, nothing beats a good log entry.

📈 Metrics: The Vital Signs

If logs are the diary entries, metrics are the vital signs or scoreboard numbers. A metric is a numeric measurement captured over time – e.g., CPU usage, request latency, number of active users.

Metrics are typically stored in time-series databases and are optimized for real-time querying and long-term trending. One definition puts it nicely: “Metrics are a numerical representation of data measured over a period.” Think of metrics as an at-a-glance health check.

Why metrics matter: Metrics are great for monitoring. They give you the pulse of the system. For instance, you might have metrics for HTTP request rates, error rates, and latency (the famous “golden trio” of RED: Rate, Errors, Duration). Google’s SREs talk about the Four Golden Signals – latency, traffic, errors, saturation – which are all metrics that every system should expose.

If a metric crosses a threshold (say CPU > 90% or error rate spikes), it’s a clue that something’s off, possibly triggering an alert (more on that in Chapter 5).

Metrics are usually aggregated (e.g., average latency over 1 minute) which makes them lightweight compared to logs. They’re perfect for dashboards and for feeding into automated alerting systems. Tools like Prometheus have popularized metric collection with a pull-based model and a powerful query language (PromQL) to slice and dice those time-series.
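
As a sketch of what metric instrumentation looks like in practice, here’s a tiny example using the official prometheus_client Python library (assuming it’s installed); the metric names and route are illustrative:

import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# RED-style metrics: request count (with status label) and request duration
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["route", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds", ["route"])

def handle_checkout():
    start = time.time()
    status = "500" if random.random() < 0.02 else "200"   # simulated outcome
    REQUESTS.labels(route="/checkout", status=status).inc()
    LATENCY.labels(route="/checkout").observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8000)   # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_checkout()
        time.sleep(0.1)

From there, a PromQL query like rate(http_requests_total{status="500"}[5m]) gives you the error rate over the last five minutes – exactly the kind of signal an alert can watch.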

Other companies run custom metric platforms (Netflix’s Atlas, Google’s Monarch, Facebook’s Gorilla) to handle millions of data points per second.

The downside? Metrics lack the granular detail of logs. If an error rate metric tells you “errors jumped to 5%,” you’ll likely need logs or traces to diagnose which errors. That’s why metrics and logs are complementary.

🧵 Traces: The Story of a Request

Traces represent the end-to-end path of a single transaction or request through a system. In modern distributed systems, a user action (like clicking “Buy Now”) might generate a trace that touches dozens of microservices – from the web frontend, to the authentication service, to the payment gateway, and so on.

A trace is composed of spans (each span is a unit of work in a service) that together form a tree or timeline of that request. In simpler terms, a trace is the story of how one request navigated your system.

Why traces matter: Traces shine in diagnosing performance issues and understanding system behavior across service boundaries. While a log might show an error in one service, a trace can show that the error in service C was actually caused by a timeout from service A → B → C. They connect the dots.

Traces rely on context propagation – passing along a trace identifier through all the calls – which we’ll explore in Chapter 7. Tools like Jaeger and Zipkin (open source), or AWS X-Ray and Datadog APM (cloud offerings), collect and visualize traces so you get a nice Gantt chart of spans.

Shoutout to Google’s Dapper (the famous internal tracing system that started it all) which inspired many of these tools. In fact, Jaeger (now a CNCF project) was born at Uber to trace their complex microservices and was later open-sourced.

One gotcha: capturing every trace in a high-traffic system is expensive (just like logs). Often, teams will sample traces (collect maybe 1% of them) to manage overhead. Even with sampling, traces provide tremendous insight when something goes wrong in those spaghetti-microservice call graphs.
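
For the curious, here’s roughly what head-based sampling looks like with the OpenTelemetry Python SDK (covered properly in Chapter 8) – the 1% ratio matches the example above:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Keep ~1% of traces; the decision is made once, at the root span,
# and applies to every child span in the same trace.
provider = TracerProvider(sampler=TraceIdRatioBased(0.01))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("cart.items", 3)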

🔁 Together: One Big Happy Observability Family

Logs, metrics, and traces aren’t isolated silos – the real power is when you use them together. For example, an alert might fire based on a metric (high error rate). You then check a dashboard of metrics to see which component is misbehaving. You dive into logs to find an error message or stack trace, and pull up a trace to see the exact path of a problematic request.

This workflow is common in SRE and DevOps teams. Modern observability platforms (like Datadog, New Relic, or Honeycomb) emphasize correlating these signals – click on a trace and see the logs and metrics for that same timeframe, all in one place.

Each pillar has its strengths: logs for depth, metrics for breadth, traces for context. If you’re missing one of them, it’s like trying to solve a mystery novel with a third of the pages ripped out. Sure, you might infer the villain, but it’s much harder without all the clues.

✅ TL;DR

Use them together, not in isolation. Real observability happens when you connect the dots.


📍 Chapter 3: Instrumentation – Getting the Data Out

🧠 Executive Summary

Instrumentation is how we teach our systems to speak. This chapter covers the techniques and tools for instrumenting code to emit logs, metrics, and traces. We’ll explore manual instrumentation (sprinkling print statements and metrics in code), auto-instrumentation (because who doesn’t like free magic?), and emerging tech like eBPF that can capture telemetry without code changes. Expect practical tips and a few horror stories of “uninstrumented” black boxes.

🔧 What Does it Mean to Instrument?

Instrumentation is the act of adding code (or using tools) to produce telemetry data from your application. Without instrumentation, your beautifully running microservice is essentially a black box – you won’t know what it’s doing internally. By instrumenting, we make the service observable. This could mean emitting a log line when a transaction occurs, updating a metric counter for each request, or starting a trace span for an outbound API call.

In ye olde days, instrumentation was as simple as adding print statements for debugging. Today, it’s a bit more sophisticated (and thankfully, there are libraries!). A key concept here is code visibility: we modify or augment code to expose its inner workings. As a comical analogy, if your code is a burrito, instrumentation is cutting the burrito in half to see the delicious fillings (or the nasty bug, as it may be).

🛠️ Manual vs. Automatic Instrumentation

Manual instrumentation means developers explicitly write code to emit telemetry. For example, you might call a logging function at critical points, or use a metrics library to increment counters. This gives fine-grained control – you know exactly what you’re logging or measuring – but requires developer effort and discipline.

One pitfall: developers often only add instrumentation for known problem areas; anything they didn’t think of remains in the dark (back to those unknown unknowns).

Automatic instrumentation means you let a framework or agent do it for you. Java has agents (attached with the -javaagent JVM flag) that can intercept library calls and automatically trace web requests or DB queries. OpenTelemetry provides auto-instrumentation packages for popular languages to capture common operations without writing code. In containers, sidecar agents can intercept traffic too.

The benefit: quick coverage. The downside: some loss of control and potentially too much data. Think of it as a spy gadget that bugs your app by default – but it may collect more than you actually need.
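
As a sketch, here’s what auto-instrumenting a small Flask service looks like with OpenTelemetry’s instrumentation packages (assuming opentelemetry-instrumentation-flask and -requests are installed; the inventory-service URL is hypothetical):

import requests
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)   # server spans for every incoming HTTP request
RequestsInstrumentor().instrument()       # client spans for every outbound call

@app.route("/checkout")
def checkout():
    # Both the incoming request and this downstream call are traced automatically
    requests.get("http://inventory-service/reserve")
    return "ok"

The opentelemetry-instrument CLI wrapper can bootstrap much of this without touching the code at all.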

📉 Real-World Example

An engineer forgets to log a key error condition. Something breaks in production. No logs. But auto-instrumentation captured the HTTP 500 and a stack trace – a lifesaver. Auto-tools aren't perfect, but they’re better than silence.

🌐 OpenTelemetry and the Rise of Standardized Instrumentation

OpenTelemetry (OTel) is the new industry standard – a vendor-neutral toolkit for traces, metrics, and logs. It grew out of OpenCensus + OpenTracing. With OTel, you write your telemetry once and export it anywhere (Prometheus, Jaeger, Datadog, etc.).

You get standardized APIs for traces, metrics, logs – in every major language. Companies like Google, Microsoft, and AWS contribute heavily. Most observability vendors support receiving OTel-formatted data, making it the “universal adapter” for telemetry.

Want to switch from Jaeger to Honeycomb? Just change the exporter – no code rewrite. Want to send data to two tools? Add a second exporter. OTel makes it that easy.

🐧 eBPF: Zero-Code Instrumentation from the Kernel

eBPF (extended Berkeley Packet Filter) lets you trace your apps without touching the code – by attaching programs to kernel events. It’s like a stealth mode debugger running at OS level.

You can trace every time a file is opened, a network connection is made, or a syscall occurs – across the entire machine. Pixie (by New Relic) uses eBPF to instantly gather telemetry from Kubernetes apps without code changes.

It’s powerful, efficient, and mind-blowing. But it isn’t free magic: it’s Linux-centric, needs elevated privileges and reasonably modern kernels, and comes with a real learning curve.

Meta (Facebook) uses eBPF to monitor CPU usage and syscalls. Cloudflare uses it for network packet inspection. It’s the future – and we cover more in Chapter 9.

💡 Best Practices for Instrumentation

A few rules of thumb: instrument the paths users actually hit (requests, queues, external calls); prefer standards like OpenTelemetry over home-grown wrappers; use auto-instrumentation for breadth and manual spans, metrics, and logs for business-critical detail; keep telemetry out of tight loops; and never emit secrets or personal data into your logs.

✅ TL;DR

Instrumentation turns black boxes into observable services. Start with auto-instrumentation, add manual detail where it matters, and let standards (OTel) and the kernel (eBPF) do the heavy lifting.

Next up: how to use that data to define reliability with SLIs and SLOs.


📍 Chapter 4: SLIs and SLOs – Defining Reliability

🧠 Executive Summary

How do you know if your system is reliable enough? Enter SLIs (Service Level Indicators) and SLOs (Service Level Objectives). In this chapter, we demystify these acronyms popularized by Google’s SRE culture. An SLI is basically a measurement (like uptime, latency) that tells you how your service is performing. An SLO is the target you aim for (e.g., 99.9% uptime). We’ll explain why these are vital for focusing your observability efforts on what really matters to users (hint: not every metric is an SLI). Plus, we’ll throw in a bit of humor about overly ambitious 100% SLOs and the legendary “error budget” concept.

📏 What’s an SLI?

A Service Level Indicator (SLI) is a quantifiable measure of some aspect of the service’s performance or reliability. Common SLIs include availability, latency, error rate, and throughput. But the best SLIs reflect what users actually care about.

For a pizza delivery app, a great SLI could be “percent of pizzas delivered within 30 minutes.” For an API, maybe “percent of successful HTTP 200 responses under 500ms.”

Internal metrics like CPU usage? Not SLIs. Users don’t care unless it causes slowdowns. But request success rates? They feel that pain immediately.

Start simple with the Four Golden Signals: latency, traffic, errors, saturation. Then refine: “95th percentile checkout latency under 2s” is more meaningful than generic “latency.”
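
To make that concrete, here’s a toy calculation of two SLIs from a handful of made-up request records:

# Illustrative request records: status code and latency
requests = [
    {"status": 200, "latency_ms": 180}, {"status": 200, "latency_ms": 420},
    {"status": 500, "latency_ms": 90},  {"status": 200, "latency_ms": 1500},
    {"status": 200, "latency_ms": 310}, {"status": 200, "latency_ms": 260},
]

# SLI #1: fraction of "good" requests (successful AND under 500ms)
good = sum(1 for r in requests if r["status"] < 500 and r["latency_ms"] < 500)
availability_sli = good / len(requests)

# SLI #2: crude nearest-rank 95th percentile latency
latencies = sorted(r["latency_ms"] for r in requests)
p95_latency = latencies[int(0.95 * (len(latencies) - 1))]

print(f"good-request SLI: {availability_sli:.1%}, p95 latency: {p95_latency} ms")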

🎯 What’s an SLO?

A Service Level Objective (SLO) is a target for your SLI. It’s the threshold you aim to meet – like “99.9% uptime this quarter” or “99% of requests succeed daily.”

It’s how you define success for reliability. Importantly, it also defines acceptable failure: the inverse of your SLO is your error budget. That’s the wiggle room for bad days – a super useful concept.

For example: a 99.9% availability SLO over a 30-day window leaves an error budget of roughly 43 minutes of downtime; a 99% success-rate SLO means one failed request in every hundred is still “within budget.”

Why not just aim for 100%? Because 100% is a fantasy. It’s unachievable, expensive, and makes engineers miserable. The difference between 100% and your SLO is the allowance for reality.
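
The arithmetic behind that error budget is refreshingly simple – here’s the 99.9% example as a few lines of Python:

slo = 0.999                      # 99.9% availability target
window_minutes = 30 * 24 * 60    # a 30-day SLO window = 43,200 minutes

error_budget_minutes = (1 - slo) * window_minutes
print(f"allowed downtime: {error_budget_minutes:.1f} minutes per window")   # ~43.2

# Spend it wisely: a single 10-minute incident eats almost a quarter of the budget
downtime_so_far = 10
remaining = error_budget_minutes - downtime_so_far
print(f"budget remaining: {remaining:.1f} minutes ({remaining / error_budget_minutes:.0%})")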

📈 Connecting SLOs to Monitoring and Alerts

SLOs give you a north star. Once you define them, point your dashboards, alerts, and incident response toward those objectives. If your error budget is getting chewed up fast, that’s when an alert should fire.

Modern observability tools support this directly. Grafana has SLO panels. SoundCloud has open-sourced tools to track burn rate (e.g., “at this rate, we’ll breach our SLO in 4 hours!”).

This approach is way better than alerting on every little spike. It aligns engineering work with user impact.

💼 SLOs for Execs and Customers (SLAs)

SLOs are usually internal. But when shared with customers, they become SLAs – Service Level Agreements. An SLA might say “99.9% uptime” and promise service credits if breached. Internally, though, you’ll want to set your SLO tighter than your SLA to preserve a buffer.

Executives love SLOs too – they translate reliability into numbers that tell a story. “We missed our 99.9% target by 0.05%” sounds better than “yeah, some stuff broke.”

🔁 Recap and Shoutouts

This approach came from Google’s SRE practice and has since been adopted by Netflix, Facebook, and others. If you’re just starting: pick 2-3 key SLIs, set reasonable SLOs, and observe. You’ll learn fast what matters.

Most importantly: not every metric needs an SLO. Don’t go full spreadsheet warrior. Choose the metrics that define reliability for your users, and focus there.

Up next: alerts! How do you build them without destroying your team’s sleep schedule? Stay tuned.


📍 Chapter 5: Alerting – Wake Up, It’s Broken!

🧠 Executive Summary

Alerting is the art of deciding when to wake up the humans. In this chapter, we discuss how to turn those metrics and SLOs into actionable alerts. We’ll cover best practices for setting alerts (spoiler: not too sensitive, not too late), the importance of runbooks, and tools of the trade (PagerDuty, etc.). Expect a few humorous takes on 3 AM pages and the infamous “alert fatigue” syndrome. By the end, you’ll know how to make sure your alerts are the helpful kind (and not the boy-who-cried-wolf kind).

🚨 The Purpose of Alerts

Alerts exist to grab someone’s attention when something is wrong and needs human intervention. The ideal alert triggers when your system is in a bad state that won’t auto-recover, and someone needs to fix it ASAP to avoid user impact.

A common mantra is “pages should be actionable.” If nothing can be done at 3 AM, it shouldn’t wake someone. Alerting on 90% CPU? Maybe noise. Alerting on "service unresponsive for 5 minutes"? That’s real.

😫 Alert Fatigue is Real

Alert fatigue happens when teams get bombarded with too many noisy or false alerts. It leads to burnout and desensitization – engineers start ignoring alerts entirely.

One engineer set their page ringtone to a clown honk to add humor to the misery. After 50 non-actionable pages in a week, even that honk sounded like despair.

Google’s SRE teams recommend: if you wouldn’t act on it at 2 AM, don’t page it at 2 AM. Log it. Triage it during business hours.

📏 SLO-Based Alerts and What Works

Best-practice alerting starts with your SLOs. For example, alert if 5% of requests have errors in the last 10 minutes – that burn rate will crush your 99% target if left unchecked.
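
Here’s the burn-rate idea as a sketch in code (the thresholds are illustrative – tune them to your own SLO window):

def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the error budget is being consumed.
    1.0 = errors arriving exactly at the rate the budget allows;
    higher means the budget runs out before the SLO window ends."""
    return error_ratio / (1.0 - slo)

# The example from above: 5% of requests failing against a 99% SLO
rate = burn_rate(error_ratio=0.05, slo=0.99)   # -> 5.0

if rate >= 2:                                   # burning at 2x or faster: wake a human
    print(f"PAGE on-call: burn rate {rate:.1f}x")
elif rate >= 1:                                 # slow burn: file a ticket for daytime
    print("open a ticket, investigate during business hours")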

📟 On-Call and Incident Response

Alerts are only useful if someone’s on-call to respond. Rotations ensure coverage, and good alerting protects engineers from burnout.

Every alert should come with a runbook: what does it mean, what’s the likely cause, what should the responder try?

Tools: PagerDuty, Opsgenie, VictorOps, even SMS if you’re scrappy. What matters most is clear escalation paths, and making sure alerts get acknowledged and resolved.

Larger orgs have layered on-call: dev teams on the front line, SREs or platform teams for deeper issues.

🤖 Automate and Evolve

Many alerts can trigger automated actions. Restart a pod. Reroute traffic. If automation works, don’t page. If it fails, then page.

AI is slowly entering alerting – correlating logs, traces, and metrics to tell you, “Hey, deploy X caused Y.” Still early days, but tools like Dynatrace and Moogsoft are experimenting here.

Make alert review part of your postmortems. Did we get alerted fast enough? Was it actionable? Did we miss anything? Iteration isn’t just for features – it applies to observability too.

🎯 The Alerting Golden Rule

Your team must trust the pager. When it goes off, it better matter. If you do that, engineers will respond. If not, you’ll hear the clown honk of despair all over again.

Next up: dashboards. Time to make your metrics look good and make sense.


📊 Chapter 6: Dashboards – Visualizing the Chaos

🧠 Executive Summary

Dashboards are the face of your observability data – the graphs, charts, and gauges that let you and your execs glance at the system and (hopefully) say “all is well” or “oh no, something’s on fire.” In this chapter, we talk about building effective dashboards. We’ll highlight tools like Grafana and Kibana, give tips on avoiding the “wall of charts” syndrome, and ensure you’re balancing between too much data and too little. And yes, we’ll sprinkle some humor about that one dashboard with 50 red blinking panels that no one knows how to read.

📺 The Role of Dashboards

A dashboard is a curated set of visualizations showing the state of one or more systems. It’s your mission control interface. Dashboards are used in NOCs, in exec meetings, and by on-call engineers to drill down into issues. Good dashboards point to problems fast. Bad ones? Pretty, but useless.

Think story-first: a microservice dashboard might include request rate, error rate, latency (SLIs/SLOs), CPU/memory, and downstream call latency. When an alert fires, your dashboard should confirm and narrow it down fast.

📈 Grafana and Friends

The Grafana LGTM stack (Loki, Grafana, Tempo, Mimir) lets you build unified dashboards with logs, metrics, and traces all in one place.

📐 Designing Effective Dashboards

Keep it focused: lead with the signals users feel (latency, errors, saturation), group related panels, use consistent colors and thresholds, and make every chart answer a question someone actually asks during an incident.

A humorous example: One company had a “Game of Thrones” dashboard. Each service had a house sigil that burst into flames (via GIF) when errors spiked. Fun? Yes. Useful? Not so much when everything’s on fire and dragons block the metrics. Prioritize clarity over novelty.

📤 Sharing and Evolving Dashboards

Dashboards evolve as your systems do. Deprecate old charts. Add new ones for new features or risks. And build dashboards for different users: a high-level health view for execs, an SLO-focused view for on-call responders, and detailed per-service drill-downs for the teams that own them.

Grafana and others support embedding dashboards into docs, wikis, and even status pages. Grafana’s “Play” site is full of examples – including a playable text adventure game built with observability data. Yes, really.

📌 TL;DR

Dashboards make telemetry human-readable. They’re great for real-time ops and long-term trend spotting. After each incident, ask: did we have a chart that would've helped? If not, make one.

Next up: distributed tracing – the Google Maps of your infrastructure. Let’s trace things out.


🔍 Chapter 7: Distributed Tracing – Following the Breadcrumbs

🧠 Executive Summary

Time to channel our inner Sherlock Holmes. Distributed tracing is all about following a request as it hops through multiple services – piecing together the breadcrumb trail left behind. In this chapter, we explain how tracing works, why context propagation is the secret sauce, and how tools like Jaeger, Zipkin, and Tempo help reconstruct the story of a transaction. Expect analogies (like finding Wally/Waldo in a microservice crowd) and a few industry war stories.

🚦 The Need for Tracing in Microservices

In monoliths, debugging was a matter of reading a call stack. In distributed systems? That stack is scattered across services and machines. Tracing reassembles that stack – it’s your distributed call graph.

Imagine a request: Service A → B → (C + D in parallel) → return. Tracing shows each step’s timing (as spans) and links them with a shared trace ID. It’s like a relay race where every baton pass is logged with timestamps.

🔗 How Tracing Works: Context Propagation

At the heart of tracing is context propagation. When Service A starts a trace, it sends metadata (trace ID, span ID) with downstream calls. Services B and C continue the trace using that info.

Headers like traceparent (W3C) or X-B3-TraceId (Zipkin) keep the context alive across calls. Without this, your trace is fragmented.
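
Here’s roughly what propagation looks like with OpenTelemetry’s Python API (assuming the SDK is configured as in Chapter 8; service names and URLs are made up):

import requests
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("service-a")

# Service A: start a span and inject its context into the outgoing headers
with tracer.start_as_current_span("call-service-b"):
    headers = {}
    inject(headers)                                  # adds the W3C traceparent header
    requests.get("http://service-b/work", headers=headers)

# Service B: extract the incoming context so its spans join the same trace
def handle_request(incoming_headers: dict):
    ctx = extract(incoming_headers)
    with trace.get_tracer("service-b").start_as_current_span("do-work", context=ctx):
        ...  # actual work happens here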

🧰 Tools and Implementations

Shoutout to Dapper (Google's internal tracing) – it pioneered tracing at internet scale. Uber built Jaeger when their microservices exploded in number, and open-sourced it to everyone’s benefit.

📉 When to Trace (and When Not To)

Tracing is especially powerful for latency analysis and dependency mapping, but overkill for things like capacity planning. Still useful in monoliths, too – and even browser-side traces!

🧪 Tracing in Practice

Example: You get a "checkout is slow" alert. You look up traces tagged “checkout”. One shows 3s total. It breaks down like this: a few hundred milliseconds across the frontend and checkout service, and well over two seconds spent waiting on a single call to the payment gateway.

Without tracing, you’d guess. With tracing, you know. Trace IDs in logs let you pivot to full logs. Metrics might show “high latency,” but traces explain why.

🕵️‍♀️ The Lighter Side of Tracing

Traces often reveal awkward truths: the “deprecated” service that’s still in the hot path, the same downstream API called three times per request, the retry storm quietly turning a blip into an outage.

It’s like debugging a distributed conspiracy. And tracing exposes it all.

🎯 Wrapping Up

Tracing gives you visibility and precision in complex systems. It’s not free – but it pays off in faster debugging, better root cause analysis, and fewer “no idea what broke” moments.

Thanks to OpenTelemetry, setting up tracing is easier than ever. So go trace something! Next, we’ll look at how OpenTelemetry ties together logs, metrics, and traces in Chapter 8.


🛰️ Chapter 8: OpenTelemetry – A New Hope for Instrumentation

🧠 Executive Summary

OpenTelemetry (OTel) is the future (and present) of observability data collection. In this chapter, we give a crash course on what OpenTelemetry is, why it emerged, and how it simplifies the collection of logs, metrics, and traces. We’ll discuss the components (API, SDK, Collector), and how it enables a “write once, use anywhere” approach for telemetry. Think of OpenTelemetry as the universal translator that brings harmony to the observability galaxy – no more proprietary agents for each vendor.

🚀 Why OpenTelemetry?

Before OTel, telemetry was a mess of fragmented tools: StatsD, Prometheus clients, OpenTracing, various log libraries. Switching tools meant painful re-instrumentation. OpenTelemetry fixed this by unifying standards across all three pillars: traces, metrics, and logs.

With OTel, you instrument once and can export to any supported backend: Prometheus, Jaeger, Datadog, GCP, etc.

🔧 Components of OpenTelemetry

The main pieces: the API (what your code calls to create spans, metrics, and log records), the SDK (the per-language implementation that samples, batches, and processes that data), exporters (which translate telemetry into whatever format a backend expects), and the Collector (a standalone pipeline that can receive, transform, and fan out telemetry to one or more destinations).

This decouples app logic from the observability backend.

🚀 Getting Started with OTel (Python Example)


from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Set up the SDK: a provider that batches spans and ships them via OTLP/HTTP
provider = TracerProvider()
span_exporter = OTLPSpanExporter(endpoint="http://otel-collector:4318/v1/traces")
provider.add_span_processor(BatchSpanProcessor(span_exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Now instrument a request handler (do_some_work and current_user are your app code)
with tracer.start_as_current_span("handle_request") as span:
    do_some_work()
    span.set_attribute("user.id", current_user)

With auto-instrumentation, even less code is needed. Libraries for Flask, Django, etc., can auto-inject spans for HTTP requests and DB queries.

⚙️ eBPF Meets OpenTelemetry

eBPF allows kernel-level telemetry without modifying your code. Tools like Pixie can use eBPF to collect data and export it as OpenTelemetry spans and metrics. You get observability for legacy or “black box” apps with zero code changes.

🌐 Adoption and Ecosystem

It’s the lingua franca of observability now. Even logging, the slowest adopter, is becoming part of the OTel standard thanks to the Collector’s growing capabilities.

⚠️ Challenges

Adopting OTel isn’t plug-and-play everywhere. You’ll deal with: uneven maturity across languages (traces are stable, logs are newer), SDK and Collector configuration to own and operate, evolving semantic conventions, and the occasional gap where a vendor agent or manual instrumentation still does it better.

💡 Summing Up

OpenTelemetry has unified the observability stack with a standard, extensible model. It reduces lock-in, increases interoperability, and makes your telemetry future-proof. Next up, we dive deeper into the wizardry of eBPF in Chapter 9 – and how it’s changing everything underneath.


🧬 Chapter 9: eBPF – Observing from the Kernel Up

🧠 Executive Summary

We touched on eBPF earlier, but here we dive deep. eBPF is like having a microscope inside the operating system – allowing you to collect metrics, traces, and logs from kernel space without modifying your applications. This chapter explains what eBPF is, how it’s being used for things like low-overhead profiling and network tracing, and why it’s considered a game-changer for observability (and security). We’ll highlight tools like BPFtrace, Cilium, and how companies (from Meta to Cloudflare) use eBPF to peek under the hood at scale. And of course, we’ll keep it fun – perhaps likening eBPF to a magical spy that lives in the Linux kernel.

🧪 eBPF in a Nutshell

Extended Berkeley Packet Filter (eBPF) is a technology in the Linux kernel that lets you run custom programs in kernel space, safely. eBPF programs can hook into system calls, network events, tracepoints, and more. When those events occur, the eBPF code runs instantly – gathering data or modifying behavior (in observability, we usually just gather).

Why it's awesome: it runs with very low overhead, the kernel’s verifier keeps programs safe (no crashing the box), it needs zero changes to application code, and it sees everything on the host – every process, every syscall, every packet.

🔍 Real-world Uses of eBPF

Imagine profiling MySQL query latency by hooking into the MySQL process without installing anything inside it. That’s eBPF magic.
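
To make that less abstract, here’s the classic bcc “hello world” in Python – nothing fancy, it just prints a line every time any process on the machine calls execve(). It assumes the bcc toolkit is installed and you’re running as root:

from bcc import BPF

# A tiny eBPF program, compiled and loaded into the kernel at runtime
prog = """
int hello(void *ctx) {
    bpf_trace_printk("execve called\\n");
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="hello")
print("tracing execve()... Ctrl-C to stop")
b.trace_print()   # stream kernel trace output to stdout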

🛠️ Tools and Frameworks

The ecosystem is rich: bcc and bpftrace for ad-hoc kernel tracing and one-liners, Cilium (with its Hubble UI) for eBPF-powered networking and network observability in Kubernetes, and Pixie for instant, no-instrumentation telemetry from Kubernetes workloads.

⚠️ Limitations and Considerations

eBPF is Linux-centric (Windows support is still young), needs reasonably modern kernels and elevated privileges to load programs, and the verifier constrains what a probe can do. Writing raw eBPF is a specialist skill – most teams consume it through the tools above rather than hand-rolling probes.

🤖 The Future of eBPF in Observability

eBPF is now supported by the eBPF Foundation under the Linux Foundation, with backing from Facebook, Netflix, Google, and others. It's increasingly embedded in observability stacks – sometimes without users even knowing.

Imagine an AI detecting an anomaly, and triggering an eBPF probe live in production to collect just the data it needs. That’s where we’re heading: proactive, real-time, on-demand observability with minimal performance hit.

🧩 Wrapping Up

eBPF is a powerful tool that exposes what’s happening inside your systems – down to syscalls and memory allocations – with almost no friction. It complements metrics, logs, and traces by covering the blind spots and enabling deep kernel-level insights.

Coming up next: Chapter 10 explores how all these tools and signals come together in cohesive observability stacks – from open source to vendor ecosystems.


🔧 Chapter 10: Open Source Observability Stack

🧠 Executive Summary

Why buy the cow when you can have the milk for free? In this chapter, we explore the popular open source tools that can be assembled into a modern observability stack. We’re talking Prometheus for metrics, Grafana for dashboards, Loki for logs, Tempo/Jaeger for traces, Mimir/Cortex for scalable metrics storage, and more. We’ll show how these pieces fit together (often dubbed the “LGTM stack”) and discuss pros/cons of rolling your own vs. commercial options.

📈 Prometheus – The Metrics Powerhouse

Prometheus is a CNCF darling. It scrapes metrics from services, stores them in a time-series database, and makes them queryable with PromQL. Its exporter ecosystem is massive, and it integrates tightly with Alertmanager for triggering alerts.

📊 Grafana – Visualization Hub

Grafana isn’t just pretty dashboards – it now includes alerting, data source unification, and plugins. Grafana Labs extended this into a full stack: Loki (logs), Tempo (traces), Mimir (metrics), forming the “LGTM stack.” Add Prometheus as a source and you’ve got a powerful open solution.

📜 Loki – Logs on a Diet

Loki indexes logs by labels instead of full text, making it cheap and fast for scoped queries. It’s a favorite for Kubernetes environments due to its label-centric design. Uses LogQL for querying.

🧵 Tempo and Jaeger – Tracing for the Masses

Jaeger (the CNCF project born at Uber) provides trace collection, storage, and a search UI; Tempo (from Grafana Labs) aims for cheap, high-volume trace storage on object storage, queried from Grafana. Both ingest OpenTelemetry data and help visualize request flows across services.

🧩 Tying It Together: A Sample Stack

A common setup: an OpenTelemetry Collector receives telemetry from your services, Prometheus scrapes and stores the metrics, Loki keeps the logs, Tempo (or Jaeger) stores the traces, Alertmanager handles paging, and Grafana sits on top as the single pane of glass.

⚖️ Pros and Cons of OSS Stack

✅ Pros: no license fees, full control over data and retention, no vendor lock-in, and a huge community behind every component. ❌ Cons: you own the setup, scaling, upgrades, and the 3 AM storage incidents – the “free” stack is paid for in engineering time.

🏅 Shoutouts

Prometheus started at SoundCloud before graduating to the CNCF, Grafana Labs stewards Loki, Tempo, and Mimir, and Jaeger came out of Uber – much of this stack exists because engineering teams open-sourced the tools they built to save themselves.

🔚 Conclusion

OSS observability gives you power and freedom – at the cost of setup and maintenance. If you love building infrastructure, it’s a dream playground. If you just want insights fast, you may want to peek at the SaaS world. And that’s exactly where we’re headed next: Chapter 11, where we explore enterprise-grade observability platforms and the trade-offs of going fully managed.


🏢 Chapter 11: Enterprise Tools & SaaS Platforms

🧠 Executive Summary

Not everyone wants to maintain their own observability stack – hence the rise of commercial solutions. In this chapter, we explore the offerings of Datadog, New Relic, Splunk, Dynatrace, Elastic, Honeycomb, and others. These platforms provide all-in-one observability with hosted storage, built-in analytics, and ease of use – but often at a premium. We'll break down where they shine, where they cost you, and how enterprises blend SaaS and OSS to get the best of both worlds.

✨ The All-in-One Promise

Datadog, Dynatrace, New Relic, and peers sell convenience: one agent for metrics, logs, traces, and infra. Key features: a unified agent with hundreds of integrations, APM and distributed tracing, log management, out-of-the-box dashboards and alerting, and built-in anomaly detection.

Honeycomb focuses on high-cardinality event data, letting users "BubbleUp" to discover unknown issues across dimensions. Splunk and Elastic also offer log + APM + metrics setups (self-hosted or cloud). For teams needing results fast, these solutions are attractive.

🏢 Enterprise Features

Beyond core telemetry, the big platforms add what large organizations demand: SSO and role-based access control, audit trails, compliance certifications, long-term retention options, and 24/7 vendor support with SLAs.

Example: Netflix built its own tools; others may not have those resources, making SaaS more appealing despite the cost.

💸 Cost Considerations

SaaS pricing is often volume-based (hosts, GB logs, trace spans). It adds up fast. We've heard stories like "our Datadog bill surpassed our AWS bill."

Vendors are adapting – Grafana Cloud, New Relic, and others now experiment with more flexible pricing models.

⚖️ SaaS vs OSS – When to Choose What

Lean SaaS when the team is small, time-to-value matters, and you’d rather pay than operate; lean OSS when volumes are huge, cost or data control dominates, and you have platform engineers to run it. In practice, most organizations land somewhere in between.

Common hybrid examples: Splunk for logs, Prometheus for metrics, SaaS APM for cloud apps. Flexibility matters more than brand loyalty.

☁️ Cloud-Native Offerings

The hyperscalers bundle their own observability: AWS has CloudWatch and X-Ray, Google Cloud has its operations suite (formerly Stackdriver), and Azure has Monitor and Application Insights.

These services integrate tightly with their platforms but can lock you in. Many orgs use these for infra telemetry and SaaS/OSS for deeper visibility.

📈 Industry Trends

The big shifts: vendors racing to ingest OpenTelemetry natively, pricing experiments beyond pure volume-based billing, AI-assisted analysis creeping into every platform, and consolidation as suites try to cover logs, metrics, traces, and security in one place.

🎯 Shoutouts & Examples

Honeycomb gets credit for pushing high-cardinality, event-based debugging into the mainstream, and the “Datadog bill surpassed the AWS bill” story from earlier in this chapter keeps resurfacing in cost conversations – budget for telemetry the way you budget for compute.

🔚 Bottom Line

Enterprise observability platforms deliver value quickly but come with trade-offs in cost and flexibility. Knowing the strengths of each platform helps teams avoid lock-in and optimize their observability spend. And if you understand the core concepts (SLIs, tracing, logs, metrics), you’ll thrive no matter the tool.

Next up: Chapter 12 – Observability Culture – where we shift from tools to the human side of the equation.


🧠 Chapter 12: Culture and Ownership – You Build It, You Watch It

🔍 Executive Summary

Observability is not just tooling – it’s culture. This chapter explores how mindsets, practices, and organizational design impact your ability to detect, debug, and recover from failures. We cover “you build it, you run it”, blameless postmortems, embedding observability into daily workflows, and how to reward those who champion reliability. Remember: metrics mean nothing if nobody is looking at them. The right culture makes observability actionable.

🚧 You Build It, You Run It

This DevOps principle emphasizes that the team who builds a system should operate it too. When developers know they’ll be on-call, they naturally care more about observability. At Netflix, full-cycle engineers handle everything from coding to operations. This drives better logs, metrics, and alerts.

Compare that to the legacy model: devs ship code, ops catches the fallout. That creates gaps and delays. Closing that loop reduces MTTR and improves system understanding.

🧯 Blameless Postmortems

Every incident is a learning opportunity. Postmortems should be: blameless (focus on systems and processes, not people), written down and shared widely, honest about the timeline, and closed out with concrete action items – including any observability gaps that slowed the response.

Netflix’s chaos engineering relies on this trust – engineers explore failure without fear. Observability makes that possible.

📆 Making Observability a Habit

Make telemetry part of the everyday workflow: put instrumentation and dashboards in the definition of done, glance at the key dashboards in standup, review SLOs during planning, and run the occasional game day so the first time someone opens the trace view isn’t during a real outage.

🤝 Ownership and Cross-Functional Teams

SREs don’t take ownership for developers – they co-own. At Google, SLOs and error budgets are a shared responsibility. Whether or not you have formal SREs, you can: define SLOs together with the teams that build the services, share the on-call rotation, review error-budget burn in planning, and treat reliability work as roadmap work rather than a side quest.

When error budgets matter, teams naturally invest in better instrumentation and faster recovery tools.

🔧 DevOps and Observability

Observability is the glue of DevOps. It connects devs and ops through shared data. Without logs, metrics, and traces, feedback loops break.

Imagine a crash with no logs. Ops is blind. Devs don’t learn. But with observability, everyone sees the same timeline and can respond quickly. It enables true collaboration.

🌟 Recognizing Observability Heroes

In every org, there’s someone who: builds the dashboards everyone ends up using, tames the noisy alerts nobody else will touch, writes the runbooks, and keeps asking “how would we debug this at 3 AM?” in design reviews.

Support and amplify them. Then spread that knowledge so every team becomes self-sufficient.

🚨 Incident Mindset

Incidents are high-stress. Observability tools should reduce panic, not add to it. Tips: link a runbook and a relevant dashboard from every alert, pre-build the “incident view” you’ll want at 3 AM, keep roles clear (incident commander, comms, operators), and practice with drills before the real thing.

Fear-free teams build better systems.

📢 Final Thoughts

Culture amplifies technology. A team that owns their systems, learns from mistakes, and treats observability as everyone’s job will outperform any team with fancy dashboards but no buy-in. As the Google SRE book says: “Hope is not a strategy.” Prepare. Measure. Share.

Our final chapter explores where observability is heading – automation, AI, predictive diagnostics, and beyond.


🔮 Chapter 13: The Future – AI and Observability

🚀 Executive Summary

In our final chapter, we look ahead. Observability is going beyond charts and alerts - it’s becoming intelligent. From AI-driven anomaly detection to self-healing systems and conversational UIs, observability is evolving fast. We’ll touch on trends like AIOps, ML observability, privacy challenges, and how humans and machines will collaborate in future incidents. No flying cars, but maybe flying pagers that fix themselves.

🤖 AI-Driven Anomaly Detection

Modern observability means analyzing huge volumes of telemetry. Humans can’t keep up - so AI steps in. Examples: Datadog’s Watchdog and Dynatrace’s Davis flag anomalies automatically, and many teams run their own statistical or ML-based outlier detection on latency and error-rate series.

These systems aim to reduce noise and spot issues early - like when a specific customer’s experience degrades before the averages show it. Still, false positives are a challenge, so human feedback is key for tuning the models.
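
You don’t need a vendor to try the basic idea. Here’s a toy rolling-baseline detector that flags latency samples more than three standard deviations from recent history – real products are far more sophisticated, but the principle is the same:

from collections import deque
import statistics

class RollingAnomalyDetector:
    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.samples = deque(maxlen=window)   # recent baseline
        self.threshold = threshold            # how many sigmas count as "weird"

    def observe(self, value: float) -> bool:
        """Return True if value deviates sharply from the recent baseline."""
        anomalous = False
        if len(self.samples) >= 10:           # need some history first
            mean = statistics.fmean(self.samples)
            stdev = statistics.pstdev(self.samples) or 1e-9
            anomalous = abs(value - mean) / stdev > self.threshold
        self.samples.append(value)
        return anomalous

detector = RollingAnomalyDetector()
for latency_ms in [120, 118, 125, 122, 119, 121, 123, 117, 124, 120, 480]:
    if detector.observe(latency_ms):
        print(f"anomaly: {latency_ms} ms")    # fires on the 480 ms spike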

🛠️ Automated Diagnostics and Remediation

AI can go beyond alerts. It can: correlate a regression with the deploy that caused it, group related alerts into one incident, suggest a probable root cause, and kick off safe automated remediations like rollbacks or restarts.

This isn’t sci-fi - systems like K8s do auto-restarts already. The next leap is intent-aware automation that understands symptoms and can act intelligently. Think: your observability agent as a junior SRE with fast reflexes.
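
A heavily simplified sketch of that idea, using the official Kubernetes Python client (pod name, namespace, and threshold are all hypothetical, and real automation needs far more guardrails):

from kubernetes import client, config

ERROR_RATE_THRESHOLD = 0.05   # 5% errors: the point where we stop just watching

def remediate(pod_name: str, namespace: str, error_rate: float) -> None:
    if error_rate < ERROR_RATE_THRESHOLD:
        return                                   # healthy enough - do nothing
    config.load_incluster_config()               # we're running inside the cluster
    v1 = client.CoreV1Api()
    # Delete the pod and let its Deployment recreate a fresh replica
    v1.delete_namespaced_pod(name=pod_name, namespace=namespace)
    print(f"restarted {pod_name}; escalate to a human if the error rate stays high")

remediate("checkout-7d9f8-abcde", "prod", error_rate=0.08)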

📈 Observability for AI Systems

We’re not just using AI - we’re observing it too. ML Observability is a growing field. Monitoring: data and prediction drift, feature quality, model latency and throughput, and fairness or bias in outcomes.

Startups like WhyLabs and Arize AI tackle this, but many organizations adapt existing observability tools to track model behavior like traditional services. The key is: if it impacts production, it needs observability.

💬 Conversational Interfaces

Imagine asking your dashboard: “What changed in the last 30 minutes?” and getting a reply like:

“Traffic increased 12%, latency in service X rose to 400ms, deploy to version v12.7 happened 17 minutes ago. Correlated spike in cache misses.”

That’s the future of observability UX - powered by large language models and structured telemetry. It’s already emerging via plugins, bots, and AI assistants like Copilot suggesting trace spans in code.

🌐 Edge and Client-Side Observability

Real user monitoring (RUM), IoT telemetry, and edge observability are expanding. Collecting insights from browsers, mobile apps, and devices at the edge means: seeing what users actually experience (not just what the servers report), correlating front-end sessions with backend traces, and coping with enormous, noisy data volumes from millions of endpoints.

Expect AI to assist in reducing “log-overload” and summarizing signals by user impact.

🔐 Privacy and Ethics

Telemetry data can contain sensitive info - user IDs, locations, actions. As AI ingests and acts on this, it must remain compliant: scrub or hash PII at collection time, enforce retention limits and access controls, and respect regulations like GDPR and CCPA.

The observability future must be ethical. Privacy engineering and compliance automation will become part of every observability strategy.

👩‍💻 The Human Remains Essential

AI helps, but humans define objectives, set SLOs, and interpret business impact. Future observability will be a partnership: machines watch, correlate, and summarize; humans decide what matters, what “reliable enough” means, and when to act.

Roles like “automation curator” may emerge - tuning AI systems and encoding learnings from past incidents to prevent regressions.

🏁 Final Thoughts

We’ve come a long way - from server uptime checks to intelligent, context-aware telemetry across services, users, and even AI models themselves. The future is more automated, more distributed, and more human-centered.

Whether you’re building dashboards or training anomaly models, remember: observability isn’t just about knowing what’s wrong - it’s about empowering people to fix it fast. Stay curious. Stay instrumented. And may your alerts always be actionable.

💬 Thanks for reading. Follow @marcoaguero on LinkedIn and stay tuned for the next evolution of observability thinking. Happy debugging!