Observability Is Not Logging


When something breaks in production, there is a moment — usually around the third minute of an incident — where the nature of your observability setup becomes very clear. Either you can form a hypothesis about what’s wrong and go find evidence for or against it, or you’re scrolling through log lines hoping something catches your eye.

Most teams are in the second camp. Most teams think they’re in the first.

The confusion comes from conflating logging with observability. They’re related. Logs are one input into an observable system. But a pile of logs is not an observable system, for the same reason that a pile of lumber is not a house. The raw material is there. The structure that makes it useful is not.

This matters more than it used to, because the systems we’re operating have gotten more complex faster than our ability to understand them. A service that makes synchronous calls to three other services, publishes events to a queue, writes to two databases, and calls three external APIs can fail in more ways than you can enumerate in advance. The only way to debug it reliably is to be able to ask arbitrary questions about its behavior at runtime and get answers. That’s observability. Logs alone can’t do it.

What observability actually means

The formal definition comes from control theory: a system is observable if you can determine its internal state from its external outputs. Applied to software: your system is observable if, when something unexpected happens, you can figure out what happened and why using the data the system produces — without having to deploy new code to add more instrumentation.

That last clause is the important one. If your debugging process involves adding log statements, deploying, waiting for the issue to reproduce, and then reading the new logs — you do not have observability. You have a system that requires modification to understand. That’s a fundamentally different thing.

The three pillars — logs, metrics, and traces — get talked about a lot, often as if having all three means you have observability. You don’t, necessarily. You can have all three and still be flying blind if they’re not connected, not queryable, and not structured around your actual failure modes.

Logs record discrete events: a request came in, a database query ran, an error occurred. They’re the narrative of what happened. Their weakness is that they’re high-volume and, by default, unstructured, which makes them expensive to store and hard to query systematically.

Metrics record aggregated measurements over time: request rate, error rate, latency percentiles, queue depth, memory usage. They’re excellent for alerting and dashboards. Their weakness is that aggregation destroys the detail you need for debugging. A spike in p99 latency tells you something is slow. It doesn’t tell you which requests, for which users, triggered by which downstream call.

Traces record the path of a request through your system — every service it touched, every database call it made, every external API it hit, and how long each step took. They’re the thing that connects the other two. They answer the question “for this specific request that took four seconds, where did the time go?”

Without traces, you have two disconnected datasets: metrics that tell you something is wrong in aggregate, and logs that record individual events with no way to connect them to a specific user journey or request flow. With traces, you can go from “p99 latency spiked at 14:23” to “here are the ten slowest requests during that window, and here is the exact call that was slow in each one” in two minutes.
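A small step toward that connection, even before a tracing backend is in place, is stamping every log record with the identifier of the request that produced it. The following is a minimal standard-library sketch, not a substitute for real tracing; `request_id_var`, `RequestIdFilter`, `JsonFormatter`, and the event names are hypothetical names chosen for the example.

```python
import contextvars
import io
import json
import logging

# Hypothetical context variable holding the ID of the request being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True  # annotate only, never drop the record


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {"event": record.getMessage(), "request_id": record.request_id}
        )


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.addFilter(RequestIdFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request(request_id: str) -> None:
    # Set the ID for the duration of this request, then restore the old value.
    token = request_id_var.set(request_id)
    try:
        logger.info("request.started")
        logger.info("request.finished")
    finally:
        request_id_var.reset(token)


handle_request("req-123")
records = [json.loads(line) for line in stream.getvalue().splitlines()]
```

Every line emitted while handling the request carries the same `request_id`, so the events for one request can be pulled out with a single filter. A tracing library does the same thing with a trace ID and adds timing and parent/child structure on top.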

The structured logging gap

Before traces, there’s a more basic problem most teams have and underestimate: unstructured logs.

Unstructured logging looks like this:

logger.info(f"Processing order {order_id} for user {user_id}")
logger.error(f"Failed to process payment for order {order_id}: {str(e)}")

These are human-readable. They are not machine-queryable in any useful way. Finding all failed payments for a specific user requires grepping through potentially millions of log lines with a regex. Correlating a payment failure with the order processing log requires matching on the order_id string. Aggregating error rates by payment provider requires parsing the error message text, which varies based on how the developer wrote the f-string.

Structured logging looks like this:

import structlog

log = structlog.get_logger()

log.info(
    "order.processing.started",
    order_id=order_id,
    user_id=user_id,
    amount=order.total,
    currency=order.currency,
    items_count=len(order.items),
)

log.error(
    "order.payment.failed",
    order_id=order_id,
    user_id=user_id,
    payment_provider=provider.name,
    error_code=e.code,
    error_message=str(e),
    amount=order.total,
    retry_count=attempt,
)

Every field is a key-value pair. With a JSON renderer configured (structlog’s default output is a human-readable console format), every log line is a JSON object. You can now query: all payment failures in the last hour, grouped by payment provider, for orders over $100. That query takes seconds in any log aggregation system. Against unstructured logs it is impractical without significant parsing work.

The event name — order.payment.failed — is a structured identifier, not a sentence. You can filter on it exactly. You can count occurrences. You can set alerts on it. You cannot do any of this reliably against a log line that says “Failed to process payment for order abc123: Card declined.”
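To show what “queryable” means in practice, here is a rough sketch of the payment-failure query run directly against JSON log lines with the standard library. A log aggregation system would do this with its own query language; the sample lines and the `failures_by_provider` helper are made up for illustration.

```python
import json
from collections import Counter

# Hypothetical JSON log lines, shaped like the structured events above.
LOG_LINES = [
    '{"event": "order.payment.failed", "payment_provider": "stripe", "amount": 250.0}',
    '{"event": "order.payment.failed", "payment_provider": "stripe", "amount": 40.0}',
    '{"event": "order.processing.started", "amount": 500.0}',
    '{"event": "order.payment.failed", "payment_provider": "paypal", "amount": 120.0}',
    '{"event": "order.payment.failed", "payment_provider": "stripe", "amount": 999.99}',
]


def failures_by_provider(lines, min_amount):
    """Count payment failures above min_amount, grouped by provider."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        # Exact match on the event name: no regex, no message parsing.
        if record.get("event") != "order.payment.failed":
            continue
        if record.get("amount", 0) <= min_amount:
            continue
        counts[record["payment_provider"]] += 1
    return counts


result = failures_by_provider(LOG_LINES, min_amount=100)
```

The filter is an equality check on a field and the grouping is a dictionary lookup. Neither depends on how a developer happened to phrase a message.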

This is the minimum. Not the good version. The minimum version that makes logs actually queryable.

Instrumentation that tells the truth

Metrics are where teams feel most confident and are most often wrong.

The problem is not usually that teams have no metrics. It’s that the metrics they have are measuring the wrong things, or measuring the right things in the wrong way.

The wrong things: server CPU usage, memory usage, disk space. These are infrastructure metrics. They’re useful for capacity planning. They’re almost never the first indicator that something is wrong for users, and they’re rarely the root cause of user-facing problems. A service can be running at 40% CPU while returning errors to 30% of users. The CPU graph looks fine. The users are not fine.

The right things: the RED method — Rate, Errors, Duration — measured for every service boundary. Every inbound HTTP endpoint. Every outbound call to a database, a cache, an external API. Every queue consumer.

from prometheus_client import Counter, Histogram
import time
from functools import wraps

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code']
)

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5]
)

DOWNSTREAM_CALL_COUNT = Counter(
    'downstream_calls_total',
    'Calls to downstream services',
    ['service', 'operation', 'status']
)

DOWNSTREAM_CALL_DURATION = Histogram(
    'downstream_call_duration_seconds',
    'Downstream call duration',
    ['service', 'operation'],
    buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1]
)


def track_downstream_call(service: str, operation: str):
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "success"
            try:
                result = await func(*args, **kwargs)
                return result
            except Exception as e:
                status = type(e).__name__
                raise
            finally:
                duration = time.perf_counter() - start
                DOWNSTREAM_CALL_COUNT.labels(
                    service=service,
                    operation=operation,
                    status=status
                ).inc()
                DOWNSTREAM_CALL_DURATION.labels(
                    service=service,
                    operation=operation
                ).observe(duration)
        return wrapper
    return decorator


# Usage
class PaymentClient:
    @track_downstream_call(service="stripe", operation="create_payment_intent")
    async def create_payment_intent(self, amount: int, currency: str):
        # actual implementation
        pass

    @track_downstream_call(service="stripe", operation="confirm_payment")
    async def confirm_payment(self, payment_intent_id: str):
        # actual implementation
        pass

Now you know: how many calls you’re making to Stripe per minute, what percentage are failing, and what the p50/p95/p99 latency looks like — broken down by operation. When Stripe has a partial outage affecting only payment confirmation, you see it immediately in the stripe.confirm_payment error rate and duration, while create_payment_intent looks fine. Without this, you see elevated error rates somewhere in your system and spend twenty minutes figuring out which downstream dependency is the problem.

The wrong way to measure duration: averages. An average latency of 250ms can coexist with 5% of requests taking 4 seconds: if the other 95% finish in about 50ms, the mean is roughly 0.95 × 50ms + 0.05 × 4000ms ≈ 250ms. The average looks acceptable. Those users are having a terrible experience. Measure and alert on percentiles (p95 and p99), not averages. The buckets in the Histogram above are deliberately skewed toward low values because that’s where most requests should be, and outliers at the high end are the ones that matter.
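The gap between an average and a percentile is easy to check with a few lines of arithmetic. The sketch below uses invented latencies (95 requests at 50ms, 5 at 4 seconds) and a simple nearest-rank percentile; `percentile` is a helper written for this example, not a library function.

```python
import math
from statistics import mean


def percentile(values, q):
    """Nearest-rank percentile: the smallest sample with at least q% of
    all samples at or below it. q is an integer between 1 and 100."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


# Invented sample: 95 fast requests and 5 very slow ones, in seconds.
latencies = [0.05] * 95 + [4.0] * 5

avg = mean(latencies)            # about 0.25 seconds
p50 = percentile(latencies, 50)  # the typical request
p99 = percentile(latencies, 99)  # the tail that users actually feel
```

The mean lands near a quarter of a second and the median at 50ms, while p99 reports the full 4 seconds. Only the percentile shows what the slowest requests look like.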

Distributed tracing without the overhead excuse

The reason most teams don’t have distributed tracing is that it seems expensive to set up and expensive to run. Both of these things are less true than they used to be.

The setup cost has dropped significantly. OpenTelemetry has become the standard instrumentation library across languages, which means you instrument once and can send to any backend — Jaeger, Tempo, Honeycomb, Datadog, whatever. The instrumentation for common frameworks — FastAPI, Express, Django, Spring — is automatic. You add the library, configure the exporter, and HTTP requests are automatically traced without touching your application code.

# main.py — automatic instrumentation for FastAPI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo:4317"))
)
trace.set_tracer_provider(provider)

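# Instrument the framework and common libraries automatically.
# `app` is your FastAPI application and `engine` your SQLAlchemy engine,
# both assumed to be created earlier in this module.
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument(engine=engine)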

Every inbound request creates a trace. Every outbound HTTP call and every database query is automatically recorded as a child span. You now have end-to-end request traces with zero manual instrumentation.

For the spans that matter most — the business operations, not just the HTTP calls — add manual instrumentation:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)


async def process_order(order_id: str) -> Order:
    with tracer.start_as_current_span("order.process") as span:
        span.set_attribute("order.id", order_id)

        order = await get_order(order_id)
        span.set_attribute("order.total", float(order.total))
        span.set_attribute("order.items_count", len(order.items))

        with tracer.start_as_current_span("order.inventory.check"):
            await check_inventory(order)

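        # Bind the child span so the payment attributes land on the payment
        # span itself rather than on the outer "order.process" span.
        with tracer.start_as_current_span("order.payment.process") as payment_span:
            payment = await process_payment(order)
            payment_span.set_attribute("payment.provider", payment.provider)
            payment_span.set_attribute("payment.intent_id", payment.intent_id)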

        with tracer.start_as_current_span("order.fulfillment.create"):
            await create_fulfillment(order, payment)

        return order

Now when an order takes eight seconds instead of the expected one second, you open the trace and see: inventory check took 50ms, payment processing took 7.4 seconds, fulfillment creation took 200ms. The payment processing span is the problem. You open the payment processing span and see it made four retry attempts to Stripe before succeeding. You’ve gone from “order processing is slow” to “Stripe retries are adding latency” in thirty seconds.

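The running-cost argument is also weaker than it appears. You don’t need to store 100% of traces. Sampling (for example, keeping 10% of successful traces and 100% of slow or failed ones) gives you complete visibility into problems at a fraction of the storage cost. Keeping a trace because it was slow or failed means deciding after the request has finished, which is tail-based sampling and is usually done in a collector rather than in the application.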
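The sampling rule described above can be sketched as a decision made once a trace is complete. This is a minimal illustration, not how any particular collector implements it; `TraceSummary`, `keep_trace`, and the one-second slow threshold are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Hypothetical summary of a finished trace."""
    trace_id: str
    duration_seconds: float
    has_error: bool


def keep_trace(trace, baseline_rate=0.10, slow_threshold_seconds=1.0):
    """Keep every failed or slow trace, plus a fixed share of the rest."""
    if trace.has_error or trace.duration_seconds >= slow_threshold_seconds:
        return True
    # Hash the trace ID into [0, 1) so the decision is deterministic:
    # the same trace always gets the same answer on every evaluation.
    digest = hashlib.sha256(trace.trace_id.encode("utf-8")).digest()
    position = int.from_bytes(digest[:8], "big") / 2**64
    return position < baseline_rate


failed = TraceSummary("trace-a", duration_seconds=0.08, has_error=True)
slow = TraceSummary("trace-b", duration_seconds=7.4, has_error=False)
healthy = [TraceSummary(f"trace-{i}", 0.05, False) for i in range(10_000)]

kept_healthy = sum(keep_trace(t) for t in healthy)
```

Roughly one in ten healthy traces survives, while every trace you would want during an incident is retained.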

Alerts that mean something

The final piece, and the one most often broken, is alerting.

Most alert configurations I’ve seen fail in one of two ways: too many alerts, or too few. Too many and the team learns to ignore them; alert fatigue is real, and it kills the value of the entire observability stack. Too few and you find out about problems from users rather than from your systems.

The principle I’ve found most useful: alert on symptoms, not causes. Alert on what users experience, not on what you think might cause a bad experience.

# Wrong: alerting on causes
- alert: HighCPUUsage
  expr: process_cpu_usage > 0.8
  for: 5m
  annotations:
    summary: "CPU usage is high"

# Right: alerting on symptoms
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status_code=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
    > 0.01
  for: 2m
  annotations:
    summary: "Error rate above 1% for 2 minutes"
    runbook: "https://wiki.internal/runbooks/high-error-rate"

- alert: HighLatency
  expr: |
    histogram_quantile(
      0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
    ) > 1.0
  for: 5m
  annotations:
    summary: "p95 latency above 1s on {{ $labels.endpoint }}"
    runbook: "https://wiki.internal/runbooks/high-latency"

- alert: DownstreamHighErrorRate
  expr: |
    sum(rate(downstream_calls_total{status!="success"}[5m])) by (service)
    /
    sum(rate(downstream_calls_total[5m])) by (service)
    > 0.05
  for: 3m
  annotations:
    summary: "{{ $labels.service }} error rate above 5%"
    runbook: "https://wiki.internal/runbooks/downstream-errors"

Every alert has a runbook link. Not because the runbook will always have the answer, but because the act of writing the runbook forces you to think through the likely causes and the investigation steps before the incident, when you have time to think clearly rather than under pressure.

CPU at 80% does not mean users are having a bad experience. It might not mean anything at all. A 1% error rate means one in a hundred requests is failing. That is always a bad experience for someone and always worth waking someone up.
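The HighLatency rule above depends on histogram_quantile, which estimates a percentile from bucket counts rather than from raw latencies. Here is a rough sketch of that estimate, assuming linear interpolation inside the bucket that contains the target rank (the approach the Prometheus documentation describes). The bucket counts are invented, and `estimate_quantile` is a simplified stand-in, not the real implementation.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with a final +Inf bucket holding the total count.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, count_below = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # Nothing to interpolate toward: report the highest finite bound.
                return lower_bound
            in_bucket = cumulative - count_below
            fraction = (rank - count_below) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, count_below = upper_bound, cumulative
    return math.nan


# Invented cumulative counts for buckets le=0.1, 0.5, 1.0 and +Inf.
buckets = [(0.1, 90), (0.5, 96), (1.0, 99), (math.inf, 100)]

p95 = estimate_quantile(0.95, buckets)     # falls inside the 0.1-0.5 bucket
p999 = estimate_quantile(0.999, buckets)   # falls inside the +Inf bucket
```

The estimate can only be as precise as the bucket boundaries around it, so it pays to place boundaries near the latencies you alert on. Anything beyond the largest finite bucket is reported as that bucket’s upper bound.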

The compounding return

Here is the thing about observability that doesn’t show up in the initial cost-benefit analysis: it compounds.

The first time you have a production incident with proper observability in place, you diagnose it in ten minutes instead of two hours. That’s valuable by itself. But the more important benefit is what happens after: you have a precise understanding of what happened. Not a vague post-mortem narrative about “suspected database issues” but a specific trace showing that query X took 40 seconds because index Y was missing. You add the index. You add an alert for slow queries. The next developer who accidentally introduces a slow query gets caught in staging, not production.

Systems with good observability improve over time in a way that systems without it don’t. Every incident teaches you something specific that you can encode into an alert, a dashboard, or a runbook. The team gets faster at diagnosis. The system gets more instrumented. The incidents get shorter.

Systems without observability teach you nothing you can act on. The incident happened, you restarted the service, it’s working now, let’s hope it doesn’t happen again. It will happen again. You won’t be more prepared.

The investment in observability is not primarily about making incidents shorter, though it does that. It’s about turning your production system into something that teaches you how it behaves, so that over time you understand it better and break it less.

That’s worth the setup cost. It’s worth the ongoing cost. It’s worth doing before you think you need it, because the time you need it is not a time you’ll have to set it up.

Build the instrumentation before the incident. The incident is coming.