Event-Driven Architecture: An Honest Assessment

Event-driven systems are elegant in talks and brutal in production. After building and operating them across multiple companies, here is what nobody tells you before you commit to the pattern.

Every few years the industry rediscovers event-driven architecture and decides it is the answer. The talks are compelling. Services decoupled from each other. No direct dependencies. Producers that emit events and never think about who consumes them. Consumers that react to what happened and never worry about who caused it. The system as a whole becomes a collection of independent actors responding to a shared stream of facts about the world.

In the talk, this is clean. In production, it is one of the most operationally demanding patterns in software engineering, and the gap between how it is pitched and what it costs to run well is wider than for almost any other architectural pattern I can think of.

I have built event-driven systems that worked well and event-driven systems that were disasters. The difference was not the technology and it was not the team’s capability. It was whether the team went in with an accurate picture of what they were buying. Most teams do not get that picture before they commit. This article is an attempt to provide it.

What you actually get

Start with what is genuinely good, because there is genuine good here.

Decoupling that is real. When the order service publishes an OrderPlaced event and knows nothing about who consumes it, and when the inventory service consumes OrderPlaced and knows nothing about who published it, you have achieved something meaningful. Either service can be redeployed without the other. Either can evolve its internal implementation without negotiating with the other. A new service can start consuming OrderPlaced tomorrow without touching the order service at all.

This decoupling is the thing that makes large organisations with many teams possible. The team that owns the order service does not need to be in a meeting with every team that cares about orders. They publish the event. Every consumer team builds and maintains their own reaction to it. The coordination that would have been synchronous and blocking becomes asynchronous and independent.

Audit trails that emerge naturally. If your events are your source of truth, you have a complete record of everything that happened in your system in the order it happened. Not just the current state, but the history of how you got there. This is genuinely useful for debugging, for compliance, and for the class of bugs that are almost impossible to diagnose without knowing what sequence of events preceded them.
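
As a rough illustration of what a history-first record buys you, the sketch below rebuilds one order's status history by replaying its events in sequence. The `RecordedEvent` type, the `replay` helper, and the status mapping are hypothetical; only the event names come from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedEvent:
    # Hypothetical minimal record: position in the log, event type, order ID.
    sequence: int
    event_type: str
    order_id: str


# Hypothetical mapping from event type to the order status it produces.
STATUS_AFTER = {
    "OrderPlaced": "pending",
    "InventoryReserved": "confirmed",
    "ShipmentCreated": "shipped",
}


def replay(events: list[RecordedEvent], order_id: str) -> list[tuple[int, str]]:
    """Return the status history of one order as (sequence, status) pairs."""
    history = []
    for event in sorted(events, key=lambda e: e.sequence):
        if event.order_id == order_id and event.event_type in STATUS_AFTER:
            history.append((event.sequence, STATUS_AFTER[event.event_type]))
    return history


event_log = [
    RecordedEvent(1, "OrderPlaced", "order-1"),
    RecordedEvent(2, "OrderPlaced", "order-2"),
    RecordedEvent(3, "InventoryReserved", "order-1"),
    RecordedEvent(4, "ShipmentCreated", "order-1"),
]

history = replay(event_log, "order-1")
current_status = history[-1][1] if history else None
```

The current state is just the last entry; everything before it is the part a state-only database would have thrown away.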

Load handling that is structural rather than bolted on. A consumer that reads from a queue processes work at the rate it can handle, regardless of the rate at which work arrives. The queue absorbs the spike. The consumer processes the backlog when capacity is available. This is structurally different from a synchronous system where a traffic spike hits the service directly and the service either handles it or falls over.
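
To make the buffering concrete, here is a small in-process sketch using `asyncio.Queue` as a stand-in for a real broker. The function names and the burst size are assumptions for illustration; the point is only that the producer finishes immediately while the consumer drains the backlog at its own pace.

```python
import asyncio


async def consumer(queue: asyncio.Queue, processed: list, delay: float) -> None:
    # Pull one item at a time and work at the consumer's own pace.
    while True:
        item = await queue.get()
        await asyncio.sleep(delay)  # Stand-in for real work.
        processed.append(item)
        queue.task_done()


async def run_spike(burst_size: int = 50, delay: float = 0.001) -> tuple[int, list]:
    queue: asyncio.Queue = asyncio.Queue()
    processed: list = []
    worker = asyncio.create_task(consumer(queue, processed, delay))

    # The producer emits the whole burst at once; nothing blocks on the consumer.
    for i in range(burst_size):
        queue.put_nowait(i)
    backlog_at_peak = queue.qsize()

    await queue.join()  # Wait until the consumer has drained the backlog.
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return backlog_at_peak, processed
```

Calling `asyncio.run(run_spike())` returns the peak backlog and the processed items in arrival order. A real broker adds persistence and delivery guarantees that this sketch does not attempt.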

These are real benefits. They are worth having. They are also not free.

The cost nobody quotes you upfront

Eventual consistency is not a configuration option, it is a commitment.

When you move from synchronous calls to events, you give up the ability to know, at any given moment, that all parts of your system agree on the current state. The order was placed. The OrderPlaced event was published. The inventory service will consume it and reserve the stock. When? Soon. How soon? Depends on the consumer’s lag. What if the user queries their order status right now, before the inventory service has processed the event? The order exists in the order service’s view. The inventory has not yet been reserved. The system is in an intermediate state that is internally consistent but not yet globally consistent.

For many use cases this is acceptable. For some it is not. The teams that adopt event-driven architecture without thinking carefully about which of their use cases fall into which category discover the hard way that “eventually” can mean milliseconds, seconds, or minutes depending on what else is happening, and that users do not have the same patience for eventual consistency that architects do.

# The problem that bites teams who haven't thought this through:

# User places an order. Order service publishes event.
async def place_order(user_id: str, items: list) -> Order:
    order = Order.create(user_id=user_id, items=items)
    await db.save(order)
    await event_bus.publish(OrderPlaced(order_id=order.id, items=items))
    return order

# User immediately queries order status.
# Order service returns order with status "pending".
# Inventory service has not yet processed the event.
# Frontend shows "pending" with a spinner.
# User refreshes. Still pending. Refreshes again.
# Inventory event processes. Status updates to "confirmed".
# User has refreshed four times and is on the phone with support.

# The system was correct the entire time.
# The user experience was broken the entire time.
# These are not contradictory.

# What teams often do to address this:
# Read-your-own-writes consistency for the immediate response,
# combined with a clear UI state that communicates processing is happening.

async def place_order_with_ux_in_mind(user_id: str, items: list) -> dict:
    order = Order.create(user_id=user_id, items=items)
    await db.save(order)
    await event_bus.publish(OrderPlaced(order_id=order.id, items=items))

    return {
        "order_id": order.id,
        "status": "processing",
        "message": "Your order is being confirmed. This usually takes a few seconds.",
        "poll_url": f"/orders/{order.id}/status",
        # Give the client a signal about what to do next,
        # rather than leaving them in ambiguous pending state
    }
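
On the client side, the `poll_url` is only useful if something polls it sensibly. A minimal sketch, assuming the caller supplies a `fetch_status` coroutine that returns the order's current status string (the function name, the delays, and the attempt limit are all hypothetical):

```python
import asyncio
from typing import Awaitable, Callable


async def wait_for_confirmation(
    fetch_status: Callable[[], Awaitable[str]],
    initial_delay: float = 0.5,
    max_delay: float = 8.0,
    max_attempts: int = 6,
) -> str:
    """Poll until the status leaves "processing" or attempts run out."""
    delay = initial_delay
    status = "processing"
    for _ in range(max_attempts):
        status = await fetch_status()
        if status != "processing":
            return status
        await asyncio.sleep(delay)
        # Back off so a lagging consumer is not hammered with status requests.
        delay = min(delay * 2, max_delay)
    # Still "processing": the caller should show a "taking longer than usual" state.
    return status
```

The bounded attempt count matters as much as the backoff. A client that polls forever turns consumer lag into a spinner that never resolves.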

Debugging across services is a different discipline entirely.

In a synchronous system, a bug has a call stack. You look at the stack trace and you see exactly what called what in what order. The sequence is right there.

In an event-driven system, the equivalent of a call stack is a trace across multiple services, potentially across multiple events, potentially hours or days apart. OrderPlaced fires. InventoryReserved fires. PaymentProcessed fires. FulfillmentCreated fires. ShipmentCreated fires. The user reports that their order is stuck. You need to find where in this sequence something went wrong, knowing that each step is in a different service with different logs, possibly with events that were consumed out of order, possibly with a consumer that failed silently and moved on.

Without distributed tracing that propagates correlation IDs across every event, this debugging is archaeology. You are sifting through log files from multiple services trying to reconstruct what happened to a specific order that a specific user placed at a specific time.

# Event envelope that makes debugging survivable.
# Every event carries a correlation ID from the originating request.
# Every subsequent event in the chain inherits it.
# Every service logs it with every operation related to the event.
# When debugging, filter all service logs by correlation ID
# and you reconstruct the full sequence.

from dataclasses import dataclass, field
from datetime import datetime, timezone
import uuid

import structlog  # Assumption: any logger that accepts key-value context works here.

log = structlog.get_logger()


@dataclass
class Event:
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    event_type: str = ""
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    causation_id: str = ""    # ID of the event that caused this one
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    schema_version: str = "1.0"
    payload: dict = field(default_factory=dict)

    def caused_by(self, parent_event: "Event") -> "Event":
        self.correlation_id = parent_event.correlation_id
        self.causation_id = parent_event.event_id
        return self


# When the inventory service handles OrderPlaced and emits InventoryReserved:
async def handle_order_placed(event: Event) -> None:
    order_id = event.payload["order_id"]

    log.info(
        "inventory.reservation.started",
        correlation_id=event.correlation_id,
        order_id=order_id,
    )

    await reserve_inventory(order_id)

    reserved_event = Event(
        event_type="InventoryReserved",
        payload={"order_id": order_id},
    ).caused_by(event)   # Inherits correlation_id, sets causation_id

    await event_bus.publish(reserved_event)

    log.info(
        "inventory.reservation.completed",
        correlation_id=event.correlation_id,
        order_id=order_id,
        caused_event_id=reserved_event.event_id,
    )

With this envelope, every event in a chain shares a correlation ID. Every log line from every service that handled any event in the chain includes that correlation ID. Debugging a stuck order is a single log query: show me every log line with this correlation ID, across all services, sorted by time.
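
The log query itself is simple enough to sketch. Assuming log lines have already been collected from every service as dictionaries with `timestamp`, `service`, `event`, and `correlation_id` keys (a hypothetical shape; in practice your log platform runs this query for you):

```python
from datetime import datetime


def trace(log_lines: list[dict], correlation_id: str) -> list[dict]:
    """Return every log line for one correlation ID, oldest first."""
    matching = [
        line for line in log_lines
        if line.get("correlation_id") == correlation_id
    ]
    return sorted(matching, key=lambda line: datetime.fromisoformat(line["timestamp"]))


# Hypothetical log lines gathered from three services, in arbitrary order.
collected = [
    {"timestamp": "2024-01-01T10:00:02+00:00", "service": "inventory",
     "event": "inventory.reservation.completed", "correlation_id": "abc"},
    {"timestamp": "2024-01-01T10:00:00+00:00", "service": "orders",
     "event": "order.placed", "correlation_id": "abc"},
    {"timestamp": "2024-01-01T10:00:01+00:00", "service": "payments",
     "event": "payment.started", "correlation_id": "xyz"},
    {"timestamp": "2024-01-01T10:00:01+00:00", "service": "inventory",
     "event": "inventory.reservation.started", "correlation_id": "abc"},
]

timeline = [line["event"] for line in trace(collected, "abc")]
```

Sorting by timestamp assumes clocks across services are reasonably synchronised. Where they are not, the `causation_id` chain is the more reliable ordering.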

Without this, you do not have an event-driven system you can operate. You have a system that works until it breaks and then you cannot find out why.

Consumer failures are invisible by default.

In a synchronous system, a failure is loud. The call throws an exception. The caller gets an error. Someone notices.

In an event-driven system, a consumer failure can be completely silent. The consumer reads the event, fails to process it, and depending on how it is configured, either requeues the event, moves it to a dead letter queue, or discards it and moves on. The producer never knows. The other consumers never know. The user whose order triggered the event never knows until they notice that something downstream has not happened.

Dead letter queues are the standard answer to this and they are the right answer, but they are only useful if someone is watching them. A dead letter queue that nobody monitors is not a safety net. It is a place where failed events go to be forgotten.
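
A minimal sketch of the retry-then-park flow, with an in-memory list standing in for the dead letter queue. `process_with_dead_letter` and its parameters are hypothetical names. Managed brokers usually implement this for you (SQS does it through a redrive policy), so read this as an illustration of the flow rather than something to deploy.

```python
import asyncio
from typing import Awaitable, Callable


async def process_with_dead_letter(
    event: dict,
    handler: Callable[[dict], Awaitable[None]],
    dead_letters: list,
    max_attempts: int = 3,
) -> bool:
    """Try the handler up to max_attempts times, then park the event.

    Returns True if the event was handled, False if it was dead-lettered.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            await handler(event)
            return True
        except Exception as exc:
            # Sketch only: real code would separate retryable from fatal errors.
            last_error = exc
    # Keep the failure reason next to the event so whoever investigates
    # does not have to reconstruct it from logs.
    dead_letters.append(
        {"event": event, "error": repr(last_error), "attempts": max_attempts}
    )
    return False
```

The important property is that the third option from the list above, discarding the event and moving on, never happens: every event ends up either handled or parked somewhere visible.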

# Dead letter queue monitoring that actually alerts

import boto3
from prometheus_client import Gauge

sqs = boto3.client("sqs")

DLQ_DEPTH = Gauge(
    "sqs_dlq_message_count",
    "Number of messages in dead letter queue",
    ["queue_name"]
)


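# boto3 calls are blocking, so this is a plain function.
# Run it on a schedule, or via asyncio.to_thread() from async code.
def check_dlq_depths() -> None: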
    queues = [
        ("order-processing-dlq", "https://sqs.region.amazonaws.com/account/order-processing-dlq"),
        ("inventory-dlq", "https://sqs.region.amazonaws.com/account/inventory-dlq"),
        ("payment-dlq", "https://sqs.region.amazonaws.com/account/payment-dlq"),
        ("fulfillment-dlq", "https://sqs.region.amazonaws.com/account/fulfillment-dlq"),
    ]

    for queue_name, queue_url in queues:
        response = sqs.get_queue_attributes(
            QueueUrl=queue_url,
            AttributeNames=["ApproximateNumberOfMessages"]
        )
        depth = int(response["Attributes"]["ApproximateNumberOfMessages"])
        DLQ_DEPTH.labels(queue_name=queue_name).set(depth)


# Alert rule: any DLQ with more than 0 messages is an incident.
# Not a warning. An incident.
# A message in the DLQ means an event failed to process.
# That means something the user expected to happen did not happen.
# That is always worth investigating immediately.

Any message in a dead letter queue is a symptom of a real problem. Not a might-be-a-problem. A real problem. Alerting as soon as DLQ depth rises above zero sets the expectation that failures are real and visible, rather than background noise to be managed.

The schema problem that grows until it bites you

Events are a public interface. Once a consumer is reading your events, the schema of those events is a contract. Changing the schema breaks the consumer.

In a synchronous API, schema evolution is managed through versioning. The producer runs V1 and V2 of the endpoint simultaneously. Consumers migrate at their own pace. When all consumers have migrated, V1 is retired.

In an event-driven system, the equivalent is possible but operationally harder. If you change the OrderPlaced event schema, you need every consumer of OrderPlaced to be updated before you change the schema, or the consumer needs to handle both old and new schemas simultaneously, or you need to maintain two event types in parallel during migration. None of these options is cheap, and they are cheaper if you planned for them than if you did not.

The teams that handle this well establish schema governance before they have a schema problem. Not after.

# Schema versioning that makes evolution manageable.
# Every event schema has an explicit version.
# Consumers dispatch on the version they receive
# and fail loudly on versions they do not understand.

from pydantic import BaseModel
from typing import Literal
from datetime import datetime


# Version 1 of OrderPlaced
class OrderPlacedV1(BaseModel):
    schema_version: Literal["1.0"] = "1.0"
    order_id: str
    user_id: str
    total: float
    occurred_at: datetime


# Version 2 adds line items and changes total to be in cents
# to avoid floating point issues that V1 had
class OrderPlacedV2(BaseModel):
    schema_version: Literal["2.0"] = "2.0"
    order_id: str
    user_id: str
    total_cents: int          # Changed: was "total: float"
    currency: str             # Added
    items: list[dict]         # Added
    occurred_at: datetime


# Consumer that handles both versions during migration period
class InventoryConsumer:
    async def handle(self, raw_event: dict) -> None:
        version = raw_event.get("schema_version", "1.0")

        if version == "1.0":
            event = OrderPlacedV1(**raw_event)
            order_id = event.order_id
            # V1 doesn't have items, so we have to fetch them
            items = await self.order_service.get_items(order_id)
        elif version == "2.0":
            event = OrderPlacedV2(**raw_event)
            order_id = event.order_id
            items = event.items
        else:
            log.error(
                "inventory.consumer.unknown_schema_version",
                version=version,
                event_id=raw_event.get("event_id"),
            )
            raise UnknownSchemaVersionError(version)

        await self.reserve_inventory(order_id=order_id, items=items)

Schema versioning adds boilerplate. The alternative is finding out during an incident that a schema change broke a consumer that nobody knew was depending on the old format.
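
One way to keep that boilerplate in a single place is to convert old versions to the newest shape at the edge of the consumer, so the handler only ever sees V2. The sketch below works on plain dictionaries and assumes the caller supplies the two things V1 never carried: the line items and the currency. `upcast_order_placed` is a hypothetical helper, not part of any library.

```python
from decimal import Decimal, ROUND_HALF_UP


def upcast_order_placed(raw_event: dict, items: list[dict], currency: str) -> dict:
    """Convert a V1 OrderPlaced payload to the V2 shape.

    V2 payloads pass through unchanged. `items` and `currency` must come from
    the caller because V1 events do not contain them.
    """
    version = raw_event.get("schema_version", "1.0")
    if version == "2.0":
        return raw_event
    if version != "1.0":
        raise ValueError(f"unknown schema version: {version}")

    # Go through Decimal(str(...)) so 19.99 becomes 1999 cents, not 1998.
    cents = (Decimal(str(raw_event["total"])) * 100).quantize(
        Decimal("1"), rounding=ROUND_HALF_UP
    )

    upgraded = {key: value for key, value in raw_event.items() if key != "total"}
    upgraded.update(
        schema_version="2.0",
        total_cents=int(cents),
        currency=currency,
        items=items,
    )
    return upgraded
```

The trade-off is that the conversion can only be as good as the data available. If V1 events were published in more than one currency, no upcaster can recover that after the fact.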

When event-driven architecture is the wrong answer

The pattern is not universally appropriate. Teams adopt it when they should not, attracted by the elegance and the conference talks, and then spend years paying costs that were not necessary.

When you have one team and one service. Events are an organisational boundary mechanism. If there is no organisational boundary, you are paying the operational cost of distributed messaging for no architectural benefit. A function call is faster, simpler, and easier to debug. A modular monolith with internal domain events gives you the architectural thinking without the operational overhead.
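
For a sense of how little machinery internal domain events need, here is a sketch of an in-process dispatcher. `InProcessEventBus` is a hypothetical class; handlers run synchronously in the publisher's call stack, which is exactly why failures stay loud and debuggable.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class InProcessEventBus:
    """Dispatches events to handlers inside the same process, synchronously."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # A handler exception propagates to the publisher with a normal stack trace.
        for handler in self._handlers.get(event_type, []):
            handler(payload)


bus = InProcessEventBus()
reserved_orders: list[str] = []
notified_orders: list[str] = []

# The order module publishes; inventory and notification modules subscribe.
bus.subscribe("OrderPlaced", lambda payload: reserved_orders.append(payload["order_id"]))
bus.subscribe("OrderPlaced", lambda payload: notified_orders.append(payload["order_id"]))

bus.publish("OrderPlaced", {"order_id": "order-1"})
```

The modules stay decoupled in code, and if an organisational boundary appears later, the dispatcher is the seam where a real broker can be introduced.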

When your operations require immediate consistency. Financial transactions. Inventory deduction that must be accurate at the moment of purchase. Medical record updates. Any situation where the user or the business cannot tolerate the state being temporarily inconsistent. Eventual consistency is not a technical property to be engineered around in these cases. It is a fundamental unsuitability.

When your team does not have the operational maturity for it. Event-driven systems require distributed tracing. They require DLQ monitoring. They require schema governance. They require expertise in at least one message broker technology. They require runbooks for consumer failure scenarios. Teams that are still establishing basic engineering practices should not add this operational surface area. Stabilise first. Adopt the pattern when you have the capacity to operate it correctly.

When the communication pattern is inherently synchronous. A user submits a form and expects a result. An API client makes a request and needs a response before it can proceed. A batch job reads data and produces a report. These patterns do not become better by adding events between the steps. They become more complex with no benefit. Forcing an asynchronous pattern onto an inherently synchronous workflow is an architecture astronaut move, not an engineering decision.

The systems that work

The event-driven systems that work well in production share a set of properties that are not negotiable.

Every event carries enough context to be processed without fetching additional data. A consumer that needs to make a synchronous call to process an event has a dependency on the producer that the event pattern was supposed to eliminate.

Every consumer is idempotent. Events can be delivered more than once. A consumer that is not idempotent will produce duplicate effects when this happens. Designing for idempotency upfront is straightforward. Retrofitting it after duplicate processing has caused data integrity issues is expensive.

# Idempotent consumer using a processed events log
async def handle_order_placed(event: Event) -> None:
    # Check if we have already processed this event
    already_processed = await processed_events.exists(event.event_id)
    if already_processed:
        log.info(
            "inventory.consumer.duplicate_event_skipped",
            event_id=event.event_id,
            correlation_id=event.correlation_id,
        )
        return

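    # Process the event.
    # Caveat: if the consumer crashes between this call and the record below,
    # the event will be processed again on redelivery. Where possible, make the
    # side effect and the record part of one transaction.
    await reserve_inventory(event.payload["order_id"])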

    # Record that we have processed it
    await processed_events.record(
        event_id=event.event_id,
        processed_at=datetime.now(timezone.utc),
        consumer="inventory-service",
    )

Every schema change goes through a review process before it is deployed. The review asks: which consumers will be affected? Have they been updated? Can the change be made backwards-compatible? If not, what is the migration plan?
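
Part of that review can be automated. The sketch below compares two schema versions, described as simple field-name-to-type mappings, and lists the changes that would break an existing consumer. It assumes consumers ignore fields they do not recognise, so additions are treated as safe; `breaking_changes` is a hypothetical helper.

```python
def breaking_changes(old_fields: dict[str, str], new_fields: dict[str, str]) -> list[str]:
    """List changes between two schema versions that would break existing consumers.

    Removed fields and changed types are breaking. Added fields are not,
    on the assumption that consumers ignore fields they do not know.
    """
    problems = []
    for name, old_type in old_fields.items():
        if name not in new_fields:
            problems.append(f"removed field: {name}")
        elif new_fields[name] != old_type:
            problems.append(
                f"changed type of {name}: {old_type} -> {new_fields[name]}"
            )
    return problems


# The two OrderPlaced versions from earlier, minus the version marker itself.
order_placed_v1 = {
    "order_id": "str",
    "user_id": "str",
    "total": "float",
    "occurred_at": "datetime",
}
order_placed_v2 = {
    "order_id": "str",
    "user_id": "str",
    "total_cents": "int",
    "currency": "str",
    "items": "list",
    "occurred_at": "datetime",
}

report = breaking_changes(order_placed_v1, order_placed_v2)
```

A check like this answers only the third review question. Which consumers are affected and whether they have been updated still needs a person, or a registry of who consumes what.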

Every dead letter queue has an alert and an owner. DLQ messages are investigated the same day they appear. Not triaged. Not backlogged. Investigated.

The teams that run event-driven systems well have built this infrastructure. It is not glamorous work. It does not appear in the conference talk about the elegant decoupling. It is the thing that makes the decoupling survivable in practice.

The honest version of the pitch

Event-driven architecture is worth adopting when you have multiple teams that need genuine autonomy, when your use cases can tolerate eventual consistency, when you have the operational maturity to run distributed messaging in production, and when the benefits of decoupling outweigh the costs of asynchrony and distribution.

In those circumstances it is genuinely powerful. The teams that use it well will tell you they cannot imagine going back. The operational complexity feels like a fair trade for the organisational flexibility.

In other circumstances it is an expensive mismatch between pattern and problem. The complexity of the pattern does not disappear because you chose it for the wrong reasons. It stays. You pay it.

The evaluation should be honest about both sides. Not “events are the modern way to build systems” which is a fashion statement. Not “events are always too complex” which ignores where they genuinely excel. The honest version: this pattern solves specific organisational and scalability problems at a specific operational cost. Do the problems apply to us? Can we afford the cost? Those are the questions worth asking before you commit.

The answer is sometimes yes. It is not always yes. And pretending otherwise is how teams end up with event-driven systems they cannot operate and cannot easily escape.