
Microservices Were Never About Technology

The microservices conversation in most engineering teams goes something like this. The monolith is getting unwieldy. Deployments are slow. The codebase is hard to navigate. A senior engineer proposes breaking things apart into services. The team agrees. They spend six months doing it. Things get worse.

The services are too small or too large. Nobody agrees on where the boundaries should be. Simple features now require coordinating changes across three repositories. A bug that used to take twenty minutes to debug now takes two hours because it crosses service boundaries. Deployments are more frequent but individually more fragile. The team is working harder than before and moving slower than before.

They blame the technology. The technology is not the problem.

Microservices failed for them for the same reason they fail for most teams that adopt them: the team treated decomposition as a technical decision and ignored the organisational problem that microservices were designed to solve. You cannot separate the architecture from the team structure. Conway’s Law is not optional.

What microservices were actually invented for

Amazon is the origin story most people know. The mandate from Jeff Bezos in the early 2000s: all teams will expose their data and functionality through service interfaces. All teams will communicate through these interfaces. No other form of interprocess communication is allowed. Anyone who doesn’t do this will be fired. He was serious.

The problem Bezos was solving was not technical. Amazon’s engineering organisation had grown to a size where teams were deeply entangled with each other. Team A could not deploy without coordinating with Team B, which needed sign-off from Team C, which had a dependency on Team D. Every change required a synchronisation meeting. Every release was a negotiation. The coupling between teams was strangling the organisation’s ability to move.

Services were the solution because they forced a contract between teams. If Team A owns Service A, and Team B owns Service B, and they communicate only through a defined API, then Team A can change anything inside Service A without asking Team B’s permission. Team B can deploy Service B on its own schedule. The organisational autonomy is enforced by the technical boundary.

This is the thing that most microservices adoptions miss entirely. Services are not primarily a way to scale technology. They are a way to scale teams. The technical properties of services (independent deployment, technology flexibility, fault isolation) are valuable side effects of the organisational property (team autonomy).

If you adopt microservices without the organisational changes that make them valuable, you get all the costs of distributed systems and none of the benefits.

The cost of distribution

A monolith, for all its problems, has properties that distributed systems do not have and cannot have.

A function call inside a monolith is reliable. Either it works or it throws an exception. It is fast. It completes in microseconds. It participates in the same database transaction as the code that called it. If the whole operation needs to be rolled back, it is.

A network call between services is unreliable. It might succeed. It might fail. It might succeed on the server side and fail on the network before the response reaches the caller. It might time out, leaving you with no information about whether the remote operation completed. It is slow relative to a function call. It crosses a transaction boundary, which means if something fails after the call succeeded, you have a consistency problem that cannot be resolved by a rollback.
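
The ambiguous case, where the work happens remotely but the answer never arrives, is easier to see in code than in prose. What follows is a minimal sketch, simulated in one process with no real network. `FakePaymentServer`, `call_with_timeout` and `naive_retry` are hypothetical names invented for the illustration, and the one-hour sleep simply stands in for a lost response.

```python
import asyncio


class FakePaymentServer:
    """Stands in for a remote service. No real network is involved."""

    def __init__(self, drop_first_response: bool) -> None:
        self.charges = []
        self._drop_next_response = drop_first_response

    async def charge(self, amount: int) -> str:
        # The side effect happens on the "server" first...
        self.charges.append(amount)
        if self._drop_next_response:
            self._drop_next_response = False
            # ...and then the response is lost: the caller never hears back.
            await asyncio.sleep(3600)
        return "ok"


async def call_with_timeout(server: FakePaymentServer, amount: int) -> str:
    """Return 'ok' or 'unknown'. A timeout cannot be read as a failure."""
    try:
        return await asyncio.wait_for(server.charge(amount), timeout=0.01)
    except asyncio.TimeoutError:
        return "unknown"


async def naive_retry(server: FakePaymentServer, amount: int) -> list[str]:
    outcomes = [await call_with_timeout(server, amount)]
    if outcomes[-1] == "unknown":
        # Retrying blindly repeats the side effect on the server.
        outcomes.append(await call_with_timeout(server, amount))
    return outcomes


server = FakePaymentServer(drop_first_response=True)
outcomes = asyncio.run(naive_retry(server, 100))
```

The caller sees one timeout and one success. The server has recorded two charges. Nothing in the caller's view of the world distinguishes this from a first attempt that genuinely failed, and there is no transaction to roll back.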

This is not an implementation detail to be engineered around. It is a fundamental property of distributed systems, described precisely in the fallacies of distributed computing, which L. Peter Deutsch and colleagues at Sun Microsystems began compiling in 1994 and which the industry has been rediscovering ever since.

The network is not reliable. Latency is not zero. Bandwidth is not infinite. The network is not secure. Topology changes. There is not one administrator. Transport cost is not zero. The network is not homogeneous.

Every service boundary you add to your system is a place where these fallacies apply. Every service call is an opportunity for latency, failure, and consistency problems that simply do not exist inside a monolith. The question is whether the organisational benefits of the boundary justify the distributed systems cost of maintaining it.
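
The cost compounds in a way that is worth working through once. The sketch below assumes a request must pass through several services in series and that their failures are independent. Both are simplifying assumptions, and the 99.9% figure is a hypothetical number chosen for the arithmetic, not a measurement.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60


def chain_availability(per_service: float, hops: int) -> float:
    """Availability of a request that needs `hops` services in series,
    assuming each fails independently of the others."""
    return per_service ** hops


def downtime_minutes(availability: float) -> float:
    """Expected unavailable minutes over a 30-day month."""
    return (1.0 - availability) * MINUTES_PER_30_DAYS


one_service = chain_availability(0.999, 1)
five_services = chain_availability(0.999, 5)
```

Under those assumptions, one service at 99.9% is unavailable for roughly 43 minutes a month. A request path through five such services is unavailable for roughly 216 minutes. Each service is as reliable as it was before. The system is not.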

For a team of eight people working on one product, they almost never do.

Where the boundaries actually belong

The most common failure mode in microservices adoption is drawing service boundaries around technical concerns rather than business ones. Teams create an “auth service,” a “notification service,” a “user service,” a “payment service.” These feel like natural decompositions because they map to recognisable technical concepts.

They are terrible service boundaries.

An auth service that every other service must call to validate a token is not a service. It is a shared library that has been deployed as infrastructure, adding network latency and a failure mode to every authenticated request in the system. If the auth service is slow, everything is slow. If the auth service is down, everything is down. You have taken a piece of logic that could live as a function call and made it a distributed systems problem.
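
To make "a function call" concrete, here is a hedged sketch of token validation done in-process. It assumes a signed token and a verification key available to every service that needs it. The token format, `SECRET`, `create_token` and `validate_token` are all hypothetical, built only from the Python standard library. A real system would more likely use an established token format and a maintained library.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET = b"demo-only-secret"  # hypothetical; real keys come from a secret store


def _sign(payload: str) -> str:
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def create_token(user_id: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    issued = time.time() if now is None else now
    payload = f"{user_id}.{int(issued + ttl_seconds)}"
    return f"{payload}.{_sign(payload)}"


def validate_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the user id if the token is authentic and unexpired, else None."""
    try:
        user_id, expires_at, signature = token.rsplit(".", 2)
        expiry = int(expires_at)
    except ValueError:
        return None  # wrong number of parts, or a non-numeric expiry
    expected = _sign(f"{user_id}.{expires_at}")
    # Constant-time comparison, on bytes so non-ASCII input cannot raise.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    current = time.time() if now is None else now
    return user_id if current < expiry else None
```

No network hop, no shared runtime dependency, no new failure mode on the request path. The trade-off is that a token validated locally cannot be revoked instantly, which is one reason short expiry times are common with this approach.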

A notification service is not a service. It is a collection of side effects that have been externalised. The service that wants to send an email must now make a network call, handle the failure case, and decide what to do if the notification service is unavailable at the moment the email needs to be sent.

The boundaries that work are the ones that map to bounded contexts in the business domain. Not “the thing that handles auth” but “the thing that owns everything about how customers interact with our platform.” Not “the thing that sends notifications” but “the thing that owns the customer communication history and all the rules about when and how to communicate.”

These boundaries are harder to identify. They require understanding the business deeply enough to know where the real seams are. They require conversations with product managers and domain experts, not just with engineers. They change as the business evolves. But they are the boundaries that, when you respect them, produce services that teams can own autonomously and evolve independently.

Domain-Driven Design’s concept of bounded contexts is the clearest framework for finding these boundaries. The bounded context defines the scope within which a particular domain model applies. At the edge of the bounded context, the model changes. That is where the service boundary belongs.

# A service boundary drawn around a technical concern.
# Every other service calls this. Auth is now a distributed dependency.
#
# Bad:
class AuthService:
    async def validate_token(self, token: str) -> User:
        ...
    async def create_token(self, user_id: str) -> str:
        ...
    async def revoke_token(self, token: str) -> None:
        ...


# A service boundary drawn around a business capability.
# This service owns everything about an order, including its auth context.
# Other services don't call into it for auth. They communicate
# through events when they need to know something happened.
#
# Better:
class OrderService:
    async def place_order(self, customer_id: str, items: list) -> Order:
        # Auth context is resolved here, not farmed out to a network call
        customer = await self.customer_repository.get(customer_id)
        if not customer.can_place_orders():
            raise InsufficientPermissionsError()
        ...

    async def cancel_order(self, order_id: str, requesting_customer_id: str) -> None:
        order = await self.order_repository.get(order_id)
        if order.customer_id != requesting_customer_id:
            raise InsufficientPermissionsError()
        ...
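
The claim that the model changes at the edge of a bounded context can be shown with a small sketch. The two contexts below, ordering and customer communication, each hold their own idea of a customer, and the only thing that crosses between them is an event. Every class, field and function name here is a hypothetical chosen for illustration, including the `customer_id` field on the event.

```python
from dataclasses import dataclass


# Ordering context: a customer is someone who may or may not place orders.
@dataclass(frozen=True)
class OrderingCustomer:
    customer_id: str
    credit_hold: bool

    def can_place_orders(self) -> bool:
        return not self.credit_hold


# Communication context: the same person is a set of contact preferences.
@dataclass(frozen=True)
class Recipient:
    customer_id: str
    email: str
    marketing_opt_in: bool


# The contract between the contexts: an event, not a shared model.
@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    customer_id: str


def confirmation_for(event: OrderPlaced, recipients: dict[str, Recipient]) -> str:
    """The communication context translates the event into its own model."""
    recipient = recipients[event.customer_id]
    return f"To {recipient.email}: order {event.order_id} confirmed"
```

Neither context imports the other's customer class. Ordering can add a credit rule and communication can add a channel preference without either team asking the other. The shared identifier and the event schema are the whole contract.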

Conway’s Law is a constraint, not a suggestion

Mel Conway observed in 1967 that organisations produce systems that mirror their communication structures. An organisation with three groups will produce a system with three components. This is not because they planned to. It is because the system reflects who talks to whom.

The implication that most teams don’t fully absorb: if you want a particular system architecture, you need the corresponding organisational structure. You cannot have a microservices architecture with a team structure designed for a monolith. The organisation will fight the architecture until one of them wins, and the organisation usually wins because it existed first.

This is why Amazon’s microservices worked. The service boundaries and the team boundaries were the same boundaries. Team A owns Service A. Not “Team A and Team B both contribute to Service A.” Not “Service A is maintained by whoever has time.” One team, one service, full ownership. The organisational autonomy and the technical autonomy were the same thing.

Most microservices adoptions separate these. The same team that used to work on the monolith now works on six services. They have all the coordination overhead of distributed systems and none of the team autonomy that makes it worth it. They still talk to each other constantly because they’re the same people. The service boundaries don’t reflect team boundaries because there are no team boundaries. There is one team doing distributed systems for no organisational reason.

The inverse Conway maneuver, a term coined by Thoughtworks, is the deliberate version: you design the team structure you want, then let the architecture follow from it. If you want a payments service that can be developed and deployed independently, you need a payments team that can make decisions and ship code independently. If you do not have or cannot create that team, you do not have the prerequisite for the payments service.

The prerequisite check before splitting a service:

Who will own this service? “The backend team” is not an answer. A named, stable, small team is an answer.

Can that team deploy the service without coordinating with other teams? If not, the boundary is wrong or the ownership is wrong.

Can that team change the service’s internal implementation without changing any other service? If not, the boundary is wrong.

Is there a defined contract (API, event schema) between this service and its consumers? If not, you don’t have a service. You have a distributed module.

Does the team have enough context about the business domain this service represents to make good decisions autonomously? If not, the team needs to exist and stabilise before the service should be extracted.

If any of these answers is no, the split is premature.
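
The check is mechanical enough to write down. The sketch below encodes the five questions as fields on a record and reports which ones block the split. `SplitReadiness`, `blockers` and `split_is_premature` are hypothetical names, and the answers still have to come from honest conversations, not from code.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SplitReadiness:
    has_named_owning_team: bool
    can_deploy_without_coordination: bool
    can_change_internals_in_isolation: bool
    has_defined_contract: bool
    team_has_domain_context: bool


def blockers(readiness: SplitReadiness) -> list[str]:
    """Return the name of every prerequisite that is not met."""
    return [f.name for f in fields(readiness) if not getattr(readiness, f.name)]


def split_is_premature(readiness: SplitReadiness) -> bool:
    # A single "no" is enough to make the split premature.
    return bool(blockers(readiness))


payments = SplitReadiness(
    has_named_owning_team=True,
    can_deploy_without_coordination=True,
    can_change_internals_in_isolation=True,
    has_defined_contract=False,
    team_has_domain_context=True,
)
```

For this hypothetical payments candidate, `blockers(payments)` names the missing contract and `split_is_premature(payments)` is true. The value of writing it down is that "mostly ready" stops being an acceptable answer.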

The operational surface nobody accounts for

When a team decides to split their monolith into ten services, they usually have a plan for the technical decomposition. They rarely have a plan for what they are about to own operationally.

A monolith has one deployment pipeline. One set of infrastructure to configure. One place to look at logs. One set of metrics. One runbook for when things go wrong. The operational complexity is low.

Ten services have ten deployment pipelines. Ten infrastructure configurations. Log aggregation that spans services. Distributed tracing to follow a request through multiple services. Ten runbooks, except the incidents that matter will involve multiple services and none of the runbooks will cover that. Service discovery. Health checking at the inter-service level. Circuit breakers for when downstream services are degraded.
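
To give a sense of what one item on that list involves, here is a minimal circuit breaker sketch. It is an illustration of the general pattern under simple assumptions (consecutive-failure counting, a fixed cool-down, one trial call after the cool-down), not a production implementation. `CircuitBreaker`, `CircuitOpenError` and the injectable `clock` parameter are hypothetical names.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the downstream."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int,
        reset_after_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_seconds:
                raise CircuitOpenError("downstream marked unhealthy")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures,
            # opens the circuit and restarts the cool-down.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Every caller of every degraded downstream needs something like this, with thresholds somebody has to choose, monitor and tune. That is one line item out of the list above.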

None of this complexity is impossible to manage. It is all solvable. But it requires a team that has the capacity to manage it, tools that have been set up before the split happens, and expertise that takes time to develop.

Most teams split their services and then build the operational infrastructure retroactively, while also trying to deliver product work, while also debugging the new distributed systems problems they did not have before. This is where the prolonged slowdown after a split comes from.

The teams that do this well build the operational infrastructure first. They get distributed tracing working in the monolith before they split it. They standardise their deployment pipeline before they have ten of them. They establish logging conventions before they have ten services emitting logs in subtly different formats.

# The operational baseline that must exist before splitting services.
# This is not optional infrastructure to add later.

# Centralised structured logging
logging:
  format: json
  fields:
    service: ${SERVICE_NAME}
    version: ${SERVICE_VERSION}
    environment: ${ENVIRONMENT}
    trace_id: ${TRACE_ID}    # Must be propagated across service calls
    span_id: ${SPAN_ID}

# Every service exposes these endpoints. No exceptions.
health_endpoints:
  liveness: /healthz      # Is the process running?
  readiness: /ready       # Is it ready to serve traffic?
  metrics: /metrics       # Prometheus metrics

# Every inter-service call propagates these headers
trace_propagation:
  headers:
    - traceparent           # W3C Trace Context
    - tracestate

# Every service has these alerts configured before it handles traffic
minimum_alerts:
  - error_rate_above_1_percent
  - p99_latency_above_1_second
  - service_unavailable

The monolith that should stay a monolith

Not every system should be microservices. This is easy to say and hard to accept in an industry where microservices became the mark of a serious engineering organisation.

The monolith that should stay a monolith is the one where:

The team is small enough that coordination overhead is low. Five to eight engineers can coordinate in a daily standup without the synchronisation cost becoming significant. For a team this size, the organisational problem that microservices solve does not exist.

The domain is not yet well understood. Early-stage products have unstable domain models. The concepts that seem fundamental change as you learn what you’re actually building. Service boundaries drawn around an unstable domain model have to be redrawn as the domain stabilises, which is expensive and demoralising. The monolith lets the domain model evolve cheaply. Split when the domain is understood.

The operational team does not exist. If nobody owns the infrastructure that a distributed system requires, the system will be operated badly. A well-operated monolith beats a poorly-operated distributed system every time.

The internal structure can be improved without splitting. A modular monolith with clear internal boundaries, enforced through package structure and dependency rules, provides most of the cognitive benefits of microservices (clear ownership, bounded contexts, interface discipline) without the distributed systems cost. It is not a compromise. For the right team and domain, it is the correct architecture.

# A modular monolith with enforced boundaries.
# orders/ cannot import directly from payments/.
# They communicate through defined interfaces.
# This is achievable without distributed systems.

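# src/orders/service.py
# The module paths for Order, PaymentGateway and EventBus are illustrative.
from orders.models import Order
from orders.ports import PaymentGateway  # Interface owned by orders
from orders.repository import OrderRepository
from orders.events import OrderPlaced  # Orders emits events
from shared.events import EventBus
# Forbidden, and enforced by linting rules:
# from payments.service import PaymentService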

class OrderService:
    def __init__(
        self,
        repository: OrderRepository,
        event_bus: EventBus,
        payment_gateway: PaymentGateway,  # Interface, not concrete payments module
    ):
        self.repository = repository
        self.event_bus = event_bus
        self.payment_gateway = payment_gateway

    async def place_order(self, customer_id: str, items: list) -> Order:
        order = Order.create(customer_id=customer_id, items=items)
        await self.repository.save(order)
        await self.event_bus.publish(OrderPlaced(order_id=order.id))
        return order

# The payments module listens for OrderPlaced events.
# It never gets called directly by orders.
# The boundary is real. It is enforced by design, not by a network.

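# src/payments/handlers.py
from orders.events import OrderPlaced  # Reading event schema is allowed
from payments.service import PaymentService

class PaymentEventHandler:
    def __init__(self, payment_service: PaymentService):
        # Injected, so the handler does not construct its own dependencies
        self.payment_service = payment_service

    async def on_order_placed(self, event: OrderPlaced) -> None:
        await self.payment_service.initiate_payment(order_id=event.order_id)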
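
"Enforced by linting rules" can sound like hand-waving, so here is a minimal sketch of what such a rule might look like, written with the standard library's `ast` module. The `FORBIDDEN` table and `forbidden_imports` are hypothetical. The sketch ignores relative imports and dynamic imports, and a dedicated tool such as import-linter would be the more likely choice in a real codebase.

```python
import ast

# Hypothetical rule table: package -> package prefixes it must not import.
FORBIDDEN = {
    "orders": ("payments",),
}


def forbidden_imports(package: str, source: str) -> list[str]:
    """Return the imported module names in `source` that break the rules
    for `package`. Only absolute imports are checked."""
    banned = FORBIDDEN.get(package, ())
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        found.extend(
            name for name in names
            if any(name == b or name.startswith(b + ".") for b in banned)
        )
    return found


violating = "from payments.service import PaymentService\n"
allowed = "from orders.events import OrderPlaced\n"
```

Run over every file under `orders/` in CI, a check like this fails the build when someone reaches across the boundary. `forbidden_imports("orders", violating)` reports `payments.service`, while the payments module importing `orders.events` passes because no rule forbids it.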

This is a real architecture, and it scales further than most teams expect before the overhead of splitting services becomes worth paying. Shopify ran a version of this for years. Stack Overflow still does. They are not unsophisticated organisations.

What the good teams understand

The teams that have figured out distributed systems share a perspective that took most of them several years and at least one failed microservices adoption to arrive at.

Services are not about code organisation. They are about team organisation. A service boundary that does not correspond to a team boundary is overhead without benefit.

The overhead of distributed systems is real, permanent, and compounding. You pay it forever. It needs to buy something worth having. For a team that is too large to coordinate, team autonomy is worth having. For a team that is not yet at that size, it is not.

The correct direction of reasoning is: we have an organisational problem, what architecture solves it? Not: we have an architecture trend, what organisation do we need to adopt it?

Microservices adopted as a technical decision produce the costs of distribution and the politics of boundary negotiation without the autonomy that makes them valuable. Microservices adopted as an organisational decision, by teams that have done the work of defining ownership and building operational foundations, produce systems that actually deliver what the pattern promises.

The technology has never been the hard part. The hard part is everything the technology forces you to sort out first. Most teams skip that part and wonder why the technology failed them.