The Infrastructure That Nobody Owns

There is a category of infrastructure that exists in almost every engineering organisation above a certain size, and it is almost never talked about directly.

It is not the infrastructure that is broken and causing incidents. Everyone knows about that. It is not the infrastructure that is being actively built or improved. That has owners and roadmaps and standups.

It is the infrastructure that is working. The system that has been running for three years without a major incident. The internal tool that twelve teams depend on every day. The shared service that was built by a team that no longer exists, maintained by nobody in particular, understood in full by nobody at all.

This infrastructure is the most dangerous thing in the organisation, precisely because its danger is invisible. It works, so it attracts no attention. It is depended upon, so it cannot be easily removed. It is not owned, so nobody is responsible for keeping it working. It is a ticking clock that the organisation has collectively agreed not to look at.

How it gets this way

No system starts without an owner. Someone built it. Someone made the architectural decisions. Someone wrote the runbook, or at least knew what to do when things went wrong.

The ownership decays through events that are individually unremarkable.

The team that built the system gets reorganised. The new team takes on the system as part of their portfolio but also inherits twelve other things. The system works fine, so it gets no attention. The team’s roadmap is defined by what they are building, not by what they are maintaining. The inherited system sits in a corner of the backlog, deprioritised indefinitely because there is always something more urgent.

The engineers who understood the system move on. Not all at once. One leaves for another company. Another moves to a different team. A third is promoted into a role where they no longer write code. Each departure takes a portion of the institutional knowledge with it. The remaining knowledge concentrates in one or two people, which looks fine until those people leave too.

The documentation, if it exists, was written when the system was first built and has not been updated since. The world has changed. The system has changed, in small ways, through patches and quick fixes that were never reflected in the documentation. The documentation describes a system that no longer exists exactly, which is almost worse than no documentation because it creates false confidence.

The dependency graph grows without anyone tracking it. A new service is built that calls into the shared system. Another team starts using the internal tool because it solves their problem and it is already there. The system’s usage grows while its ownership shrinks. The ratio of dependence to accountability becomes increasingly dangerous.
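The ratio of dependence to accountability can be made concrete with a small sketch. The inventory below is invented for illustration, and `SystemRecord` and `dependence_ratio` are hypothetical names rather than part of any real tool. The point is only that counting consumers against named maintainers makes the drift visible.

```python
from dataclasses import dataclass, field


@dataclass
class SystemRecord:
    """One entry in a hypothetical service inventory."""
    name: str
    dependents: set = field(default_factory=set)   # teams or services that call in
    maintainers: set = field(default_factory=set)  # named people, not teams


def dependence_ratio(system: SystemRecord) -> float:
    """Dependents per named maintainer; infinite when consumers exist but nobody maintains it."""
    if not system.maintainers:
        return float("inf") if system.dependents else 0.0
    return len(system.dependents) / len(system.maintainers)


# Illustrative data only.
queue = SystemRecord(
    name="shared-message-queue",
    dependents={"billing", "search", "notifications", "reporting"},
    maintainers={"alex"},
)
orphan = SystemRecord(name="internal-deploy-tool", dependents={"billing", "search"})

for record in (queue, orphan):
    print(record.name, dependence_ratio(record))
# shared-message-queue 4.0
# internal-deploy-tool inf
```

Tracked over time, a rising number for a system with a shrinking maintainer set is the pattern described above.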

What nobody-owns looks like from the inside

The engineers who work adjacent to unowned infrastructure develop a particular relationship with it that is worth describing precisely, because it explains why the problem persists even when everyone is aware of it.

They treat it as an environmental condition rather than a system. Weather is not owned. Gravity is not owned. The unowned infrastructure occupies the same cognitive category. It is there. It works or it does not. You work around it. You do not feel responsible for fixing it because you did not break it and it is not yours to fix.

They build local knowledge about its failure modes and do not share that knowledge systematically. An engineer who has been at the organisation long enough learns that the shared message queue needs to be restarted every six weeks or it starts dropping messages. They put a reminder in their calendar. They do not write a runbook because writing a runbook would require claiming some ownership of the system, and nobody wants to own it because ownership brings accountability without authority.

They route around it when they can. New services are built to avoid depending on the unowned system even when that creates duplication, because depending on it means being affected when it breaks and not being able to fix it. The duplication is inefficient. It is also rational given the incentives.

They develop a superstition about it. “Don’t touch that service. Every time someone touches it something breaks.” The superstition is not wrong exactly. Changes to systems that are not well understood do tend to cause unexpected failures. But the superstition prevents the investigation that would lead to understanding, which would make changes safer, which would allow the system to be improved. The superstition is self-reinforcing.

The incident that reveals it

The unowned system’s danger becomes concrete during incidents, and the specific texture of those incidents is different from incidents involving owned systems.

When an owned system fails, the response has a shape. The owner is paged. The owner has context. They know the system, they know its failure modes, they have seen incidents before and have intuitions about where to look. The runbook may be incomplete but it is a starting point. The incident resolves faster because the people responding to it are not starting from zero.

When an unowned system fails, the response has a different shape. The alert fires. The team that is on call is on call for their own services, not for this one. They look at the alert and their first question is not “what is wrong” but “what is this.” They start reading code they have never seen, written by people who are no longer at the company, solving problems they do not have full context for.

The incident takes longer not because the problem is harder but because the problem-solvers have no context. Every diagnostic step that would take two minutes for someone who knows the system takes twenty minutes for someone who is meeting the system for the first time in the middle of an incident.

The resolution, when it comes, is often a restart or a workaround rather than a fix. Nobody has enough confidence in the system to make a real fix under pressure. The restart works. The incident is closed. The underlying problem is not addressed because addressing it would require understanding the system, which would require time that is not available during an incident and is not allocated after it because the system is working again and there is always something more urgent.

The cycle continues.

# What the on-call engineer finds when the unowned system breaks:

# Last meaningful commit: 847 days ago
# Last commit author: engineer who left 2 years ago
# Tests: 12 tests, 9 failing, 3 passing for unclear reasons
# Documentation: README last updated when the service was launched
# Dependencies: requests==2.18.0, released 2017
# Monitoring: one alert, "service_down", no other metrics

# The incident response log:
#
# 02:14 - Alert fires: payment_gateway_service health check failing
# 02:15 - On-call acknowledges, begins investigation
# 02:31 - On-call: "I don't know what this service does exactly,
#          reading the code"
# 02:47 - On-call: "Found a connection pool leak, not sure how to fix safely"
# 02:51 - On-call: "Restarting the service"
# 02:52 - Service recovers
# 02:53 - On-call: "Marking resolved, will investigate tomorrow"
# 
# Tomorrow: three urgent features take priority
# Connection pool leak: never investigated
# Next incident: 34 days later, same cause


# What an owned system looks like:

# Runbook entry for connection pool exhaustion:
#
# SYMPTOMS: health check failing, logs show "connection pool exhausted"
# CAUSE: the pool has a known leak when requests timeout mid-query.
#        This happens roughly every 6 weeks under normal load.
# IMMEDIATE FIX: restart the service (./scripts/restart-payment-gateway.sh)
# PERMANENT FIX: tracked in issue #2847, scheduled for Q3
# VERIFY: after restart, check pool metrics at /metrics endpoint,
#         pool_connections_idle should be > 5
# ESCALATE: if restart does not resolve within 5 minutes, page
#           the platform team lead
#
# Last updated: 2026-03-15
# Incident history: 3 occurrences, all resolved by restart
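
The VERIFY step in that runbook entry is simple enough to script. The sketch below assumes the /metrics endpoint serves Prometheus-style plain text, which the runbook does not actually state. `read_gauge` and `pool_recovered` are hypothetical helper names, and fetching the text over HTTP is left out.

```python
from typing import Optional


def read_gauge(metrics_text: str, name: str) -> Optional[float]:
    """Return the value of an unlabelled metric from Prometheus-style text, or None."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None


def pool_recovered(metrics_text: str, min_idle: int = 5) -> bool:
    """Runbook VERIFY step: pool_connections_idle should be > 5 after a restart."""
    idle = read_gauge(metrics_text, "pool_connections_idle")
    return idle is not None and idle > min_idle


# Sample payload, invented for illustration.
sample = """\
# HELP pool_connections_idle Idle connections in the pool
# TYPE pool_connections_idle gauge
pool_connections_idle 8
pool_connections_active 2
"""
print(pool_recovered(sample))  # True
```

A missing metric is treated as a failed check, so a service that comes back without exposing pool metrics is escalated instead of being assumed healthy.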

The organisational conditions that create it

Unowned infrastructure is not the result of individual negligence. It is the predictable output of specific organisational conditions.

Ownership is defined by what teams build, not by what teams maintain. Team roadmaps are about new capabilities. The incentive structure rewards shipping features. The performance review process asks what you delivered. Maintenance work, the unglamorous work of keeping existing systems healthy, is invisible to this system. Nobody gets promoted for keeping the shared message queue running reliably for three years. So teams do not prioritise it. Each of those choices is individually rational; together they are collectively dangerous.

Staffing decisions treat maintenance as free. When a team builds a new internal service and then gets reorganised away, the organisation implicitly assumes that the system will continue to maintain itself. It will not. Systems require ongoing attention. Dependencies need updating. Failure modes need investigating. Documentation needs maintaining. The assumption that a working system requires no effort is consistently wrong and consistently made.

Platform and infrastructure work is chronically understaffed relative to its criticality. In most engineering organisations, the ratio of product engineers to platform engineers is large. The product engineers build things that depend on infrastructure. The platform engineers maintain the infrastructure that everything depends on. When the platform team is small, the amount of infrastructure they can meaningfully own is limited. The rest goes unowned.

The cost of this arrangement is hidden in incident duration, in developer productivity lost to unreliable tooling, in the routing-around behaviour that creates duplication, and in the risk carried by systems that are one bad day away from an incident nobody knows how to resolve. These costs are real and they do not appear on any budget line.

Making ownership real

Naming an owner is not the same as creating ownership. The common response to discovering unowned infrastructure is to assign it to a team. The team accepts the assignment. Nothing changes. The system continues to be effectively unowned because the assignment did not come with time, authority, or incentive to actually own it.

Real ownership requires four things, and an assignment that does not provide all four is an assignment in name only.

Time. The team that owns the system needs time allocated to maintaining it. Not time they carve out from feature delivery under competing pressure. Explicitly allocated time that their manager protects. Ten to twenty percent of the team’s capacity for maintenance work is the range to aim for; below ten percent, meaningful ownership does not happen.

Authority. The owner needs to be able to make decisions about the system without seeking approval from teams that depend on it. If changing the system requires consensus from the ten teams that use it, the owner cannot maintain it effectively. The owner sets the direction. Consumers provide input. They do not hold veto power.

Visibility. The health of owned systems needs to be visible to the organisation in a way that is connected to the owner’s standing. If the shared service has three incidents in a month and nobody connects that to the team that owns it, ownership has no accountability. If the team’s reliability metrics are visible and matter to how the team is evaluated, the owner has an incentive to invest in the system.

Investment. Newly owned systems typically need catch-up investment before they can be maintained sustainably. The dependency versions need updating. The monitoring needs improving. The documentation needs writing. The tests need repairing. This investment has to come from somewhere, and “wherever the team can find it” is not a real answer. A specific allocation for the catch-up work is required.
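
One way to keep an assignment honest is to record the four conditions explicitly and refuse to call it ownership while any is missing. This is a minimal sketch: the field names are hypothetical, and the ten percent floor is simply the lower end of the range given under Time.

```python
from dataclasses import dataclass

MIN_MAINTENANCE_CAPACITY = 0.10  # lower end of the ten-to-twenty-percent range


@dataclass
class OwnershipAssignment:
    team: str
    maintenance_capacity: float           # fraction of team capacity, e.g. 0.15
    can_decide_without_consumer_veto: bool
    reliability_metrics_visible: bool
    catch_up_allocation_days: int         # explicit budget for deferred maintenance


def missing_conditions(assignment: OwnershipAssignment) -> list:
    """Names of the four conditions this assignment fails to provide."""
    missing = []
    if assignment.maintenance_capacity < MIN_MAINTENANCE_CAPACITY:
        missing.append("time")
    if not assignment.can_decide_without_consumer_veto:
        missing.append("authority")
    if not assignment.reliability_metrics_visible:
        missing.append("visibility")
    if assignment.catch_up_allocation_days <= 0:
        missing.append("investment")
    return missing


# Illustrative only: a team handed a system with no time and no catch-up budget.
in_name_only = OwnershipAssignment(
    team="platform",
    maintenance_capacity=0.0,
    can_decide_without_consumer_veto=True,
    reliability_metrics_visible=True,
    catch_up_allocation_days=0,
)
print(missing_conditions(in_name_only))  # ['time', 'investment']
```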

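# A practical ownership transfer process
# that actually transfers ownership rather than just reassigning blame.
#
# Assumptions: `log` is a structlog-style logger (the warning call below
# passes keyword arguments), and assess_checklist() / execute_transfer()
# are supplied by the organisation's own tooling; they are not defined here.

import structlog

log = structlog.get_logger()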

OWNERSHIP_TRANSFER_CHECKLIST = {
    "knowledge_transfer": [
        "Architecture walkthrough with current and incoming owner",
        "Incident history reviewed and documented",
        "Known failure modes listed in runbook",
        "Dependency map updated and verified",
        "Hidden complexity identified and documented",
    ],
    "infrastructure": [
        "All dependencies inventoried with current versions",
        "Security vulnerabilities scanned and catalogued",
        "Monitoring coverage assessed against RED method",
        "Alerting reviewed: are alerts actionable?",
        "On-call runbook written or updated",
    ],
    "catch_up_investment": [
        "Backlog of deferred maintenance created and sized",
        "Capacity allocated for catch-up work",
        "Critical dependency updates scheduled",
        "Test coverage assessed and improvement planned",
    ],
    "ongoing_ownership": [
        "Team capacity allocated for maintenance",
        "Reliability metrics defined and visible",
        "Deprecation or improvement roadmap created",
        "Consumer communication channel established",
    ],
}


def transfer_ownership(system: str, from_team: str, to_team: str) -> bool:
    """
    Returns True only when all checklist items are complete.
    Ownership does not transfer until this function returns True.
    A system with an incomplete checklist remains with the current owner.
    There is no partial transfer. There is no "we'll finish the rest later."
    """
    # assess_checklist is assumed to return {item: is_complete} for every item.
    status = assess_checklist(system, OWNERSHIP_TRANSFER_CHECKLIST)
    incomplete = [item for item, done in status.items() if not done]

    if incomplete:
        log.warning(
            "ownership.transfer.blocked",
            system=system,
            to_team=to_team,
            incomplete_items=incomplete,
        )
        return False

    execute_transfer(system, from_team, to_team)
    return True
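
The transfer function above leans on an assessment helper that is not shown. A minimal sketch of one possible shape follows, assuming sign-offs are tracked as a plain set of completed item strings. `checklist_status` and `blocking_items` are hypothetical names, and the checklist is trimmed to two items to keep the example short.

```python
# A trimmed stand-in for the full checklist, which has four sections.
CHECKLIST = {
    "knowledge_transfer": ["Known failure modes listed in runbook"],
    "ongoing_ownership": ["Team capacity allocated for maintenance"],
}


def checklist_status(done: set, checklist: dict) -> dict:
    """Map every checklist item to whether it has been signed off."""
    return {
        item: item in done
        for items in checklist.values()
        for item in items
    }


def blocking_items(done: set, checklist: dict) -> list:
    """Items that still block the transfer, in checklist order."""
    status = checklist_status(done, checklist)
    return [item for item, complete in status.items() if not complete]


signed_off = {"Known failure modes listed in runbook"}
print(blocking_items(signed_off, CHECKLIST))
# ['Team capacity allocated for maintenance']
```

An empty list is the only state in which the transfer proceeds. Anything else leaves the system with its current owner, which is the all-or-nothing rule the docstring describes.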

The audit worth running

Most organisations do not know the full extent of their unowned infrastructure. They know about the systems that have caused recent incidents. They do not have a complete picture of the systems that are working, depended upon, and unowned, which are more numerous and more dangerous because their danger is quiet.

The audit is not complicated. It requires someone with standing to ask the question systematically and the authority to act on the answer.

For every system that is running in production and is depended upon by other systems or teams: who is the owner? Not the team it is assigned to. The specific person who is responsible for its health, who would be paged if it broke, who knows its failure modes, who has the context to debug it under pressure.

If that person cannot be named, the system is unowned, and the audit has found something that needs to be addressed.

The second question: if that person left tomorrow, who would own it? The system that has one owner and no successor is more vulnerable than it appears. The owner is a single point of failure for all of the knowledge and accountability that makes the system safe to run.

The third question: when did someone last actively look at this system when it was not broken? Not during an incident. Proactively, with the intention of understanding its health and identifying problems before they became incidents. If the answer is never or more than six months ago, the system is being operated reactively, which means the organisation is waiting for the incident to tell it what needs attention.

System Health Review (run quarterly for each owned system):

    Owner: [name, not team]
    Backup owner: [name]
    Last proactive review: [date]
    Last incident: [date, duration, root cause]
    Open maintenance items: [count and brief description]

    Dependencies:
        Critical dependencies out of date: [list]
        Security vulnerabilities unpatched: [count]

    Observability:
        Monitoring coverage: [what is measured]
        Alert quality: [are alerts actionable?]
        Runbook completeness: [is the runbook current?]

    Risk assessment:
        Single points of failure: [list]
        Known technical debt with incident potential: [list]
        Estimated time to resolve if owner left tomorrow: [hours]

    Action items with owners and dates:
        [specific, time-bounded, named]
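
The three audit questions can also be asked mechanically of an inventory, which helps when there are hundreds of systems to cover. The sketch below is illustrative: `AuditEntry` and `audit_findings` are hypothetical names, the data is invented, and "six months" is approximated as 183 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

SIX_MONTHS = timedelta(days=183)  # approximation of "six months"


@dataclass
class AuditEntry:
    system: str
    owner: Optional[str] = None                  # a named person, not a team
    backup_owner: Optional[str] = None
    last_proactive_review: Optional[date] = None


def audit_findings(entry: AuditEntry, today: date) -> list:
    """Apply the three audit questions and return the problems found."""
    findings = []
    if entry.owner is None:
        findings.append("unowned")
    elif entry.backup_owner is None:
        findings.append("no successor")
    if (entry.last_proactive_review is None
            or today - entry.last_proactive_review > SIX_MONTHS):
        findings.append("operated reactively")
    return findings


# Invented inventory for illustration.
today = date(2026, 3, 15)
inventory = [
    AuditEntry("payment-gateway"),
    AuditEntry("shared-message-queue", owner="alex",
               last_proactive_review=date(2025, 1, 1)),
    AuditEntry("deploy-tool", owner="sam", backup_owner="kim",
               last_proactive_review=date(2026, 1, 10)),
]
for entry in inventory:
    print(entry.system, audit_findings(entry, today))
```

The successor question is skipped for unowned systems because there is nobody to succeed. A script like this only finds candidates; whether a named owner actually has the context to debug the system under pressure still has to be asked in person.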

The thing the organisation is actually choosing

Every organisation that has unowned infrastructure has made a choice, even if it did not make it consciously. It has chosen to carry the risk of those systems failing unexpectedly, in ways that nobody is prepared for, at times that cannot be predicted.

This is sometimes the right choice. A system that is simple, well-understood even without a formal owner, and not on any critical path can be allowed to exist without formal ownership as long as the organisation is clear-eyed about the risk. The mistake is not carrying risk. It is carrying risk invisibly, without accounting for it, without choosing it deliberately.

The organisations that manage this well treat infrastructure ownership the same way they treat financial liabilities. The unowned system is a contingent liability. It may cost nothing. If it costs something, it will cost a lot, at a time that is inconvenient, in a way that is hard to predict. Responsible organisations account for their contingent liabilities. They know what they are carrying. They make deliberate choices about which liabilities to resolve and which to accept.

The audit is not about fixing everything. Some of what it finds will not be worth fixing, and that is a legitimate decision. But it should be a decision, made by people who understand what they are deciding, not an omission that was never examined.

The system that nobody owns is owned by everyone in the worst sense: when it breaks, everyone pays the cost, and nobody had the standing to prevent it.

Name the owner. Give them what they need. Look at the thing you have been not looking at.

That is the work. It is not glamorous. It is more important than most of what is on the roadmap.