
The Reliability Contract Nobody Signed


Somewhere in the terms of service of almost every software product is a clause that says something like: the service is provided as-is, we make no guarantees of availability, we are not liable for any disruption to your use of the platform.

Nobody reads this clause. More importantly, nobody believes it. When the service goes down, when the data is lost, when the feature that a business has built its operations around disappears without warning, the legal disclaimer does nothing to prevent the feeling of betrayal that follows. The users feel let down. They feel that something was taken from them. They feel that the product failed a promise it made.

The product never made that promise. Not formally. Not legally.

And yet the promise is real, in the sense that it shapes behaviour just as surely as a written contract would. Users build their workflows around software they have decided to trust. They make commitments to clients and colleagues that depend on that software working. They store things they care about inside systems they do not control. They do all of this without a guarantee, because the alternative is not using the software, and not using the software means falling behind everyone else who is using it.

The gap between the legal reality (no guarantees) and the lived reality (complete dependency) is where most of the interesting tension in software reliability lives.

How trust accumulates silently

Trust in a software product does not arrive as a decision. It accumulates through repeated successful use.

The first time someone uses a new tool they are cautious. They keep a backup. They do not build critical workflows around it. They test it in low-stakes situations before relying on it for anything important. This is rational behaviour in the presence of uncertainty about reliability.

The tenth time they use it without incident, the caution begins to fade. The hundredth time, it has largely disappeared. The tool has demonstrated that it works, and the brain, which is not well calibrated to reason about tail risks, converts this track record into confidence that it will continue to work.

By the time a person has used a tool for two years, the trust is so deeply embedded in their working patterns that they have forgotten it exists as trust rather than fact. They do not think “I trust that this will work.” They think “this will work,” the way they think “the light will come on when I flip the switch.” The tool has been mentally reclassified from a product with uncertain reliability to a piece of infrastructure with assumed reliability.

The product has not changed. It still has the same SLA it always had, which is probably vague and full of carve-outs. What has changed is the user’s mental model of it, and the user’s mental model is what determines how they feel when it fails.

This is why outages at mature products with large user bases produce a level of anger that seems disproportionate to the technical reality of what happened. A two-hour outage is not a catastrophe in any absolute sense. It is a catastrophe in the relative sense of violating an expectation that had been quietly building for years. The anger is proportional to the trust that was accumulated, which is proportional to the time the product was reliable, which means the most trusted products produce the most intense reactions when they fail.

What the team experiences

From the other side of this, the experience of an engineering team during a significant outage is genuinely difficult in a way that the users who are angry about it rarely understand.

The team knows things that the users do not. They know how hard reliability is to maintain at scale. They know that the failure was not intentional and that significant effort went into preventing it. They know that the people who were working to resolve the incident were under real pressure, often for extended periods, often in the middle of the night.

They also know, in a way that is uncomfortable to sit with, that the anger is not entirely unjustified. They did accumulate that trust. They did build something that people came to depend on. They did fail to maintain the reliability that the dependency implied. The disclaimer in the terms of service does not change the moral reality of what it means to have people build their working lives around something you made and then have it fail them.

The response to this knowledge varies in ways that reveal a lot about the culture of the team.

Some teams retreat into the legal reality: we promised nothing, the terms are clear, the anger is misplaced. This is technically defensible and relationally destructive. The users who feel betrayed do not feel less betrayed because they are told they were not owed what they thought they were owed.

Some teams are contrite almost to excess, apologising in terms that concede obligations and set precedents the company may not intend. This is relationally intelligent and occasionally legally awkward.

The teams that navigate this most effectively acknowledge the reality of the dependency that was built while being honest about the limits of what any software can guarantee. They treat the trust as real, even though it was never formally extended, because it was real in its effects on the people who extended it.

The silent upgrade that changed the product

There is a specific version of this contract violation that generates particular anger and that the software industry has not fully reckoned with: the unilateral change.

A user builds something that depends on a feature. The feature works a certain way. The user knows how it works and has designed around it. Then the product changes the feature, or removes it, or changes the behaviour in a way that breaks everything built around the old behaviour.

From the product’s perspective, this is often a legitimate and necessary change. The old behaviour was technically wrong, or was creating scaling problems, or was inconsistent with how the rest of the product worked. The new behaviour is better in ways that matter to the product’s long-term health.

From the user’s perspective, something they depended on was taken from them. Not something that was offered as provisional. Something that they built on, which creates a reasonable inference that it was intended to be stable.

The specific anger this produces is different from the anger of an outage. An outage is unexpected. The product failed at what it was trying to do. A unilateral change is intentional. The product decided that what the user had built around it mattered less than some other consideration. The user was not consulted. They found out when their thing stopped working.

This happens constantly in the software industry. It happens with more intensity in the current moment because so many products are built on top of AI capabilities whose behaviour is even less predictable and more subject to change than traditional software. A product built on top of a language model depends on that model behaving a certain way. When the model is updated, the behaviour changes. The product’s users experience this as their tool having changed without warning, because it has.

The implicit promise that makes this feel like a betrayal is not anywhere in the terms of service. It is in the nature of building something and inviting people to build on top of it.
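One way a team building on a model might notice such a change before its users do is to keep a small set of recorded prompts and outputs and re-run them whenever the upstream model changes. The sketch below is a minimal illustration under stated assumptions: `find_behaviour_drift`, the `golden` mapping, and the stand-in model functions are all hypothetical names, and the model is represented as a plain function from prompt to text rather than any real vendor API.

```python
from typing import Callable, Dict, List


def find_behaviour_drift(model: Callable[[str], str], golden: Dict[str, str]) -> List[str]:
    """Return the prompts whose current output differs from the recorded output.

    Exact string comparison is a deliberate simplification; real model output
    usually varies between runs and would need a looser comparison.
    """
    drifted = []
    for prompt, expected in golden.items():
        if model(prompt).strip() != expected.strip():
            drifted.append(prompt)
    return drifted


# Hypothetical stand-ins for the model before and after an upstream update.
def old_model(prompt: str) -> str:
    return prompt.upper()


def new_model(prompt: str) -> str:
    # The update changes behaviour for long prompts only.
    return prompt.upper() if len(prompt) < 10 else prompt.title()


# Outputs recorded while the old behaviour was in place.
golden = {p: old_model(p) for p in ["refund", "cancel my subscription"]}

print(find_behaviour_drift(old_model, golden))  # no drift against the recorded behaviour
print(find_behaviour_drift(new_model, golden))  # lists the prompt whose output changed
```

A check like this does not prevent the change. It only moves the moment of discovery from the user to the team, which is the precondition for warning anyone.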

What reliability actually requires

Reliability in software is not primarily a technical problem, though it has technical components. It is a problem of alignment between what a product commits to and what it actually delivers, where the commitment is not just the formal one in the SLA but the informal one embedded in how the product presents itself and how users come to depend on it.

The products that have the best relationships with their users around reliability are not necessarily the ones with the highest uptime, though uptime matters. They are the ones that have been honest about the gap between the formal commitment and the informal expectation.

They communicate proactively when things go wrong rather than waiting for users to notice. They explain what happened in terms that are meaningful to users rather than in technically accurate but practically opaque language. They are honest about what they do not yet know during an incident rather than projecting confidence they do not have. They acknowledge when a change will break things people have built rather than treating the breakage as the user’s problem for having built around undocumented behaviour.

None of this is engineering work in the narrow sense. It is the work of maintaining a relationship, which requires understanding that a relationship exists even when it was never formally defined.

What honest reliability communication looks like

During an outage:
Not: “We are investigating reports of service degradation.”
But: “Our payment processing is down. We know this is breaking your checkouts. Here is what we know, here is what we do not know yet, and here is where we will post updates.”

After an outage:
Not: “The incident was caused by a configuration error that has been corrected. We apologise for any inconvenience.”
But: “Here is exactly what happened, when we knew about it, how long it took us to fix it, why it took that long, and what we have changed so this specific failure mode cannot happen again.”

Before a breaking change:
Not: Release notes buried in a changelog with a six-week notice.
But: Direct communication to the users who will be affected, describing specifically what will break, when, and what they need to do to prepare. With enough lead time to actually prepare.

When you do not know:
Not: “We are working to resolve this as quickly as possible.”
But: “We do not yet know what caused this. We are still investigating. We will update in thirty minutes whether or not we have an answer by then.”

The last one is the hardest. Communicating uncertainty during an incident requires resisting the institutional pressure to project competence when the honest position is that you do not fully understand what is happening yet. The pressure exists for understandable reasons: uncertainty increases user anxiety, and organisations naturally want to manage that anxiety. But false confidence during an incident is discovered when the resolution reveals that the earlier confident statements were not accurate, which damages trust more than the honest uncertainty would have.
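One way to make that kind of honesty the default is to give every incident update a shape that has a slot for what is not yet known and a committed time for the next update. The following is a minimal sketch, not a real status-page API: the `IncidentUpdate` class, its field names, and the example timestamp are all assumptions made for illustration, with the thirty-minute interval taken from the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class IncidentUpdate:
    """One status update during an incident. All field names are illustrative."""
    impact: str                 # what is broken, in the user's terms
    known: List[str]            # facts confirmed so far
    unknown: List[str]          # open questions, stated plainly
    posted_at: datetime         # assumed to be a UTC timestamp
    next_update_in: timedelta = timedelta(minutes=30)

    @property
    def next_update_at(self) -> datetime:
        return self.posted_at + self.next_update_in

    def render(self) -> str:
        # An empty list is stated explicitly rather than silently omitted.
        known = self.known or ["Nothing confirmed yet."]
        unknown = self.unknown or ["No open questions."]
        lines = [self.impact, "", "What we know:"]
        lines += [f"- {item}" for item in known]
        lines += ["", "What we do not know yet:"]
        lines += [f"- {item}" for item in unknown]
        lines += ["", f"Next update by {self.next_update_at:%H:%M} UTC, "
                      "whether or not we have an answer by then."]
        return "\n".join(lines)


# Hypothetical example; the date and wording are arbitrary.
update = IncidentUpdate(
    impact="Payment processing is down and checkouts are failing.",
    known=[],
    unknown=["What caused the failure"],
    posted_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(update.render())
```

The design choice is that "unknown" is a required field rather than an optional one. Leaving it empty is then a statement the author has to make deliberately, not a gap nobody notices.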

The dependency that cannot be undone

There is a category of reliability failure that is qualitatively different from outages and breaking changes: the product that shuts down.

When a software product shuts down, everything built on top of it stops working. The data stored in it becomes inaccessible or is lost entirely. The workflows designed around it have to be redesigned. The institutional knowledge that accumulated around using it specifically has to be rebuilt around something else. The switching cost, which was the product of years of use, is paid all at once in the worst possible circumstances.

This has happened often enough that it should inform how people think about building on any particular product. Consumer products shut down. Developer tools shut down. Infrastructure services shut down. Companies get acquired and their products get discontinued. Pricing models change in ways that make continued use impossible for the users who were depending on the product being affordable.

The users who are hurt most are the ones who built the deepest dependency. The switching cost they accumulated through years of use was supposed to be a cost they would never have to pay because they would never switch. When the product shuts down, that assumption is violated. The cost is paid not on their terms but on the product’s terms.

This is not a reason to avoid building on external products. The alternative, building everything from scratch, is impractical and produces worse outcomes. It is a reason to think carefully about which dependencies are deep enough to make this a serious risk, and to make deliberate choices about which products to build that kind of dependency on.

The questions worth asking before building a deep dependency on any product: how long has it existed and what does its trajectory suggest about its future? Who owns it and what are their incentives? Is there a plausible migration path if it goes away? What would it cost to rebuild what I am building around it, and is that cost acceptable if I have to pay it unexpectedly?

These questions do not always have reassuring answers. Sometimes the right product to build on is one whose future is uncertain, because it solves the problem better than anything else available. The point is to make that choice with eyes open rather than to discover its risks when they materialise.

The relationship that software creates

The reason the reliability contract matters, even though it was never signed, is that software creates a relationship between the people who build it and the people who use it. This is not a metaphor. It is a description of a real interdependency that has real consequences for both parties.

The people who build software make decisions that affect the working lives, sometimes the personal lives, of the people who use it. They decide what features exist and how they work and when they change and whether they continue to exist at all. They decide how reliable the product will be by choosing how much to invest in reliability. They decide how to communicate when things go wrong and what to prioritise in the aftermath.

The people who use software give the product a role in their lives that makes them vulnerable to these decisions. They cannot evaluate most of the decisions being made. They cannot see inside the product. They do not know what trade-offs were made during development or what risks are being accepted on their behalf. They are depending on the judgment and values of the people who built the product, without having negotiated the terms on which they depend.

This is not unique to software. It is the normal condition of depending on anything built by other people. But it has a particular texture in software because the product can be changed so quickly, because the changes can take effect immediately for everyone, and because the complexity of what people build around software has grown to the point where the cost of disruption to the dependency is very high.

The industry that has grown up around building software has not, for the most part, developed a culture that fully acknowledges this relationship and its asymmetry. The legal frameworks are designed to protect the builder from liability for the dependency they created. The communication norms are designed to manage user anxiety rather than to give users an accurate picture of the risks they are carrying. The product decisions are made according to the product’s interests without always considering what those decisions mean for the people who have built around the product.

The products that have the most loyal users, that survive the longest, that weather failures and come back from them, are usually the ones whose teams understand the relationship and act accordingly. They are not the ones with the lowest failure rates, though they try to keep those low. They are the ones where the users feel, after something has gone wrong, that the team was honest with them, understood what the failure cost them, and genuinely tried to do better.

That feeling is worth more than any SLA.

It is also not in any terms of service.

It is something that gets built through a long sequence of choices about how to treat the people your product is made for, and it is lost faster than it is built.

The contract nobody signed turns out to be the one that matters most.