API Decisions You Can't Take Back
Most mistakes in software are recoverable. You write bad code, you refactor it. You choose the wrong data structure, you replace it. You design a database schema poorly, you migrate it. These things have costs — sometimes significant costs — but they’re fundamentally reversible. The old version can be replaced by a new version, and the new version can be deployed.
APIs are different.
The moment you expose an interface to a consumer — whether that consumer is an external developer, a mobile app, a partner’s system, or another service in your own infrastructure — you have made a commitment. Every field in the response, every parameter in the request, every status code, every error format, every URL structure is now something someone is depending on. Changing it breaks them. Breaking them has costs that fall primarily on you, not on them.
This is the thing that makes API design unlike most other software design decisions: the blast radius of a mistake is not contained within your codebase. It extends to every client that has ever integrated with you, including the ones you don’t know about, including the ones who integrated three years ago and haven’t updated their code since, including the mobile app versions that are still running on users’ phones and will never be updated.
The result is that APIs accumulate the decisions made when they were first designed, long after those decisions would have been changed if they lived anywhere else. I’ve debugged systems carrying the weight of API decisions made by developers who left the company before anyone currently on the team was hired. The decisions were not obviously wrong when they were made. They just weren’t made with the care that permanent decisions deserve.
The decisions that actually matter
Not every API decision is equally hard to reverse. Some things — adding new optional fields, adding new endpoints, adding new optional query parameters — are backwards compatible and can be done at any time. The decisions that cause permanent damage are the ones that are hard to change without breaking existing clients.
Identifiers. Whatever type and format you choose for your resource identifiers will be embedded in client code, stored in client databases, included in client logs, and passed back to you in requests for years. A decision made quickly about whether IDs are integers or UUIDs, whether they’re sequential or random, whether they’re numeric or alphanumeric — this decision is nearly impossible to reverse once clients exist.
Integer IDs are simple, debuggable, and reveal information you may not want revealed: how many resources exist, what order they were created in, roughly when a resource was created. Once you’ve shipped integer IDs and clients are storing and displaying them, switching to UUIDs requires either a migration that invalidates every stored reference or a translation layer that maintains both systems indefinitely.
The case for UUIDs from the start is not just theoretical. An e-commerce platform with sequential order IDs tells its competitors exactly how many orders it’s processing. A platform with sequential user IDs tells any user exactly how many users signed up before them — which is information that affects perception of the product, especially early on. These are not edge cases. They’re things that have caused real problems.
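To make the leak concrete, here is a minimal sketch of the arithmetic an outsider could do with nothing more than two of their own order confirmations. The function name and the numbers are hypothetical, chosen only for illustration.

```python
from datetime import datetime


def estimate_daily_volume(
    first_id: int,
    first_seen: datetime,
    second_id: int,
    second_seen: datetime,
) -> float:
    """Estimate orders per day from two sequential IDs observed at known times."""
    elapsed_days = (second_seen - first_seen).total_seconds() / 86_400
    if elapsed_days <= 0:
        raise ValueError("second observation must be later than the first")
    return (second_id - first_id) / elapsed_days


# Hypothetical numbers: one order placed on a Monday, another a week later.
rate = estimate_daily_volume(
    104_200, datetime(2026, 5, 4, 9, 0),
    111_200, datetime(2026, 5, 11, 9, 0),
)
print(f"about {rate:.0f} orders per day")  # about 1000 orders per day
```

Two purchases and a subtraction are enough. Nothing about the API has to be misused for this to work.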
There is a middle path that’s worth considering: ULIDs or other lexicographically sortable unique identifiers, which give you global uniqueness and time-ordered sorting without being enumerable. They do still encode a creation timestamp, so they hide how many resources exist but not when each one was created. The choice between these options is not the important thing. Making the choice deliberately, before the first client exists, is.
import uuid
from ulid import ULID  # third-party package, for example python-ulid
def generate_order_id() -> str:
    # UUID: globally unique, not sortable, not enumerable
    # return str(uuid.uuid4())
    # ULID: globally unique, lexicographically sortable by creation time,
    # not enumerable, URL-safe
    return str(ULID())
    # Produces something like: 01ARZ3NDEKTSV4RRFFQ69G5FAV
# What not to do:
def create_order_wrong(db) -> dict:
    result = db.execute("INSERT INTO orders DEFAULT VALUES RETURNING id")
    return {"id": result.id}  # Sequential integer. Permanent decision.
# What to do:
def create_order_right(db) -> dict:
    order_id = generate_order_id()
    # Parameters are passed as a sequence, as DB-API style drivers expect.
    db.execute("INSERT INTO orders (id) VALUES (?)", (order_id,))
    return {"id": order_id}
Envelope structure. Does your API return resources directly or wrapped in an envelope? Both choices are common and both have consequences.
Direct response:
{
  "id": "01ARZ3NDEKTSV4RRFFQ69G5FAV",
  "email": "[email protected]",
  "created_at": "2026-05-14T09:00:00Z"
}
Enveloped response:
{
  "data": {
    "id": "01ARZ3NDEKTSV4RRFFQ69G5FAV",
    "email": "[email protected]",
    "created_at": "2026-05-14T09:00:00Z"
  },
  "meta": {
    "request_id": "req_abc123"
  }
}
The envelope gives you space to add metadata — request IDs, pagination information, deprecation warnings — without changing the structure of the resource itself. The direct response is simpler for clients who don’t need the metadata. Once you’ve shipped either one, switching to the other is a breaking change. Every client has written code against the structure they received.
The argument for envelopes is that you will eventually want the space they provide. You don’t know on day one that you’ll want to add deprecation warnings to responses, but you will want to. You don’t know you’ll want to add server-timing metadata, but you will. The envelope costs almost nothing upfront and pays for itself the first time you need it.
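A minimal sketch of what that space buys you, assuming a hypothetical `envelope` helper and a hypothetical `deprecation` key in `meta` (neither name comes from any standard): metadata added long after launch leaves the resource under `data` untouched.

```python
from typing import Any, Optional


def envelope(data: Any, request_id: str, deprecation: Optional[str] = None) -> dict:
    """Wrap a resource (or a list of resources) in a data/meta envelope."""
    meta: dict[str, Any] = {"request_id": request_id}
    if deprecation is not None:
        # Hypothetical field added after launch. Clients that only read
        # "data" never notice it.
        meta["deprecation"] = deprecation
    return {"data": data, "meta": meta}


user = {"id": "01ARZ3NDEKTSV4RRFFQ69G5FAV", "created_at": "2026-05-14T09:00:00Z"}

day_one = envelope(user, "req_abc123")
two_years_later = envelope(
    user, "req_def456", deprecation="This endpoint is scheduled for removal."
)

# The resource itself is identical in both responses.
assert day_one["data"] == two_years_later["data"]
```

With a direct response, the same deprecation notice would have to go into a header or into the resource itself, where it competes with real fields.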
Pagination. The decision between offset pagination and cursor-based pagination looks like an implementation detail. It’s not. It’s an API contract.
Offset pagination: GET /orders?page=3&per_page=20
Cursor pagination: GET /orders?after=01ARZ3NDEKTSV4RRFFQ69G5FAV&limit=20
Offset pagination is intuitive — page three of twenty results per page is a concept users understand. It’s also incorrect for most real use cases. If items are added or removed from the result set while a client is paginating through it, offset pagination returns duplicates or skips items. For a client that’s processing all orders to generate a report, this means the report is wrong. For a client that’s displaying a feed, this means items appear or disappear unexpectedly.
Cursor pagination is correct. The cursor points to a specific item in the result set, and each request asks for the items that follow it. Items added or removed elsewhere in the set can’t shift that position, so the client never sees a duplicate and never skips an item that was present throughout. The tradeoff is that you can’t jump to an arbitrary page; you have to paginate forward from a position.
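The difference is easy to reproduce with an in-memory list standing in for the orders table. This is a toy simulation, not a database; `offset_page` and `cursor_page` are hypothetical helpers written for this illustration.

```python
from typing import Optional


def offset_page(ids: list[int], page: int, per_page: int) -> list[int]:
    """Offset pagination: skip (page - 1) * per_page rows, then take per_page."""
    start = (page - 1) * per_page
    return ids[start:start + per_page]


def cursor_page(ids: list[int], after: Optional[int], limit: int) -> list[int]:
    """Cursor pagination: take the first `limit` ids greater than the cursor."""
    remaining = [i for i in ids if after is None or i > after]
    return remaining[:limit]


orders = [1, 2, 3, 4, 5]

first_offset = offset_page(orders, page=1, per_page=2)   # [1, 2]
first_cursor = cursor_page(orders, after=None, limit=2)  # [1, 2]

# Order 1 is deleted while the client is between pages.
orders.remove(1)

second_offset = offset_page(orders, page=2, per_page=2)               # [4, 5]
second_cursor = cursor_page(orders, after=first_cursor[-1], limit=2)  # [3, 4]

print("offset saw:", first_offset + second_offset)  # order 3 is never returned
print("cursor saw:", first_cursor + second_cursor)
```

The offset client walks away believing it has seen everything up to order 5 and never receives order 3. The cursor client simply continues from the last item it actually saw.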
The client code for these two approaches is different enough that switching from one to the other requires clients to rewrite their pagination logic. Once offset pagination is shipped and clients have written code against it, the migration to cursor pagination is a breaking change. The correct approach is cursor pagination from the start, accepting the constraint that arbitrary page jumps are not supported, because that constraint is actually an accurate reflection of how database-backed pagination works.
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar
T = TypeVar("T")
@dataclass
class CursorPage(Generic[T]):
    items: list[T]
    next_cursor: Optional[str]
    has_more: bool
# `db` is assumed to be an asyncpg-style connection or pool.
async def get_orders(
    after: Optional[str] = None,
    limit: int = 20,
) -> CursorPage[dict]:
    # Hard cap. Never expose unlimited. The floor of 1 keeps a limit of 0
    # from producing an empty page that still claims has_more.
    limit = max(1, min(limit, 100))
    query = """
        SELECT id, user_id, total, status, created_at
        FROM orders
        WHERE ($1::text IS NULL OR id > $1)
        ORDER BY id ASC
        LIMIT $2
    """
    # Fetch one extra row to learn whether another page exists.
    rows = await db.fetch(query, after, limit + 1)
    has_more = len(rows) > limit
    items = rows[:limit]
    next_cursor = items[-1]["id"] if has_more else None
    return CursorPage(
        items=[dict(row) for row in items],
        next_cursor=next_cursor,
        has_more=has_more,
    )
The response shape this produces:
{
  "data": [
    {"id": "01ARZ3...", "total": 49.99, "status": "shipped"},
    {"id": "01ARZ4...", "total": 129.00, "status": "delivered"}
  ],
  "pagination": {
    "next_cursor": "01ARZ4NDEKTSV4RRFFQ69G5FAV",
    "has_more": true
  }
}
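Getting from a page object to that JSON takes one small mapping step. The sketch below assumes a simplified, non-generic `CursorPage` and a hypothetical `page_to_response` helper; the key names follow the response shape above.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CursorPage:
    items: list[dict[str, Any]]
    next_cursor: Optional[str]
    has_more: bool


def page_to_response(page: CursorPage) -> dict:
    """Map a page of results onto the data/pagination response shape."""
    return {
        "data": page.items,
        "pagination": {
            "next_cursor": page.next_cursor,
            "has_more": page.has_more,
        },
    }


last_page = CursorPage(
    items=[{"id": "01ARZ5...", "status": "shipped"}],
    next_cursor=None,
    has_more=False,
)
body = json.dumps(page_to_response(last_page))
```

On the last page `next_cursor` serializes as `null`, so clients should treat `has_more` as the signal to stop rather than testing the cursor for emptiness.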
Error format. Errors are the part of the API that gets the least design attention and causes the most client pain. When something goes wrong, the client needs to know: what went wrong, why, and whether retrying will help. An error response that conveys only the HTTP status code conveys almost none of this.
The error format you ship first is the format clients will build error handling around. Changing it later means every client’s error handling breaks.
from enum import Enum
from dataclasses import dataclass, field
from typing import Optional
class ErrorCode(str, Enum):
    # Validation errors
    VALIDATION_FAILED = "validation_failed"
    REQUIRED_FIELD_MISSING = "required_field_missing"
    INVALID_FORMAT = "invalid_format"
    # Auth errors
    UNAUTHORIZED = "unauthorized"
    FORBIDDEN = "forbidden"
    TOKEN_EXPIRED = "token_expired"
    # Resource errors
    NOT_FOUND = "not_found"
    CONFLICT = "conflict"
    GONE = "gone"
    # Downstream errors
    PAYMENT_FAILED = "payment_failed"
    INVENTORY_UNAVAILABLE = "inventory_unavailable"
    # System errors
    INTERNAL_ERROR = "internal_error"
    SERVICE_UNAVAILABLE = "service_unavailable"
    RATE_LIMITED = "rate_limited"
@dataclass
class APIError:
    code: ErrorCode
    message: str
    request_id: str
    retryable: bool = False
    details: dict = field(default_factory=dict)
    doc_url: Optional[str] = None
    def to_response(self) -> dict:
        response = {
            "error": {
                "code": self.code.value,
                "message": self.message,
                "request_id": self.request_id,
                "retryable": self.retryable,
            }
        }
        # Optional fields are omitted entirely rather than sent as null.
        if self.details:
            response["error"]["details"] = self.details
        if self.doc_url:
            response["error"]["doc_url"] = self.doc_url
        return response
# Usage (PaymentException is the application's own exception type, assumed
# to expose is_transient, provider, and decline_code):
def handle_payment_error(e: PaymentException, request_id: str) -> tuple[dict, int]:
    error = APIError(
        code=ErrorCode.PAYMENT_FAILED,
        message="Payment could not be processed.",
        request_id=request_id,
        retryable=e.is_transient,
        details={
            "payment_provider": e.provider,
            "decline_code": e.decline_code,
        },
        doc_url="https://docs.example.com/errors/payment_failed",
    )
    return error.to_response(), 402
The retryable field is worth highlighting. A client that doesn’t know whether to retry is going to either retry everything, causing thundering herd problems during incidents, or retry nothing, causing unnecessary failures during transient errors. Telling the client explicitly whether a retry is appropriate is one of the highest-value things an API can do and almost nobody does it.
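On the client side, honoring that field takes very little code. The sketch below is one possible policy, not a prescription: `backoff` and `should_retry` are hypothetical helpers, and the base delay, cap, and attempt limit are assumed values you would tune. The random jitter is what keeps many clients from retrying in lockstep.

```python
import random


def backoff(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter.

    Returns a random delay in seconds between 0 and min(cap, base * 2**attempt).
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def should_retry(error: dict, attempt: int, max_attempts: int = 5) -> bool:
    """Retry only if the server marked the error retryable and attempts remain.

    `attempt` is zero-based: 0 means the first request has just failed.
    A missing "retryable" field is treated as not retryable.
    """
    return bool(error.get("retryable", False)) and attempt + 1 < max_attempts


transient = {"code": "service_unavailable", "retryable": True}
permanent = {"code": "validation_failed", "retryable": False}

print(should_retry(transient, attempt=0))  # True
print(should_retry(transient, attempt=4))  # False: attempts exhausted
print(should_retry(permanent, attempt=0))  # False
```

Treating a missing field as non-retryable is the conservative default; it fails fast instead of hammering a server that never said retrying was safe.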
Date and time format. This sounds trivial. It is not trivial.
If you return timestamps as Unix epoch integers, clients have to convert them to understand them. Debugging log output becomes harder. The format is unambiguous about the point in time, since epoch time is defined relative to UTC, but the value itself doesn’t say whether it counts seconds or milliseconds, so clients have to learn that from the documentation.
If you return timestamps as strings without timezone information, such as “2026-05-14 09:00:00”, you have created an ambiguity that clients will resolve incorrectly in at least some cases, and the bugs will be subtle and timezone-dependent.
The correct format is ISO 8601 with an explicit UTC designator: 2026-05-14T09:00:00Z. It’s human-readable. It’s unambiguous. Nearly every language has a standard parser for it, though a few older ones need help with the Z suffix. The Z suffix makes the timezone explicit. There is no reasonable argument against it, and yet a significant fraction of APIs return something else.
from datetime import datetime, timezone
def format_timestamp(dt: datetime) -> str:
    # Always ensure UTC, always include timezone
    if dt.tzinfo is None:
        # Naive datetimes are assumed to already be in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    else:
        dt = dt.astimezone(timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")
# In your serializer:
def serialize_order(order) -> dict:
    return {
        "id": order.id,
        "created_at": format_timestamp(order.created_at),
        "updated_at": format_timestamp(order.updated_at),
        # Not: order.created_at.timestamp()  (epoch number)
        # Not: str(order.created_at)          (no timezone if naive)
        # Not: order.created_at.isoformat()   (emits +00:00 or nothing, never Z)
    }
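The other half of the contract is the client parsing what you emit. In Python, `datetime.fromisoformat` accepts a trailing Z only from version 3.11 on, so a client targeting older versions needs a small shim. The `parse_timestamp` helper below is a hypothetical sketch of one; it also rejects timestamps with no timezone rather than guessing.

```python
from datetime import datetime, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2026-05-14T09:00:00Z into UTC."""
    if value.endswith("Z"):
        # Before Python 3.11, fromisoformat rejects the Z designator.
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Refuse to guess: a timestamp without an offset is ambiguous.
        raise ValueError(f"timestamp has no timezone: {value!r}")
    return parsed.astimezone(timezone.utc)


created_at = parse_timestamp("2026-05-14T09:00:00Z")
same_instant = parse_timestamp("2026-05-14T11:00:00+02:00")
assert created_at == same_instant
```

If the API only ever emits the Z form, every client gets to write the simple version of this function once and never think about it again.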
Versioning is not the escape hatch it appears to be
When developers discover they’ve made a bad API decision, the instinct is to version the API. V1 has the bad decision. V2 has the fix. Clients migrate when they’re ready.
This approach has more problems than it solves.
Versioning means maintaining two versions of the API indefinitely, because not all clients will migrate and you cannot force them to. You now have two codepaths to test, two sets of edge cases to handle, two versions of the documentation to maintain. The V1 clients who never migrated are the ones most likely to be running old, forgotten integrations where nobody knows the business purpose anymore, which means they’re the hardest to migrate and the last to be turned off.
More importantly: versioning does not fix the underlying problem for clients who have already integrated. A client that has built their entire system around V1’s integer IDs cannot migrate to V2’s UUIDs without updating their own database schema, their own API responses, their own logs, their own user-facing displays. The version number changed. The migration is still enormous.
Versioning is valuable for making breaking changes without immediate disruption. It is not a mechanism for making those changes cheap. The cheapest thing is to make the right decision before there are clients.
If you’re going to version, version by major breaking change rather than by calendar period or feature release. V2 should represent a genuine break with V1’s contract, not a new bundle of features that could have been added backwards-compatibly.
# URL versioning — clear, explicit, widely understood
GET /v1/orders/{id}
GET /v2/orders/{id}
# Header versioning — cleaner URLs, less visible, harder to test in browser
GET /orders/{id}
Accept: application/vnd.example.v2+json
# Query parameter versioning — avoid. Easy to forget, messy to cache.
GET /orders/{id}?version=2
URL versioning is the most practical for most APIs. It’s explicit, cacheable, and testable without special tooling. The cost — that the version is in the URL — is also its benefit. It’s visible. You can’t forget it.
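One way to keep the cost of two versions down is to share everything except serialization. The sketch below assumes hypothetical names throughout (`split_version`, the two serializers, and a `legacy_id` field holding V1's integer ID); it illustrates the shape of the idea, not a framework's routing API.

```python
import re
from typing import Callable

_VERSION_PREFIX = re.compile(r"^/v(\d+)(/.*)$")


def split_version(path: str) -> tuple[int, str]:
    """Split '/v2/orders/abc' into (2, '/orders/abc')."""
    match = _VERSION_PREFIX.match(path)
    if match is None:
        raise ValueError(f"path has no version prefix: {path!r}")
    return int(match.group(1)), match.group(2)


def serialize_order_v1(order: dict) -> dict:
    # V1 contract: direct response, integer ID.
    return {"id": order["legacy_id"], "status": order["status"]}


def serialize_order_v2(order: dict) -> dict:
    # V2 contract: enveloped response, string ID.
    return {"data": {"id": order["id"], "status": order["status"]}}


SERIALIZERS: dict[int, Callable[[dict], dict]] = {
    1: serialize_order_v1,
    2: serialize_order_v2,
}


def render_order(path: str, order: dict) -> dict:
    """Pick the serializer for the version named in the URL."""
    version, _ = split_version(path)
    if version not in SERIALIZERS:
        raise ValueError(f"unsupported API version: {version}")
    return SERIALIZERS[version](order)


order = {"id": "01ARZ3NDEKTSV4RRFFQ69G5FAV", "legacy_id": 48213, "status": "shipped"}
v1_body = render_order("/v1/orders/48213", order)
v2_body = render_order("/v2/orders/01ARZ3NDEKTSV4RRFFQ69G5FAV", order)
```

Note what this does not solve: the stored order still has to carry both identifiers for as long as V1 exists, which is the translation layer described earlier.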
The practice that prevents most of this
The cleanest way to avoid permanent API mistakes is to treat your first API client as a design review. Not a user. A reviewer.
Before you ship an API endpoint, write the client code for it. Not the server code — the client code. Write the code that would call your endpoint, parse the response, handle the errors, paginate through the results. Do this before you finalize the design.
This practice is uncomfortable because it forces you to think about the API from the perspective of the person who has to live with it, before you’ve committed to a design. It’s uncomfortable in the way that reading your own writing aloud is uncomfortable — it surfaces problems that looked invisible in the abstract.
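# Write this before you finalize the API design.
# If it's awkward, the API is awkward.
import asyncio
from datetime import datetime
from decimal import Decimal
from typing import Optional
# Assumed to exist in the client codebase: api_client, process_order, backoff,
# a structlog-style logger, and a client-side APIError exception class.
MAX_ATTEMPTS = 5
async def sync_orders(last_sync_cursor: Optional[str]) -> Optional[str]:
    """
    Sync orders from the API, returning the cursor for the next sync.
    """
    cursor = last_sync_cursor
    synced = 0
    attempt = 0
    while True:
        response = await api_client.get(
            "/v1/orders",
            params={"after": cursor, "limit": 100},
        )
        if not response.ok:
            error = response.json()["error"]
            attempt += 1
            # Retry only when the server says so, and only a bounded number of times.
            if error["retryable"] and attempt < MAX_ATTEMPTS:
                await asyncio.sleep(backoff(attempt))
                continue
            raise APIError(error["code"], error["message"])
        attempt = 0
        page = response.json()
        for order in page["data"]:
            await process_order(
                id=order["id"],
                # fromisoformat accepts the trailing Z only on Python 3.11+
                created_at=datetime.fromisoformat(order["created_at"]),
                total=Decimal(str(order["total"])),
                status=order["status"],
            )
            synced += 1
        if not page["pagination"]["has_more"]:
            # next_cursor is null on the last page, so resume from the last
            # order seen. This relies on the cursor being the order ID.
            if page["data"]:
                cursor = page["data"][-1]["id"]
            break
        cursor = page["pagination"]["next_cursor"]
    logger.info("sync.completed", orders_synced=synced, next_cursor=cursor)
    return cursor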
If this code is clean, the API design is probably clean. If you find yourself writing defensive checks for edge cases that shouldn’t exist, or parsing strings that should have been typed, or handling errors that don’t have enough information to handle correctly — the design has problems that are cheaper to fix now than after the first client is in production.
What you’re actually designing
An API is a contract. Not in the legal sense — most APIs don’t come with enforceable SLAs for backwards compatibility. In the practical sense: people will build things that depend on it, and those things will break when it changes in ways they didn’t expect.
The decisions that go into that contract deserve the weight of permanent decisions, even when they’re made in a sprint. Especially when they’re made in a sprint, because the pressure of moving fast is exactly when the temptation to defer design questions is highest.
The questions worth slowing down for: What type are the identifiers? What does the envelope look like? How does pagination work? What information does an error carry? What format are timestamps in? How will breaking changes be communicated?
None of these questions is hard to answer once you’ve thought about them. They’re hard to answer later, when the answer requires coordination with every client you’ve ever had.
Answer them now. Then ship with confidence.
