The Code That Runs at 3am
Six months ago you were in a good mood. It was a Tuesday afternoon. You had coffee. The ticket was clear. You understood the codebase. You wrote something that solved the problem elegantly, got it reviewed, merged it, and moved on.
At 3am this morning, that code is killing you.
Not metaphorically. You’re awake. Your phone went off forty minutes ago. The service is down or degraded or behaving wrongly and customers are affected and someone is asking for an ETA on the fix and you are staring at code you wrote when you were a completely different person — rested, caffeinated, fully context-loaded — trying to understand it well enough to change it safely while you are none of those things.
This is the test that most code is never designed to pass. And it’s the one that matters most in production.
The two states of a developer
When you write code, you are in what I’ll call the high-context state. You know why this file exists. You know what the adjacent system does. You know what problem this code is solving and what alternatives you considered and rejected. You know the quirks of this particular service. You wrote the tests. You remember what edge cases you handled and which ones you explicitly decided not to handle.
High-context you is a capable, well-informed engineer making reasonable decisions with full situational awareness.
When you debug an incident at 3am, you are in the low-context state. You are tired. You have been asleep or trying to sleep. You are working against urgency — customers are affected, someone is waiting, the longer this takes the worse the impact. You are probably looking at logs and error messages first rather than at the code, which means by the time you open the code you already have a half-formed theory in your head that may or may not be right. The code you’re reading looks like someone else wrote it because, in a meaningful sense, they did.
Most code is written by high-context you and debugged by low-context you. The problem is that high-context you almost never thinks about this. High-context you is solving the problem in front of them, not imagining the experience of a future self who will be reading this half-asleep while an incident is open.
This is not a personal failure. It’s a predictable consequence of the cognitive state you’re in when writing code. You cannot easily simulate being exhausted and context-free when you are rested and fully loaded. The high-context state actively prevents you from accurately imagining the low-context experience.
Which means you have to build the habits that compensate for this blind spot, not when you’re in an incident, but right now, when the stakes are low and you have time to think.
What 3am actually looks like
I want to be specific about what the experience of debugging at 3am actually involves, because the specificity is where the lessons live.
The alert fires. You acknowledge it. You look at the dashboard. Something is wrong — error rate spiked, latency is up, a queue is backing up, a health check is failing. You have a graph that shows you the what. You have almost no immediate information about the why.
You start forming hypotheses. Was there a deployment recently? You check. Yes, three hours ago. That’s suspicious but deployments don’t always cause problems immediately. You look at what changed. It’s a diff of four files, two hundred lines. You try to understand what it does at 3am with your brain running at sixty percent.
You check the logs. There are errors. The errors say something. Whether what they say is useful depends on decisions made months ago by the person who wrote the code that’s failing. If that person — past you, or a colleague — wrote errors that explain what happened and why and what state the system was in when it happened, you have information. If they wrote errors that made sense in the context of writing the code but assume context you don’t have right now, you have noise.
You try to understand the scope. How many users are affected? Is it all of them or some of them? Is it a specific operation or all operations? Is it getting worse or is it stable? The answers to these questions determine how urgently you need to act and whether you have time to understand before you intervene.
At some point you have to make a decision under uncertainty. Do you roll back the deployment? Do you apply a quick fix? Do you restart the service and hope? Each of these has risks. Rollback is safe if the deployment caused the issue and risky if it didn’t, because then you’ve introduced another change into an already broken system. A quick fix is safe if you understand the problem well enough and dangerous if you don’t. Restart buys you time but doesn’t tell you anything.
The quality of this decision depends entirely on the quality of the information available to you, and the quality of the information available to you depends on decisions made months ago in the daytime by someone who wasn’t thinking about this moment.
The artifacts that save you
There are specific things that make the 3am experience survivable. None of them are complicated. Most of them are consistently skipped because they feel unnecessary when everything is working.
Logs that tell a story, not just a state.
There’s a version of logging that captures what happened:
Error: database query failed.
There’s a version that captures what happened and why it matters and what context surrounded it:
logger.error(
    "payment.processing.failed",
    order_id=order.id,
    user_id=order.user_id,
    payment_provider=provider.name,
    attempt_number=attempt,
    error_type=type(e).__name__,
    error_message=str(e),
    order_total=float(order.total),
    customer_tier=order.user.tier,
    time_since_order_created_seconds=(
        datetime.utcnow() - order.created_at
    ).total_seconds(),
)
The second version takes thirty extra seconds to write. At 3am it tells you exactly which provider is failing, for which tier of customer, how many times it has been tried, and whether this is a new order or one that’s been sitting in the system. You can answer the question “who is affected and how badly” in thirty seconds instead of twenty minutes.
The logging question to ask when writing any error path: if I were debugging this at 3am with no other context, what would I need to know? Write that.
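To make the "who is affected and how badly" claim concrete, here is a minimal sketch of how fields like these can be aggregated during an incident. It assumes the logs are emitted as one JSON object per line with the event name under an `event` key; that layout, the `summarize_failures` helper, and the sample lines are illustrative assumptions, not part of any particular logging setup.

```python
import json
from collections import Counter


def summarize_failures(log_lines):
    """Count payment failures by (provider, customer tier).

    Assumes one JSON object per line with an "event" key (an assumption
    about the log format, not a requirement of any specific library).
    """
    counts = Counter()
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured logs
        if not isinstance(event, dict):
            continue
        if event.get("event") != "payment.processing.failed":
            continue
        key = (
            event.get("payment_provider", "unknown"),
            event.get("customer_tier", "unknown"),
        )
        counts[key] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample = [
    '{"event": "payment.processing.failed", "payment_provider": "stripe", "customer_tier": "enterprise"}',
    '{"event": "payment.processing.failed", "payment_provider": "stripe", "customer_tier": "free"}',
    '{"event": "payment.processing.failed", "payment_provider": "stripe", "customer_tier": "free"}',
    '{"event": "order.created", "payment_provider": "paypal"}',
    "plain text line that is not JSON",
]

for (provider, tier), count in summarize_failures(sample).most_common():
    print(f"{provider:10} {tier:12} {count}")
```

None of this works if the log line only says "database query failed". The aggregation is trivial; the fields are what make it possible.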
Runbooks that are honest about uncertainty.
A runbook is documentation for an incident scenario. Most runbooks I’ve read are written optimistically — they describe what to do when you’re confident about what’s wrong. They don’t describe what to do when you’re not sure.
The most useful runbook I’ve ever used started with a section called “If you’re not sure what’s wrong.” It had a list of diagnostic commands in the order to run them, what each one told you, and how to interpret the output. It didn’t assume you already had a hypothesis. It assumed you were starting from zero.
## Payment Service Degradation

### If you don't know what's wrong yet, start here:

1. Check error rate by provider:
   `kubectl exec -it payment-pod -- python manage.py payment_stats --last 30m`
   If one provider is failing and others aren't: → see "Provider-specific failure"
   If all providers are failing: → see "Complete payment failure"
   If error rate is <5%: this may be normal variance, watch for 10 more minutes

2. Check if this correlates with a deployment:
   `git log --since="2 hours ago" --oneline origin/main`
   If yes: consider rollback (see "Rollback procedure")
   If no: deployment is probably not the cause

3. Check downstream dependency health:
   `curl https://status.stripe.com/api/v2/status.json | jq .status`
   If degraded: this is external, open incident with provider,
   inform stakeholders, no code change needed

### Signs this is getting worse (escalate immediately if):

- Error rate above 20% for more than 5 minutes
- Queue depth above 10,000 (normal is <500)
- Any database connection errors in the logs
This runbook assumes the reader is not sure what they’re looking at. It gives them a diagnostic path rather than a solution. Most runbooks give solutions to problems the reader can’t yet identify.
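The first step and the escalation list in that runbook form a small decision tree, and writing it down as code is one way to check that it has no gaps. The sketch below is an illustration under assumptions: the `triage` function, its parameters, and the choice to treat a provider as "failing" at an error rate of 5% or more are mine, derived from the thresholds in the sample runbook rather than from a real service.

```python
ESCALATE_QUEUE_DEPTH = 10_000    # from the runbook: normal is <500
ESCALATE_MINUTES_ABOVE_20 = 5    # "above 20% for more than 5 minutes"
NORMAL_VARIANCE = 0.05           # below this, the runbook says to watch


def triage(error_rates, queue_depth, db_connection_errors,
           minutes_above_20_percent=0.0):
    """Map the runbook's observations to the section to read next.

    error_rates: dict of provider name -> error rate in [0, 1].
    Returns one of: "escalate", "watch", "provider-specific-failure",
    "complete-payment-failure".
    """
    # Escalation signs are checked first: they override everything else.
    if (
        db_connection_errors
        or queue_depth > ESCALATE_QUEUE_DEPTH
        or minutes_above_20_percent > ESCALATE_MINUTES_ABOVE_20
    ):
        return "escalate"

    # Assumption: a provider counts as failing at >= 5% errors.
    failing = [p for p, rate in error_rates.items() if rate >= NORMAL_VARIANCE]
    if not failing:
        return "watch"  # may be normal variance, look again in 10 minutes
    if len(failing) == len(error_rates):
        return "complete-payment-failure"
    return "provider-specific-failure"


print(triage({"stripe": 0.30, "paypal": 0.01}, queue_depth=200,
             db_connection_errors=False))
```

Every combination of inputs lands on exactly one next step, which is the property you want from the "start here" section of a runbook.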
Deployment diffs that explain, not just show.
When you’re looking at a deployment that might have caused a problem, you need to understand what changed quickly. The git diff shows you what changed. It doesn’t tell you what effect those changes are expected to have or why the change was made.
A PR description written for the daytime reader tells you the ticket number and a brief description. A PR description written for the 3am reader tells you what this change does, what the expected effect on system behavior is, whether there are any edge cases to be aware of, and what to check if something goes wrong.
## What this does
Changes the retry logic for failed payment processing to use
exponential backoff with jitter instead of fixed 5-second retries.

## Why
Fixed retries were causing thundering herd against Stripe during
partial outages. Exponential backoff with jitter distributes the
retry load.

## Expected behavior change
- Failed payments now retry at 1s, 2s, 4s, 8s, 16s (with jitter)
  instead of 5s, 5s, 5s, 5s, 5s
- Total retry window increases from 25s to max ~31s
- Stripe load during payment failures should decrease significantly

## If something seems wrong after this ships
- Check payment retry latency in dashboard (should be higher per
  retry, lower overall due to reduced thundering herd)
- If all retries are failing immediately (not backing off), check
  that the retry configuration loaded correctly:
  `kubectl exec payment-pod -- env | grep RETRY`
- This change does not affect the payment success rate on
  non-retried attempts
The 3am engineer who is looking at this deployment as a potential cause of the current incident needs to know: would this change cause what I’m seeing? The PR description should answer that question.
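For readers who want to check the numbers in that sample description, here is a minimal sketch of a retry schedule that matches them. The description only says "with jitter", so the specific variant here ("equal jitter", where each delay lands between half the ceiling and the ceiling) is my assumption, and `retry_delays` is a hypothetical helper name.

```python
import random


def retry_delays(attempts=5, base=1.0, factor=2.0, rng=None):
    """Return one jittered delay per retry attempt.

    Ceilings grow as base * factor**attempt: 1s, 2s, 4s, 8s, 16s by default.
    Equal jitter (an assumption): each delay is drawn uniformly from
    [ceiling / 2, ceiling], so clients spread out but never retry instantly.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = base * factor ** attempt
        delays.append(ceiling / 2 + rng.uniform(0, ceiling / 2))
    return delays


delays = retry_delays(rng=random.Random(0))  # seeded so the run is repeatable
print([round(d, 2) for d in delays])
print("total window:", round(sum(delays), 2), "seconds (never more than 31)")
```

Because jitter in this variant only shortens a delay, the worst-case window is 1 + 2 + 4 + 8 + 16 = 31 seconds, which is where the "max ~31s" figure comes from.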
The architecture of survivability
Beyond the artifacts, there are structural decisions that determine how bad 3am gets.
Can you roll back in five minutes?
If the answer is no, your deployment process is building technical debt into every release. A rollback should be a single command that takes less than five minutes and doesn’t require knowing which previous version was good. If it requires manual steps, coordination, or knowledge of the codebase to execute, it will be slow and error-prone at exactly the moment you need it to be fast and reliable.
#!/bin/bash
# rollback.sh
# Usage: ./rollback.sh [steps]
# Rolls back to N releases ago. Default is 1.
# Assumes every commit on origin/main was released and that image
# tags are short commit hashes.
set -euo pipefail

STEPS="${1:-1}"
CURRENT=$(kubectl get deployment api -o jsonpath='{.metadata.annotations.deployed-version}')
TARGET=$(git log --oneline origin/main | awk -v n="$((STEPS + 1))" 'NR == n {print $1}')

# Stop before touching the deployment if no target was found
if [ -z "$TARGET" ]; then
    echo "Could not find a release $STEPS step(s) back" >&2
    exit 1
fi

echo "Rolling back from $CURRENT to $TARGET"
echo "Press enter to confirm or Ctrl+C to cancel"
read -r
kubectl set image deployment/api "api=ghcr.io/myteam/api:$TARGET"
kubectl rollout status deployment/api --timeout=5m
echo "Rollback complete. Verify: https://dashboard.internal/api-health"
The confirmation step is intentional. At 3am, you want one moment of deliberate action before you change what is running in production.
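The requirement that a rollback "doesn’t require knowing which previous version was good" implies that something has to record which releases were healthy. A minimal sketch of that selection logic follows. The `Release` record, its `healthy` flag, and the idea of a post-deploy check that sets it are all hypothetical; your deployment tooling may track this differently or not at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    version: str
    healthy: bool  # set by a hypothetical post-deploy health check


def rollback_target(history, steps=1):
    """Pick the Nth most recent known-good release.

    history is ordered newest first, and history[0] is what is running now.
    Releases that were never marked healthy are skipped, so the on-call
    engineer does not need to remember which ones were bad.
    """
    candidates = [release for release in history[1:] if release.healthy]
    if steps < 1 or len(candidates) < steps:
        raise LookupError(f"no known-good release {steps} step(s) back")
    return candidates[steps - 1].version


history = [
    Release("v5", healthy=False),  # currently deployed, suspected bad
    Release("v4", healthy=False),  # was itself rolled back last week
    Release("v3", healthy=True),
    Release("v2", healthy=True),
]
print(rollback_target(history))  # v3
```

The point of the sketch is the skip: stepping back one release lands on v3, not on v4, because v4 was never good.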
Can you limit the blast radius?
A failure that affects all users is a different incident from a failure that affects five percent of users. Feature flags, gradual rollouts, and traffic splitting don’t just help you roll out features safely — they give you tools to contain an incident that’s already happening.
If a new code path is causing problems, being able to disable it with a flag — without a deployment — changes the incident from a “figure out what’s wrong and fix it” problem to a “turn off the thing that’s causing it and figure it out in the morning” problem. The second problem is solvable at 3am. The first often isn’t.
from flagsmith import Flagsmith

# Flagsmith Python SDK v3 style; check the calls against your SDK version
flagsmith = Flagsmith(environment_key=FLAGSMITH_KEY)

async def process_payment(order: Order) -> PaymentResult:
    # If the new payment flow is causing problems, this can be
    # disabled from the dashboard without a deployment
    flags = flagsmith.get_environment_flags()
    if flags.is_feature_enabled("new_payment_flow"):
        return await _process_payment_v2(order)
    else:
        return await _process_payment_v1(order)
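One detail worth deciding in the daytime is what happens when the flag service itself is unreachable during the incident. The sketch below shows one option: fall back to the old, proven code path. `FlagClient` is a stand-in written for this example, not a real SDK, and the string return values replace the real payment functions so the block runs on its own.

```python
import asyncio


class FlagClient:
    """Hypothetical stand-in for a feature flag SDK."""

    def __init__(self, flags=None, available=True):
        self._flags = flags or {}
        self._available = available

    def is_enabled(self, name):
        if not self._available:
            raise ConnectionError("flag service unreachable")
        return self._flags.get(name, False)


def flag_enabled(client, name, default=False):
    """Read a flag, using the default if the flag service is down."""
    try:
        return client.is_enabled(name)
    except ConnectionError:
        # Safe default: an unreachable flag service must not be the
        # reason the risky new path stays switched on.
        return default


async def process_payment(order_id, client):
    if flag_enabled(client, "new_payment_flow"):
        return f"v2:{order_id}"  # placeholder for the new flow
    return f"v1:{order_id}"      # placeholder for the old flow


print(asyncio.run(process_payment("o-1", FlagClient({"new_payment_flow": True}))))
print(asyncio.run(process_payment("o-1", FlagClient(available=False))))
```

Defaulting to the old path is a design choice, not a rule. For a flag that guards a fix rather than a new feature, the safe default may be the opposite.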
Does your service degrade gracefully or fail completely?
A service that fails completely when a dependency is unavailable creates a binary incident: it’s working or it’s not. A service that degrades gracefully — returning cached data when the cache is fresh, disabling non-critical features when their dependencies are unavailable, accepting writes and processing them when the queue becomes available — gives you more options.
Graceful degradation is harder to build than binary fail/succeed. It requires thinking, while you’re building in the daytime, about which functionality is truly critical and which can be shed under load. This is exactly the kind of thinking that doesn’t happen naturally, because when you’re building a feature you’re thinking about how it works when everything is working.
import asyncio  # asyncio.timeout requires Python 3.11+

async def get_product_recommendations(user_id: str) -> list[Product]:
    try:
        # Try to get personalised recommendations
        async with asyncio.timeout(0.5):  # Hard 500ms limit
            return await recommendation_service.get(user_id)
    except (asyncio.TimeoutError, ServiceUnavailableError):
        # Fall back to popular products: not personalised, but
        # the page still works and the user gets something useful
        logger.warning(
            "recommendations.fallback",
            user_id=user_id,
            reason="recommendation_service_unavailable",
        )
        return await get_popular_products(limit=10)
The user gets recommendations either way. The recommendation service being down doesn’t take the page down. At 3am, this is the difference between a high-urgency incident and a background task.
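The cached-data form of degradation mentioned earlier follows the same shape. Here is a minimal sketch, assuming a synchronous fetch function that raises `ConnectionError` when its dependency is down. The `StaleTolerantCache` class, the 60-second freshness limit, and the pricing example are illustrative assumptions.

```python
import time


class StaleTolerantCache:
    """Serve the last good value while a dependency is down, up to a limit."""

    def __init__(self, fetch, max_stale_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, time stored)

    def get(self, key):
        try:
            value = self._fetch(key)
        except ConnectionError:
            entry = self._entries.get(key)
            if entry is None:
                raise  # nothing cached: fail rather than invent data
            cached_value, stored_at = entry
            if self._clock() - stored_at > self._max_stale:
                raise  # too old to trust: fail rather than mislead
            return cached_value
        self._entries[key] = (value, self._clock())
        return value


# Demo with a fake clock and a dependency we can switch off.
state = {"down": False, "now": 0.0}

def fetch_price(sku):
    if state["down"]:
        raise ConnectionError("pricing service unavailable")
    return 42

cache = StaleTolerantCache(fetch_price, max_stale_seconds=60,
                           clock=lambda: state["now"])
print(cache.get("sku-1"))   # fetched live and cached
state["down"] = True
state["now"] = 30.0
print(cache.get("sku-1"))   # dependency down, cached value still fresh
```

The freshness limit is the part that needs daytime thinking: how stale is too stale depends on the data, and nobody should be deciding that for the first time during an incident.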
The habit change
Everything above is a daytime activity. You write better logs while you’re implementing the feature, not while you’re debugging the incident. You write the runbook while you’re building the service, not while it’s on fire. You design graceful degradation when you’re thinking clearly about what can fail, not when it’s already failing.
The habit change is this: before you merge anything that touches a production code path, spend five minutes imagining the 3am scenario.
Not abstractly. Specifically. Imagine the alert fires at 3am. You’re looking at the logs from this code. What do the logs tell you? Imagine you’re looking at this PR as a potential cause of an incident. Does the description tell you what to check? Imagine the service needs to be rolled back right now. How long does it take? Imagine a dependency this code calls becomes unavailable. What happens to users?
If the answers to any of these questions are uncomfortable, you have specific things to fix before you merge. Not everything — you’re not trying to eliminate all risk, which is impossible. The specific things that would make the 3am scenario meaningfully harder than it needs to be.
This takes five minutes. It costs almost nothing in development time. It pays back in every future incident that’s thirty minutes shorter because past you had the foresight to write a useful log line.
The teammate you haven’t met yet
There’s another version of this that extends beyond yourself. The code you write today will be debugged at 3am not just by future you but by your colleagues, by engineers who haven’t joined the team yet, by the person who will maintain this service after you’ve moved on.
These people will have no access to the context you have right now. They won’t know why you made this decision or what alternatives you considered or what edge cases you were aware of. What they’ll have is the code, the tests, the logs, and whatever documentation exists.
Writing for 3am is writing for them. The comment that explains why this timeout is 500ms and not 1000ms — that’s for the engineer who will question it during an incident. The log line that includes the customer tier — that’s for the person who needs to assess impact before they know what’s wrong. The PR description that explains what to check if something goes wrong — that’s for whoever is on call when it does.
The code you write is a letter to people you haven’t met, in situations you can’t predict, at hours neither of you would choose. Write it accordingly.
That’s not a philosophy. That’s engineering.