
Technical Debt Is a Management Problem


The framing of technical debt as a technical problem is one of the most expensive misunderstandings in software engineering.

When engineers talk about technical debt, they mean code that is hard to work with, architecture that no longer fits the problem, tests that were never written, documentation that does not exist, dependencies that have not been updated, abstractions that accumulated without a plan. All of these are real. All of them are technical in their manifestation. None of them are technical in their origin.

Technical debt accumulates through a series of decisions about where engineering time goes. A feature is prioritised over refactoring. A deadline is hit by skipping tests. A quick fix ships instead of the right fix because the right fix would take a week and the quick fix takes a day. These are resource allocation decisions. They are made by people with authority over engineering time, which means they are made by managers, by product owners, by founders, by whoever controls what the team works on.

Engineers do not accumulate technical debt. Organisations accumulate technical debt through the decisions they make about engineering priorities. The engineers are the ones who live with it.

Why the misattribution matters

When technical debt is framed as a technical problem, the implicit responsibility for addressing it falls on engineers. The team should refactor more. The engineers should push back harder. The tech lead should advocate more effectively. The implication is that the debt exists because of technical failure, and the remedy is technical discipline.

This framing is not just inaccurate. It is actively harmful, because it places accountability where it does not belong and removes it from where it does.

An engineer who pushes back on a deadline because the code needs time to be done properly is exercising technical judgment in a context where they do not have final authority over the timeline. They can advocate. They cannot decide. If the person who does have authority decides that the deadline takes priority, the debt accumulates regardless of how clearly the engineer communicated the tradeoff. Blaming the engineer for the resulting debt means blaming the person who flagged the problem for the decision of the person who overrode the warning.

This happens constantly. Engineers internalise responsibility for conditions they did not create and cannot resolve without resources they do not control. The result is a specific kind of professional demoralisation that is distinct from ordinary frustration: the experience of being accountable for an outcome you cannot affect.

The management that holds engineers responsible for technical debt without examining its own role in creating that debt is managing poorly. Not maliciously, usually. The misattribution is common enough that many managers who perpetuate it have never had it named clearly for them.

How debt actually accumulates

Technical debt accumulates through three mechanisms, and understanding each one is necessary for understanding where the intervention belongs.

Explicit tradeoffs made under pressure. The team knows the right way to build the thing. There is not enough time to build it that way. Someone decides that the quick version ships and the right version gets addressed later. The debt is deliberate, visible, and understood at the moment it is created.

This is the version most people think of when they think about technical debt. It is also the version where the debt metaphor is most apt: you are borrowing against the future, you know the interest rate, and you intend to repay. The problem is that repayment requires future time allocation, and the same pressures that created the debt are usually still present when repayment is due. The debt rolls over. The interest accumulates.

Decisions that were right and became wrong. An architectural decision that was correct given the constraints and knowledge at the time becomes a source of friction as the system evolves and the constraints change. This is not a tradeoff or a mistake. It is the natural consequence of building software in a changing context.

This version of debt is less visible because it was not created consciously. The system was built correctly. The system became wrong. Recognising this requires enough distance from the original decision to see it without the defensiveness that comes from having made it. Teams often fail to recognise this category because the people who built the thing are still present, and acknowledging that the design is now a source of debt can feel like criticising their work.

Accumulated neglect of operational concerns. Tests not written. Documentation not maintained. Dependencies not updated. Monitoring not added. These are not single decisions. They are the cumulative result of never allocating time for them. The team always has more pressing things to do. The operational work gets perpetually deferred.

This is the most insidious category because it is invisible until it manifests as something painful. The dependency that has not been updated for two years does not cause a problem until a security vulnerability is disclosed in it. The test that was never written does not cause an incident until the behaviour it would have tested regresses in production. The absence of the work is not visible. Only its consequences are.
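One way to make this kind of neglect visible before it becomes painful is to count outdated dependencies on a schedule. The sketch below is a minimal illustration for a Python project, assuming pip is the package manager and that the environment can reach a package index. The function names and the sample package are hypothetical.

```python
import json
import subprocess
import sys


def parse_outdated(pip_json: str) -> list[tuple[str, str, str]]:
    """Turn pip's JSON report into (name, installed, latest) tuples."""
    entries = json.loads(pip_json) if pip_json.strip() else []
    return [(e["name"], e["version"], e["latest_version"]) for e in entries]


def count_outdated_dependencies() -> int:
    """Ask pip which packages in the current environment are outdated.

    Assumption: pip is installed for the running interpreter and has
    network access to a package index.
    """
    result = subprocess.run(
        [sys.executable, "-m", "pip", "list", "--outdated", "--format=json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return len(parse_outdated(result.stdout))


# Illustrative input in the shape pip emits; the package and versions are made up.
sample = (
    '[{"name": "examplelib", "version": "1.2.0", '
    '"latest_version": "3.0.1", "latest_filetype": "wheel"}]'
)
print(parse_outdated(sample))
```

Recording the count at a regular interval turns an invisible absence into a number that can be watched.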

# Debt accumulation is measurable even when it feels abstract.
# These metrics make it concrete and trackable.

from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class DebtInventory:
    """
    A snapshot of technical debt at a point in time.
    Track this monthly. The trend is more important than any single number.
    """
    snapshot_date: datetime
    project_name: str

    # Dependency health
    outdated_dependencies: int = 0
    security_vulnerabilities: int = 0
    deprecated_dependency_usages: int = 0

    # Test coverage
    test_coverage_percent: float = 0.0
    untested_critical_paths: list[str] = field(default_factory=list)

    # Code health indicators
    cyclomatic_complexity_violations: int = 0
    duplicated_code_blocks: int = 0
    long_functions_over_50_lines: int = 0

    # Operational
    undocumented_services: int = 0
    alerts_without_runbooks: int = 0
    services_with_no_owner: int = 0

    # Age indicators
    oldest_unaddressed_bug_days: int = 0
    average_pr_age_before_merge_hours: float = 0.0

    def debt_score(self) -> int:
        """
        A single number for tracking trend over time.
        Not a precise measurement. A directional signal.
        Lower is better. Rising score means debt is accumulating.
        """
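        # Weights are judgment calls: security and ownership gaps weigh most.
        score = 0
        score += self.outdated_dependencies * 2
        score += self.security_vulnerabilities * 10
        score += self.cyclomatic_complexity_violations * 3
        score += self.undocumented_services * 5
        score += self.alerts_without_runbooks * 5
        score += self.services_with_no_owner * 8
        # Penalise each coverage point below 80%; truncate so the score stays an int.
        score += int(max(0.0, 80 - self.test_coverage_percent) * 2)
        # Add 3 points for every full 30 days the oldest open bug has waited.
        score += self.oldest_unaddressed_bug_days // 30 * 3
        return score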

    def is_worsening(self, previous: "DebtInventory") -> bool:
        return self.debt_score() > previous.debt_score()

    def report(self) -> str:
        lines = [
            f"Technical Debt Snapshot: {self.snapshot_date.date()}",
            f"Project: {self.project_name}",
            f"Debt Score: {self.debt_score()} (lower is better)",
            "",
            "Dependency Health:",
            f"  Outdated dependencies: {self.outdated_dependencies}",
            f"  Security vulnerabilities: {self.security_vulnerabilities}",
            "",
            "Test Coverage:",
            f"  Overall coverage: {self.test_coverage_percent:.1f}%",
            f"  Untested critical paths: {len(self.untested_critical_paths)}",
            "",
            "Operational:",
            f"  Undocumented services: {self.undocumented_services}",
            f"  Alerts without runbooks: {self.alerts_without_runbooks}",
            f"  Services with no owner: {self.services_with_no_owner}",
        ]
        return "\n".join(lines)

The debt score is not a precise measurement. It is a direction indicator. A rising score over three months means the organisation is accumulating debt faster than it is addressing it. A falling score means debt is being paid down faster than it is being created. The number is less important than whether it is moving in the right direction.
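Reading the direction can itself be automated. The following is a minimal sketch, assuming one score is recorded per month and that three consecutive snapshots are enough to call a trend. The function name, the window size, and the labels are illustrative assumptions.

```python
from typing import Sequence


def debt_trend(scores: Sequence[float], window: int = 3) -> str:
    """Classify the direction of the most recent `window` monthly scores."""
    if len(scores) < window:
        return "insufficient data"
    recent = list(scores[-window:])
    # Month-over-month changes inside the window.
    deltas = [later - earlier for earlier, later in zip(recent, recent[1:])]
    net_change = recent[-1] - recent[0]
    if net_change > 0 and all(d >= 0 for d in deltas):
        return "accumulating"
    if net_change < 0 and all(d <= 0 for d in deltas):
        return "paying down"
    return "mixed"


# Hypothetical monthly scores, oldest first.
print(debt_trend([120, 131, 140]))
print(debt_trend([140, 131, 120]))
print(debt_trend([120, 140, 130]))
```

A "mixed" result is deliberately not treated as good news: one better month inside a noisy series says little about whether the balance has changed.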

The conversation that does not happen

In most engineering organisations, technical debt is discussed in two contexts: in technical team discussions where engineers describe the pain and propose solutions, and in post-incident reviews where the debt is identified as a contributing factor after something breaks.

The conversation that almost never happens is the one between engineering leadership and product leadership about the trade that is being made. Not “we have technical debt” as a general statement. A specific accounting: these are the decisions we have made in the last quarter that have created debt, this is the estimated cost of that debt in engineering velocity over the next six months, and this is what it would take to address it.

This conversation is hard to have because it requires making visible the connection between business decisions and technical consequences. When a product team decides to add three features in a quarter that was already fully resourced, the engineering team ships the features by cutting quality somewhere. The product team experiences the feature delivery. The quality cut is invisible to them. The debt it creates is invisible to them. The slowdown it causes six months later is attributed to engineering inefficiency rather than to the original decision.

Making this connection visible is an engineering leadership responsibility. It is also genuinely difficult because the connection is real but not simple. Technical debt is not solely caused by feature pressure. Not all debt is the result of explicit tradeoffs. The relationship between past decisions and current velocity is hard to quantify precisely.

But hard to quantify precisely is not the same as impossible to communicate. The approximate story can be told even when the precise numbers are unavailable.

A template for the conversation:

Context: In Q1, the team delivered [features]. To hit the commitments we had, we deferred [specific work] and made the following explicit tradeoffs: [list of known shortcuts taken].

Consequence: The deferred work has created friction in the following areas: [specific examples of where engineers are slower because of the debt]. Our estimate is that the team is spending approximately [X hours/week] on debt-related overhead that would not exist if we had taken the time to do it properly.

The ask: We need [Y weeks] this quarter where [some percentage] of engineering capacity goes to debt reduction rather than feature delivery. The features this displaces are [list]. We believe this investment returns [Z] in velocity improvement by Q3.

What we are not asking: We are not asking to stop delivering features. We are asking for a specific, time-bounded allocation to reduce the debt that is slowing down feature delivery.

This conversation is more persuasive than “we need time to pay down technical debt” because it is specific, it connects the debt to recognisable past decisions, and it frames the investment as a velocity improvement rather than as maintenance.
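The placeholders in the template can be backed by rough arithmetic. The sketch below estimates how many weeks it takes for the requested allocation to pay for itself. It assumes a 40-hour engineer week and that some fraction of the weekly overhead disappears once the work is done. Every number in the example call is hypothetical.

```python
def payback_weeks(
    team_size: int,
    allocation_weeks: float,
    allocation_fraction: float,
    overhead_hours_per_week: float,
    expected_reduction: float,
    hours_per_engineer_week: float = 40.0,
) -> float:
    """Weeks after the investment until recovered hours equal invested hours."""
    # Hours the team diverts from feature work during the allocation.
    investment_hours = (
        team_size * allocation_weeks * allocation_fraction * hours_per_engineer_week
    )
    # Hours per week the team stops losing once the debt is addressed.
    recovered_per_week = overhead_hours_per_week * expected_reduction
    if recovered_per_week <= 0:
        return float("inf")
    return investment_hours / recovered_per_week


# Hypothetical: 6 engineers, 6 weeks at 25% capacity, 30 h/week of team-wide
# overhead today, half of which is expected to go away.
print(payback_weeks(6, 6, 0.25, 30, 0.5))  # 360 invested / 15 per week = 24.0
```

The estimate is only as good as its inputs, but stating the inputs gives the other side of the table something concrete to challenge.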

Why “just refactor as you go” does not work

The standard advice for managing technical debt is to refactor continuously, improving the code in small increments as part of regular feature work. This advice is correct in principle and insufficient in practice for a specific reason: not all debt can be addressed incrementally.

Some debt is local. A poorly written function, a missing test for a specific code path, a confusing variable name. This debt can be addressed in passing. Refactoring it takes minutes. Doing it as part of regular work is sensible and sufficient.

Some debt is structural. The architecture that cannot support the scaling requirements without fundamental rethinking. The data model that was designed for a use case the product has outgrown. The service boundary that is in the wrong place and requires every new feature to cross it awkwardly. This debt cannot be addressed in passing. Addressing it requires focused time, a clear plan, and the willingness to temporarily accept slower feature velocity while the structural work is done.

The advice to refactor as you go handles the local category well and the structural category not at all. Teams that follow it exclusively accumulate structural debt that is invisible in any individual code review and enormous in aggregate.

# Classifying debt by addressability.
# Local debt can be addressed in passing.
# Systemic debt sits in between and needs a few dedicated days.
# Structural debt requires dedicated allocation.

from enum import Enum
from dataclasses import dataclass


class DebtType(Enum):
    LOCAL = "local"           # Addressable in passing, hours of work
    SYSTEMIC = "systemic"     # Requires dedicated time, days of work
    STRUCTURAL = "structural" # Requires planning and allocation, weeks of work


@dataclass
class DebtItem:
    description: str
    debt_type: DebtType
    estimated_days: float
    velocity_impact_per_month: float  # Hours lost per month due to this debt
    affected_areas: list[str]
    created_by: str  # The decision that created this debt
    created_at: str


def prioritise_debt_items(items: list[DebtItem]) -> list[DebtItem]:
    """
    Priority is determined by return on investment:
    velocity hours recovered per day invested.

    Structural debt can still rank highly despite its cost when it slows
    every feature that touches the affected area; the ranking depends
    entirely on the estimates supplied.
    """
    def roi(item: DebtItem) -> float:
        months_of_benefit = 12  # Assume 12 months of benefit
        total_hours_recovered = item.velocity_impact_per_month * months_of_benefit
        investment_hours = item.estimated_days * 8
        return total_hours_recovered / investment_hours if investment_hours > 0 else 0

    return sorted(items, key=roi, reverse=True)


# Example debt inventory for a payment service:
payment_service_debt = [
    DebtItem(
        description="Payment processing logic duplicated across three code paths",
        debt_type=DebtType.SYSTEMIC,
        estimated_days=5,
        velocity_impact_per_month=8,  # 8 hours/month spent on bugs from duplication
        affected_areas=["payment-service", "refund-service", "subscription-service"],
        created_by="Q3 2025 fast-follow features shipped under deadline",
        created_at="2025-09-15",
    ),
    DebtItem(
        description="Payment service tightly coupled to legacy order service",
        debt_type=DebtType.STRUCTURAL,
        estimated_days=20,
        velocity_impact_per_month=20,  # Every payment feature requires order service changes
        affected_areas=["payment-service", "order-service", "checkout-flow"],
        created_by="Original monolith extraction in 2024 that was not completed",
        created_at="2024-03-01",
    ),
    DebtItem(
        description="Missing tests for payment failure edge cases",
        debt_type=DebtType.LOCAL,
        estimated_days=3,
        velocity_impact_per_month=4,  # Production bugs from untested paths
        affected_areas=["payment-service"],
        created_by="Q4 2025 launch deadline, tests deferred",
        created_at="2025-11-30",
    ),
]

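prioritised = prioritise_debt_items(payment_service_debt)
# With these estimates the duplication item ranks first (96 h recovered / 40 h
# invested = 2.4), then the missing tests (48 / 24 = 2.0), then the structural
# coupling (240 / 160 = 1.5). The structural item recovers the most hours in
# absolute terms, but its cost pushes it down a ratio-based ranking.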

The allocation model that works

The most effective model for managing technical debt that I have seen in practice is not “refactor as you go” and it is not “dedicate a sprint to debt every quarter.” It is a consistent percentage of engineering capacity allocated to non-feature work, treated as a standing commitment rather than something to be negotiated each cycle.

Twenty percent is the number most often cited, most often ignored, and most often correct. One day per week, per engineer, for work that is not directly feature delivery: refactoring, test writing, documentation, dependency updates, architectural improvement, operational work.

In my experience, the organisations that maintain this allocation consistently have lower debt levels, higher engineer retention, faster feature velocity (because the debt is not slowing them down), and fewer production incidents. The organisations that treat this allocation as a luxury to be restored after the current crunch end up in permanent crunch, because the debt each crunch creates makes the next one more likely.

The number is not magic. The consistency is. A team that gets twenty percent every week for a year is in a fundamentally different position than a team that gets forty percent for two weeks after a major incident and then nothing for the following six months. The inconsistent allocation does not reduce debt. It manages incidents.
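The difference between the two patterns is easy to put in numbers. This is a small sketch, assuming a five-day working week and approximating six months as 26 weeks. The function names are illustrative.

```python
def debt_days_per_engineer(weekly_fractions, days_per_week=5):
    """Total working days one engineer spends on debt across the given weeks."""
    return sum(fraction * days_per_week for fraction in weekly_fractions)


def longest_gap_weeks(weekly_fractions):
    """Longest run of consecutive weeks with no debt allocation at all."""
    longest = current = 0
    for fraction in weekly_fractions:
        current = current + 1 if fraction == 0 else 0
        longest = max(longest, current)
    return longest


steady = [0.20] * 52              # twenty percent every week for a year
burst = [0.40] * 2 + [0.0] * 26   # forty percent for two weeks, then nothing

for label, pattern in (("steady", steady), ("burst", burst)):
    print(
        label,
        round(debt_days_per_engineer(pattern), 1),
        "days,",
        longest_gap_weeks(pattern),
        "week gap",
    )
```

Under these assumptions the steady pattern yields about 52 days per engineer with no gap, while the burst pattern yields about 4 days followed by 26 weeks in which nothing is addressed.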

# Tracking the allocation in practice.
# If you do not track it, you do not maintain it.

from dataclasses import dataclass
from datetime import date


@dataclass
class SprintAllocation:
    sprint_start: date
    sprint_end: date
    total_story_points: int
    feature_points: int
    debt_reduction_points: int
    operational_points: int
    unplanned_incident_points: int

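    def _share(self, points: int) -> float:
        # Guard against an empty sprint so the ratios never divide by zero.
        if self.total_story_points <= 0:
            return 0.0
        return points / self.total_story_points

    @property
    def debt_reduction_percent(self) -> float:
        # Despite the name, this is a fraction between 0 and 1, not 0 to 100.
        return self._share(self.debt_reduction_points)

    @property
    def planned_vs_actual(self) -> dict[str, dict[str, float]]:
        # Planned split: 70% features, 20% debt reduction, 10% operational.
        planned_feature_pct = 0.70
        planned_debt_pct = 0.20
        planned_ops_pct = 0.10

        return {
            "feature": {
                "planned": planned_feature_pct,
                "actual": self._share(self.feature_points),
            },
            "debt": {
                "planned": planned_debt_pct,
                "actual": self.debt_reduction_percent,
            },
            "operational": {
                "planned": planned_ops_pct,
                "actual": self._share(self.operational_points),
            },
        }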


def check_allocation_health(sprints: list[SprintAllocation]) -> str:
    if not sprints:
        return "No data"

    avg_debt_pct = sum(s.debt_reduction_percent for s in sprints) / len(sprints)
    sprints_below_target = sum(
        1 for s in sprints if s.debt_reduction_percent < 0.15
    )

    if avg_debt_pct >= 0.18 and sprints_below_target <= 1:
        status = "HEALTHY"
    elif avg_debt_pct >= 0.10:
        status = "AT RISK"
    else:
        status = "ACCUMULATING DEBT"

    return (
        f"Allocation health: {status}\n"
        f"Average debt reduction: {avg_debt_pct:.0%} (target: 20%)\n"
        f"Sprints below 15% threshold: {sprints_below_target}/{len(sprints)}\n"
    )

The engineer’s role

Having argued that technical debt is a management problem, I want to be precise about what this does and does not mean for engineers.

It does not mean engineers are passive recipients of conditions they cannot influence. Engineers who can communicate the cost of debt clearly, who can quantify its impact on velocity, who can make the case for investment in specific concrete terms, have real influence over the decisions that create and address it. The communication is a skill. Developing it matters.

It does not mean engineers should stop refactoring in the absence of explicit allocation. Local debt can and should be addressed in passing. Code should be left cleaner than it was found. These habits matter and they are entirely within the engineer’s control.

It does mean that engineers should stop accepting personal responsibility for systemic conditions. When an engineer internalises the technical debt as their failure, as evidence that they are not good enough or fast enough or disciplined enough, they are accepting an attribution that is incorrect and harmful. The debt exists because of decisions made about priorities. Those decisions were made by the organisation. The organisation is responsible for addressing them.

The engineer’s job is to identify the debt, quantify its cost, make the case for addressing it, and do what they can within the time they are given. The manager’s job is to ensure that time is given.

When the time is not given, the debt accumulates. When the debt accumulates enough, things break. When things break, someone asks why the engineers let it get this bad.

That question has an answer. It is just not the answer most people are looking for when they ask it.

Technical debt is the compound interest on the time the organisation chose not to invest.

The engineers did not choose.