The Rewrite That Wasn't Worth It

Engineering teams propose rewrites with confidence and complete them with regret. The second system is almost never as much better as it was supposed to be, and the cost is almost always more than anyone planned. Here is why, and what to do instead.

At some point in the life of almost every significant software system, someone proposes a rewrite. The codebase has accumulated enough complexity, enough inconsistency, enough decisions made under constraints that no longer apply, that the proposal feels not just reasonable but obvious. The old system is a mess. A clean start would be faster to build on, easier to understand, simpler to operate. The technical debt has become so heavy that paying it off incrementally seems harder than starting fresh.

The proposal gets approved. The rewrite begins. Months pass. Sometimes years. The new system is built. It is deployed. And then something uncomfortable happens: the new system has different problems than the old one, but it is not obviously better. It is faster in some ways and slower in others. It handles the cases the team thought about and struggles with the cases the old system had quietly learned to handle over years of production. The old system’s bugs were known. The new system’s bugs are unknown, which is worse.

The team looks at what they built and tries to remember what exactly was so bad about the old system that justified twelve months of this.

I have been through this more than once. I have watched teams go through it more times than that. The rewrite is one of the most seductive ideas in software engineering and one of the most reliably disappointing outcomes. Understanding why requires being honest about what the old system actually was, what the new system actually is, and what the proposal was actually solving.

What the old system really contains

When an engineer looks at an old codebase and sees a mess, they are seeing something real. The code is inconsistent. The architecture has grown without a plan. There are three ways of doing the same thing because three different teams built those three features at three different times. The naming conventions differ across files. The test coverage is uneven. The deployment process has manual steps that nobody can fully explain.

What they are not seeing, because it is invisible, is the accumulated knowledge encoded in the system.

Every weird special case in the old codebase exists because at some point the real world presented that case and someone had to handle it. The order processing code that treats orders from a specific country differently does so because there was an incident, or a regulatory requirement, or a customer complaint that caused someone to add that handling. The validation rule that seems arbitrary has a story behind it. The retry logic that looks overcomplicated was made complicated by a real failure mode.

This knowledge is not in the code in the form of comments. It is in the shape of the code. It is in the conditionals and the edge cases and the error handling. It exists in a form that is hard to read but is nonetheless real and important.

When you rewrite, you do not copy this knowledge. You copy the parts you understand and notice, which are the parts that are explicit and legible. The parts that are implicit, that live in the handling of edge cases nobody thought to document, those do not make it into the rewrite. They get rediscovered in production, one incident at a time, over the following year.

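Joel Spolsky made essentially this argument in his essay “Things You Should Never Do, Part I”: when you throw away the code, you throw away the knowledge collected in it. I think of it as throwing away the map. The old code, however ugly, is a map of every problem the system has encountered and survived. The rewrite throws away the map and asks the team to navigate the same terrain again from scratch.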

# What the rewrite team sees:
def calculate_shipping_cost(order):
    base_rate = 5.99
    if order.weight > 10:
        base_rate += (order.weight - 10) * 0.50
    if order.destination_country == "CA":
        base_rate *= 1.15
    if order.destination_state in ["AK", "HI"]:
        base_rate += 15.00
    if order.is_po_box and order.carrier == "UPS":
        base_rate += 3.00
    if order.created_at.weekday() == 4 and order.created_at.hour >= 15:
        base_rate += 2.50
    return round(base_rate, 2)


# What the rewrite team writes:
def calculate_shipping_cost(order):
    base_rate = Decimal("5.99")
    weight_surcharge = max(Decimal("0"), (order.weight - 10) * Decimal("0.50"))
    return base_rate + weight_surcharge


# What the rewrite team discovers six months after launch:
# - Canadian orders are being undercharged (tax handling was missing)
# - Alaska and Hawaii customers are calling support about shipping costs
# - UPS is refusing PO box deliveries because the surcharge logic
#   was also suppressing the PO box flag on the shipment record
# - Friday afternoon orders are arriving Tuesday instead of Monday
#   because the carrier cutoff handling was tied to that
#   Friday 3pm check nobody understood

The original code was ugly. Every conditional in it was a production incident that someone survived. The rewrite lost that institutional memory and is now paying for it again.

The second system effect

Fred Brooks described the second system effect in The Mythical Man-Month in 1975, and it has not stopped being true in the fifty years since.

The first system an engineer builds is constrained by uncertainty. They do not know everything about the domain, so they build cautiously. They do not know everything about the technology, so they make conservative choices. The constraints of ignorance produce a certain discipline.

The second system is built by engineers who have survived the first one. They know what they would have done differently. They know which features the first system needed but did not have. They have a backlog of improvements they could not make to the old system because the old system was too tangled. Now they have a clean slate.

This is dangerous precisely because it feels like opportunity.

The second system gets all the features the first system needed, plus the features the engineers wish the first system had, plus the features that seem like they will obviously be needed even though nobody has asked for them yet, plus the architectural abstractions that will make future features easier even though the future features are still hypothetical. The second system is the first system plus every good idea anyone on the team has had in the last three years.

Second systems are over-engineered. They are over-engineered because they are built by experienced engineers who have finally been given permission to do things right, and doing things right looks like solving not just the current problem but every adjacent problem they can imagine.

The irony is that the first system, which was built under constraint and felt like a compromise, was often close to the right size for the problem. The second system, which was built with experience and intention, is often too large, too abstract, and too slow for the same problem.

# First system: built under time pressure, does the job

class NotificationService:
    def send_email(self, to: str, subject: str, body: str) -> bool:
        try:
            self.smtp_client.send(to=to, subject=subject, body=body)
            return True
        except SMTPError as e:
            log.error("email.send.failed", to=to, error=str(e))
            return False


# Second system: built with architecture in mind,
# solves problems the team does not actually have

class NotificationChannel(ABC):
    @abstractmethod
    async def send(self, notification: Notification) -> NotificationResult:
        pass

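class NotificationRouter:
    # ChannelSelector is one more (hypothetical) abstraction; route() below
    # depends on it, so it has to be injected here.
    def __init__(self, channels: dict[str, NotificationChannel],
                 channel_selector: ChannelSelector):
        self.channels = channels
        self.channel_selector = channel_selector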

    async def route(self, notification: Notification) -> list[NotificationResult]:
        selected_channels = await self.channel_selector.select(notification)
        return await asyncio.gather(*[
            self.channels[channel].send(notification)
            for channel in selected_channels
        ])

class NotificationPipeline:
    def __init__(self, router: NotificationRouter, middleware: list[Middleware]):
        ...

# The team that built this has not added a second channel type in eight months.
# They are still sending email.
# They spent six weeks building a channel abstraction for email.

The second system serves the imagination of the engineers who built it. The first system served the users who needed it. The distance between those two things is where rewrites go wrong.

What the proposal is actually solving

When an engineer proposes a rewrite, the stated reason is almost always technical. The codebase is unmaintainable. The architecture cannot support the features we need. The technology is obsolete. The technical debt is too high to pay down incrementally.

These reasons are often true. They are rarely the complete reason.

Under the technical case, there is almost always a human case. The team is frustrated. They are spending their days inside a system that fights them. Every change requires understanding more context than any one person can hold. Every feature takes longer than it should. The system makes them feel slow and incapable when they want to feel fast and capable.

A rewrite is partly a technical proposal and partly a morale proposal. It says: we could have a codebase that does not fight us. We could work in something we built and understand. We could be fast again.

This is a legitimate need. The mistake is thinking that a rewrite satisfies it more reliably than a sustained improvement effort would.

The rewrite starts with excitement. New codebase, clean architecture, fresh choices. For the first few months it feels exactly like the proposal promised. And then the new system starts accumulating its own complexity. The features that were straightforward to describe are complicated to implement. The edge cases start appearing. The new system starts to feel, six months after launch, like the old system felt two years ago, except now there is also the operational cost of having run two systems in parallel during the transition.

The morale problem was real. The rewrite was not its solution.

When rewriting is actually right

I want to be careful not to argue that rewrites are always wrong, because they are not. There are circumstances where a rewrite is the correct decision, and conflating those with the common case would make this analysis useless.

When the technology is a genuine blocker. A system that must run on a platform that is being discontinued. A system written in a language that the team cannot hire for and cannot maintain. A system that depends on a third-party service that is shutting down. These are cases where the technology itself prevents the system from having a future, and a rewrite is not a preference but a necessity.

When the architecture prevents a required capability. Not a capability you would like to have. A capability the business requires in order to operate. If the monolith genuinely cannot support the isolation required by a new regulatory requirement, and if incremental extraction cannot get there in the time available, a rewrite may be justified. The key word is genuinely. Teams regularly discover that what seemed impossible in the old architecture was possible with more investment than anyone had been willing to make.

When the system has no tests and cannot be safely modified. A system with no tests, no documentation, no engineers who understand it, and a requirement to change significantly is in a situation where the cost of understanding and safely modifying it may exceed the cost of rewriting it. This is rare, and it reflects a failure of maintenance, but it happens, and a rewrite is sometimes the least bad option.

When the domain understanding has fundamentally changed. If the business domain the system models has changed so substantially that the system’s model is no longer a useful representation of the domain, incremental improvement may be harder than starting from a correct model. This is different from the domain growing and requiring new features. It is the underlying concepts having changed such that the old model is actively misleading.

Outside these cases, the rewrite is probably not the right call.

The alternative that teams do not try hard enough

The alternative to the rewrite is incremental improvement, and it gets dismissed too quickly in most rewrite conversations because teams have tried it and found it frustrating.

Incremental improvement is frustrating. It is slower than the rewrite promises to be. It requires living inside the old system while improving it, which is uncomfortable. The results are not as visually dramatic as a new codebase. There is no launch day.

But incremental improvement works in a way that rewrites frequently do not, for a reason that sounds simple and is actually profound: it preserves the knowledge encoded in the old system while improving its legibility.

The key is having a strategy rather than just cleaning up whatever looks messy.

# A strategy for incremental improvement:
# The strangler fig pattern applied at the module level

# Step 1: Identify the boundary of the thing to be improved
# The shipping cost calculation is tangled with the order module.
# We want to make it independently testable and modifiable.

# Step 2: Write characterisation tests for the current behaviour.
# Not tests of what it should do. Tests of what it actually does.
# Every edge case, even the ones that seem wrong.

from datetime import datetime
from decimal import Decimal


class TestShippingCostCharacterisation:
    """
    These tests document what the current system does.
    They are not tests of correctness.
    They are tests of behaviour.
    When we refactor, they must continue to pass.
    When we intentionally change behaviour, we update them.
    """

    @staticmethod
    def _cost(order):
        # The legacy function returns a rounded float. Normalise through
        # str() so the same Decimal assertions hold before and after
        # the refactor; a raw float never equals Decimal("5.99").
        return Decimal(str(calculate_shipping_cost(order)))

    def test_standard_domestic_order(self):
        order = make_order(weight=5, destination_country="US")
        assert self._cost(order) == Decimal("5.99")

    def test_canada_surcharge_applied(self):
        order = make_order(weight=5, destination_country="CA")
        assert self._cost(order) == Decimal("6.89")

    def test_alaska_hawaii_surcharge(self):
        order = make_order(weight=5, destination_state="AK")
        assert self._cost(order) == Decimal("20.99")

    def test_friday_afternoon_cutoff_surcharge(self):
        friday_3pm = datetime(2026, 5, 22, 15, 30)
        order = make_order(weight=5, created_at=friday_3pm)
        assert self._cost(order) == Decimal("8.49")

    # Continue until you have captured every observable behaviour.
    # This is the map. Do not throw it away.


# Step 3: Extract the module behind an interface.
# The old code still runs. The new code is introduced alongside it.

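class ShippingCalculator:
    def calculate(self, order: Order) -> Decimal:
        # Start by delegating to the old implementation.
        # The characterisation tests pass.
        # The legacy function returns a float, so convert via str()
        # to honour the Decimal return type.
        return Decimal(str(legacy_calculate_shipping_cost(order)))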


# Step 4: Replace the implementation incrementally,
# keeping the characterisation tests green.
# When you find an edge case you do not understand,
# write a test for it before you change it.
# The understanding is built during the improvement,
# not assumed before it.

class ShippingCalculator:
    def calculate(self, order: Order) -> Decimal:
        rate = ShippingRate(base=Decimal("5.99"))
        rate = self._apply_weight_surcharge(rate, order)
        rate = self._apply_destination_rules(rate, order)
        rate = self._apply_carrier_rules(rate, order)
        rate = self._apply_cutoff_rules(rate, order)
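        return rate.total

    def _apply_destination_rules(self, rate: ShippingRate, order: Order) -> ShippingRate:
        # Legacy behaviour: Canada multiplies the running rate by 1.15
        # (a 15% uplift, not a flat 0.15), applied before the AK/HI flat fee.
        # with_multiplier is a hypothetical ShippingRate method.
        if order.destination_country == "CA":
            rate = rate.with_multiplier(Decimal("1.15"), reason="canada_tax")
        if order.destination_state in ("AK", "HI"):
            rate = rate.with_surcharge(Decimal("15.00"), reason="remote_destination")
        return rate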
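
The snippets above use ShippingRate without defining it. What follows is one possible shape for it, a sketch rather than the real thing: the field names, the `steps` tuple, the `reasons` property and the `with_multiplier` method are all assumptions made for illustration. `with_multiplier` exists so that a percentage rule such as the legacy `base_rate *= 1.15` cannot be mistaken for a flat surcharge.

```python
from dataclasses import dataclass, replace
from decimal import Decimal, ROUND_HALF_UP


@dataclass(frozen=True)
class ShippingRate:
    """Immutable running rate that records why each adjustment was made.

    Hypothetical helper: the article uses ShippingRate without defining it.
    """

    base: Decimal
    # Each step is (kind, value, reason), applied in order by `total`.
    steps: tuple[tuple[str, Decimal, str], ...] = ()

    def with_surcharge(self, amount: Decimal, reason: str) -> "ShippingRate":
        """Return a new rate with a flat amount added."""
        return replace(self, steps=self.steps + (("add", amount, reason),))

    def with_multiplier(self, factor: Decimal, reason: str) -> "ShippingRate":
        """Return a new rate with the running total multiplied by `factor`."""
        return replace(self, steps=self.steps + (("multiply", factor, reason),))

    @property
    def total(self) -> Decimal:
        running = self.base
        for kind, value, _reason in self.steps:
            running = running + value if kind == "add" else running * value
        # Assumption: half-up rounding to cents. The legacy code calls round()
        # on a float, so exact half-cent cases may differ; the characterisation
        # tests are what should settle the rounding mode.
        return running.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

    @property
    def reasons(self) -> list[str]:
        """The recorded reason for every adjustment, in application order."""
        return [reason for _kind, _value, reason in self.steps]


base = ShippingRate(base=Decimal("5.99"))
canada = base.with_multiplier(Decimal("1.15"), reason="canada_tax")
alaska = base.with_surcharge(Decimal("15.00"), reason="remote_destination")
print(canada.total, canada.reasons)
print(alaska.total, alaska.reasons)
```

Keeping the reason next to each adjustment is the point of the exercise: the knowledge that used to live only in the shape of the conditionals now travels with the number it produced.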

This process is slower than the rewrite. It is also safer. The characterisation tests ensure that every known behaviour is preserved or intentionally changed. The knowledge encoded in the old system is transferred explicitly rather than lost implicitly.
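
Hand-written characterisation tests only cover the cases someone thought to write down. A complementary check, sketched here under stated assumptions, is to sweep the inputs the legacy code branches on through both implementations and list every disagreement. The `Order` dataclass, its defaults, the "FEDEX" carrier value and the helper names are hypothetical stand-ins. `legacy_cost` copies the legacy function shown earlier, and `candidate_cost` is the naive rewrite from the same example.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from itertools import product

WEDNESDAY_MORNING = datetime(2026, 5, 20, 10, 0)
FRIDAY_AFTERNOON = datetime(2026, 5, 22, 15, 30)


@dataclass(frozen=True)
class Order:
    """Minimal stand-in for the article's order object (fields are assumed)."""
    weight: int
    destination_country: str = "US"
    destination_state: str = ""
    is_po_box: bool = False
    carrier: str = "FEDEX"
    created_at: datetime = WEDNESDAY_MORNING


def legacy_cost(order: Order) -> float:
    """The legacy calculation from earlier in the article, unchanged."""
    base_rate = 5.99
    if order.weight > 10:
        base_rate += (order.weight - 10) * 0.50
    if order.destination_country == "CA":
        base_rate *= 1.15
    if order.destination_state in ["AK", "HI"]:
        base_rate += 15.00
    if order.is_po_box and order.carrier == "UPS":
        base_rate += 3.00
    if order.created_at.weekday() == 4 and order.created_at.hour >= 15:
        base_rate += 2.50
    return round(base_rate, 2)


def candidate_cost(order: Order) -> Decimal:
    """The naive rewrite from earlier: base rate plus weight surcharge only."""
    surcharge = max(Decimal("0"), (order.weight - 10) * Decimal("0.50"))
    return Decimal("5.99") + surcharge


def to_cents(value) -> Decimal:
    """Normalise float or Decimal results so they can be compared exactly."""
    return Decimal(str(value)).quantize(Decimal("0.01"))


def sample_orders() -> list[Order]:
    """Every combination of the inputs the legacy code branches on.

    Some combinations are unrealistic; that is acceptable for a sweep.
    """
    return [
        Order(weight, country, state, is_po_box, carrier, created_at)
        for weight, country, state, (is_po_box, carrier), created_at in product(
            [5, 10, 12],
            ["US", "CA"],
            ["", "AK"],
            [(False, "FEDEX"), (True, "UPS")],
            [WEDNESDAY_MORNING, FRIDAY_AFTERNOON],
        )
    ]


def find_mismatches(legacy, candidate, orders):
    """Return (order, legacy_result, candidate_result) wherever the two disagree."""
    mismatches = []
    for order in orders:
        old, new = to_cents(legacy(order)), to_cents(candidate(order))
        if old != new:
            mismatches.append((order, old, new))
    return mismatches


orders = sample_orders()
mismatches = find_mismatches(legacy_cost, candidate_cost, orders)
print(f"{len(mismatches)} of {len(orders)} sampled orders differ")
for order, old, new in mismatches[:3]:
    print(order, old, new)
```

Each mismatch is either a behaviour to carry over or a change to make on purpose, and either way it is worth turning into a characterisation test before the old implementation is retired.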

The conversation worth having instead

When a rewrite is proposed, the conversation that usually happens is about the technical state of the old system. The conversation that should happen is different.

What specific problems are we solving? Not “the codebase is a mess”: that is a description, not a problem. What features cannot be built in the current system? What bugs cannot be fixed? What performance requirements cannot be met? What operational problems cannot be solved?

If the answers are specific, what is the cost of solving each one incrementally versus the cost of a rewrite? The rewrite cost is almost always higher than estimated and the timeline is almost always longer than planned. What is the specific estimate for each of those?

What knowledge lives in the old system that is not documented and will not survive the rewrite? Who is responsible for ensuring that knowledge is transferred?

What is the plan for operating two systems in parallel during the transition? How long will that parallel operation last? What are the criteria for retiring the old system?

What makes us confident the second system will not develop the same problems as the first? What specifically will be different about how it is built and maintained?

These questions are not arguments against the rewrite. They are the questions that determine whether the rewrite is justified and whether the team has thought through what they are committing to. Teams that can answer them clearly are in a position to make a good decision. Teams that cannot answer them are proposing a rewrite on the basis of frustration rather than analysis.

Frustration is valid. It is not a technical plan.

The thing that actually needs to change

Behind most rewrite proposals is a team that has been asked to move fast in a system that makes moving fast hard. The system makes moving fast hard because it accumulated technical debt over time. The debt accumulated because paying it off was deprioritised in favour of features, repeatedly, over a long period of time.

A rewrite pays off the accumulated debt in a single lump-sum payment. It does not change the conditions that caused the debt to accumulate. Those conditions, the feature pressure, the insufficient investment in quality, the absence of time for incremental improvement, will produce the same debt in the new system over the same timescale.

Teams that come out of a rewrite and immediately face the same pressure to deliver features without time for quality will have the same codebase in three years that they were trying to escape. The rewrite bought them a reset. It did not buy them a different relationship with technical debt.

The thing that actually needs to change is the allocation of engineering time between feature delivery and system improvement. This is not a technical decision. It is a leadership decision about what the team is being asked to do and what resources they are being given to do it.

A rewrite can be the right technical decision in specific circumstances. It is almost never sufficient on its own. Without the change in how time is allocated, the new system will be the old system in three years. With that change, incremental improvement can transform the old system without the cost and the risk of replacing it.

The proposal worth making is not “let us rewrite the system.” It is “let us invest in the system consistently, and here is what we will be able to build if we do.” That proposal is harder to make dramatic. It is also more likely to actually solve the problem.