Unit Tests Are Overrated and You Know It

I’m going to say something that will make some people close this tab immediately: most unit tests are not worth the time it takes to write and maintain them, and the culture around unit testing has caused more harm to software quality than it has prevented.

Not all unit tests. Not testing in general. Specifically the orthodoxy that says you should test every function, mock every dependency, aim for maximum coverage, and measure quality by how many green checkmarks your test runner produces.

That orthodoxy is producing codebases that are simultaneously over-tested and under-validated. Teams that spend enormous engineering hours maintaining test suites that don’t catch the bugs that actually affect users. Developers who spend more time making tests pass than making software work. Coverage reports that read ninety percent and services that break every other deployment.

If this makes you uncomfortable, good. Stay with the discomfort for a minute, because the alternative is continuing to do something that doesn’t work while calling it best practice.

What unit tests actually test

A unit test tests a unit of code in isolation. The unit is typically a function or a class. The dependencies of that unit — other functions, databases, external services — are replaced with mocks or fakes that return controlled responses.

This is valuable for exactly one category of problem: logic that lives in pure functions, isolated from external state, where the relationship between input and output is the entire thing being tested.

# This is worth unit testing. The logic is the point.
from decimal import Decimal


def calculate_discount(
    base_price: Decimal,
    customer_tier: str,
    order_quantity: int,
) -> Decimal:
    if customer_tier == "enterprise":
        tier_discount = Decimal("0.20")
    elif customer_tier == "pro":
        tier_discount = Decimal("0.10")
    else:
        tier_discount = Decimal("0.00")

    quantity_discount = Decimal("0.05") if order_quantity >= 100 else Decimal("0.00")
    total_discount = min(tier_discount + quantity_discount, Decimal("0.25"))

    return base_price * (1 - total_discount)

A unit test for this function is testing the right thing. The function is pure. Its behavior is entirely determined by its inputs. There are no external dependencies to mock. The test directly validates the business logic.
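A minimal sketch of what such a test could look like, written with plain `assert` statements so it runs without a test framework. The function is repeated so the block is self-contained; the prices and expected values are illustrative inputs chosen for this example, not figures from a real pricing table.

```python
from decimal import Decimal


def calculate_discount(base_price: Decimal, customer_tier: str, order_quantity: int) -> Decimal:
    # Same logic as the function above, repeated so this block runs on its own.
    if customer_tier == "enterprise":
        tier_discount = Decimal("0.20")
    elif customer_tier == "pro":
        tier_discount = Decimal("0.10")
    else:
        tier_discount = Decimal("0.00")
    quantity_discount = Decimal("0.05") if order_quantity >= 100 else Decimal("0.00")
    total_discount = min(tier_discount + quantity_discount, Decimal("0.25"))
    return base_price * (1 - total_discount)


def test_tier_discounts():
    price = Decimal("100.00")
    assert calculate_discount(price, "enterprise", 1) == Decimal("80.00")
    assert calculate_discount(price, "pro", 1) == Decimal("90.00")
    assert calculate_discount(price, "free", 1) == Decimal("100.00")


def test_quantity_discount_starts_at_100_units():
    price = Decimal("100.00")
    assert calculate_discount(price, "pro", 99) == Decimal("90.00")
    assert calculate_discount(price, "pro", 100) == Decimal("85.00")


def test_enterprise_bulk_order_reaches_the_cap():
    # 0.20 + 0.05 lands exactly on the 0.25 cap.
    assert calculate_discount(Decimal("100.00"), "enterprise", 100) == Decimal("75.00")
```

Writing the boundary cases also surfaces something the happy path hides: with the current tiers the combined discount can reach the cap but never exceed it, so the `min` is only ever exercised at its boundary.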

Now look at what most unit tests actually test:

# This is what unit tests look like in most codebases.
from unittest.mock import Mock, patch

from app.services.payment import process_payment


@patch('app.services.payment.stripe_client')
@patch('app.services.payment.db')
@patch('app.services.payment.email_service')
@patch('app.services.payment.inventory_service')
async def test_process_payment(
    mock_inventory,
    mock_email,
    mock_db,
    mock_stripe,
):
    mock_stripe.create_payment_intent.return_value = Mock(id="pi_123", status="succeeded")
    mock_db.get_order.return_value = Mock(id="order_1", total=49.99, user_id="user_1")
    mock_inventory.reserve.return_value = True
    mock_email.send.return_value = None

    result = await process_payment("order_1")

    assert result.status == "completed"
    mock_stripe.create_payment_intent.assert_called_once()
    mock_inventory.reserve.assert_called_once()
    mock_email.send.assert_called_once()

What is this test actually testing? It is testing that when everything works exactly as mocked, the function calls each mocked dependency once and returns the expected result.

It is not testing what happens when Stripe returns an error. It is not testing what happens when the database is unavailable. It is not testing what happens when inventory reservation fails after payment succeeds, leaving a paid order in a broken state. It is not testing the actual integration between these components.

It is testing that the code is wired together the way it was wired together when the test was written. It is a snapshot of the implementation masquerading as a validation of the behavior.

And it will pass green on every run until the day something real breaks in production, at which point it will still pass green because the mocks are still returning what you told them to return.
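One of the gaps listed above, a reservation that fails after the charge has gone through, can be sketched with hand-written fakes that fail on purpose. Everything here is hypothetical: `process_payment` as written below, `FakeGateway`, `FailingInventory`, and the decision to refund on failure are assumptions made for illustration, not the code under discussion.

```python
class FakeGateway:
    """Hypothetical payment gateway that records charges and refunds."""

    def __init__(self):
        self.charged = []
        self.refunded = []

    def charge(self, order_id: str, amount_cents: int) -> str:
        self.charged.append(order_id)
        return f"ch_{order_id}"

    def refund(self, charge_id: str) -> None:
        self.refunded.append(charge_id)


class FailingInventory:
    """Hypothetical inventory service that always refuses the reservation."""

    def reserve(self, order_id: str) -> bool:
        return False


def process_payment(order_id: str, amount_cents: int, gateway, inventory) -> str:
    # Assumed policy: if stock cannot be reserved after charging, refund.
    charge_id = gateway.charge(order_id, amount_cents)
    if not inventory.reserve(order_id):
        gateway.refund(charge_id)
        return "refunded"
    return "completed"


def test_failed_reservation_does_not_leave_a_paid_order():
    gateway = FakeGateway()
    status = process_payment("order_1", 4999, gateway, FailingInventory())
    assert status == "refunded"
    assert gateway.charged == ["order_1"]
    assert gateway.refunded == ["ch_order_1"]
```

A happy-path test never reaches the refund branch, so it would stay green whether or not that branch exists.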

The mock problem

Mocks are the original sin of unit testing culture. They were created to solve a real problem — tests that depend on external services are slow, unreliable, and hard to set up — and they solved that problem by replacing the external service with a fake version that does whatever the test needs it to do.

The consequence is that your test suite no longer tests your software. It tests your software’s interaction with your software’s assumptions about how its dependencies behave. When those assumptions are wrong — when the real Stripe API returns a response shape that’s slightly different from what you mocked, when the real database has a different transaction isolation level than your mock assumes, when the real email service deduplicates in a way your mock doesn’t — your tests pass and your production breaks.

I have debugged more production incidents that were caused by the gap between mocked behavior and real behavior than I can count. The test said it worked. The mock said the API returned this. The real API does not return this. The test was wrong about the contract, and because the test was wrong, the code was deployed with a broken assumption that nobody caught.

The more you mock, the less your tests tell you about whether the software works. This is not a design smell to be managed — it’s a fundamental property of mocking. Every mock is a place where reality has been replaced with assumption.
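The gap between a mock and the real contract fits in a few lines. This is a sketch under stated assumptions: `charge_succeeded`, `PaymentClientV2`, and both response shapes are hypothetical and do not describe any real payment provider's API.

```python
from unittest.mock import Mock


def charge_succeeded(client, amount_cents: int) -> bool:
    # Code written against the response shape the team remembers.
    response = client.create_charge(amount_cents)
    return response["status"] == "succeeded"


class PaymentClientV2:
    """Hypothetical stand-in for the provider after a response-shape change."""

    def create_charge(self, amount_cents: int) -> dict:
        # The status now lives one level deeper than the mock assumes.
        return {"charge": {"status": "succeeded", "amount": amount_cents}}


def test_with_mock():
    client = Mock()
    client.create_charge.return_value = {"status": "succeeded"}
    # Passes, because the mock encodes the old shape.
    assert charge_succeeded(client, 4999)


def run_against_current_client() -> str:
    # The same code fails as soon as it meets the changed response.
    try:
        charge_succeeded(PaymentClientV2(), 4999)
    except KeyError as exc:
        return f"broke on missing key {exc}"
    return "ok"
```

The mocked test keeps passing after the provider changes, because nothing in it ever consults the provider.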

The coverage lie

Coverage is the most destructive metric in software engineering.

Not because high coverage is bad. Because coverage as a target produces the wrong behavior. When coverage is a goal, developers write tests to cover code rather than to validate behavior. These are different activities that produce very different tests.

A test written to cover code asks: how do I execute this line? A test written to validate behavior asks: what should this system do, and how do I know it’s doing it?

Tests written to cover code tend to be thin — they call the function with happy-path inputs and assert that it doesn’t throw. They increase coverage. They do not increase confidence.

import pytest

# Written to cover code. Gets you to 100% on this function.
def test_create_user():
    user = create_user(email="user@example.com", password="password123")
    assert user is not None

# Written to validate behavior. Tests what actually matters.
def test_create_user_hashes_password():
    user = create_user(email="user@example.com", password="password123")
    assert user.password_hash != "password123"
    assert verify_password("password123", user.password_hash)

def test_create_user_rejects_duplicate_email():
    create_user(email="user@example.com", password="password123")
    with pytest.raises(DuplicateEmailError):
        create_user(email="user@example.com", password="different")

def test_create_user_sends_verification_email(fake_email_sender):
    create_user(email="user@example.com", password="password123")
    assert any(
        email.to == "user@example.com" and "verify" in email.subject.lower()
        for email in fake_email_sender.sent
    )

def test_create_user_with_invalid_email():
    with pytest.raises(ValidationError, match="invalid email"):
        create_user(email="not-an-email", password="password123")

The second set has the same line coverage as the first if the function is simple. It tests fundamentally different things. A system with the first kind of tests has coverage. A system with the second kind has confidence.

Coverage rewards quantity. Confidence comes from quality. These are not correlated, and treating them as if they are has produced an industry-wide habit of writing many low-value tests instead of fewer high-value ones.
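A small illustration of the gap between coverage and confidence. `average_order_value` is a hypothetical one-line function; a single happy-path call executes every line of it, so a line-coverage tool would report it as fully covered, yet the empty-input failure goes unnoticed.

```python
def average_order_value(totals: list[float]) -> float:
    # One line, no branches: any call that returns has executed all of it.
    return sum(totals) / len(totals)


def test_written_for_coverage():
    # Executes the function and asserts almost nothing.
    assert average_order_value([10.0, 20.0]) is not None


def test_written_for_behavior():
    assert average_order_value([10.0, 20.0]) == 15.0
    # A customer with no orders is a normal case, and the function crashes on it.
    try:
        average_order_value([])
    except ZeroDivisionError:
        crashed = True
    else:
        crashed = False
    assert crashed, "documents the bug the coverage-driven test never exercised"
```

Both tests produce the same coverage number for this function. Only the second one says anything about what happens to a customer with no orders.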

What actually breaks in production

Here is a list of things that unit tests, as typically practiced, will never catch:

The query that works correctly against your test database with twenty rows and times out against production with two million rows
The race condition that only manifests when two requests hit the same endpoint within fifty milliseconds of each other
The API response from your payment provider that changed shape slightly in a minor version update
The session expiry behavior that’s different in the production Redis configuration than in the in-memory fake you test against
The cascade delete behavior that your ORM handles differently than the raw SQL you use in the migration script
The encoding issue that only appears when a user’s name contains a character outside the ASCII range
The timeout that is set correctly in the service but not propagated to the client that calls it

Every item on this list is a production incident I have personally been part of. None of them was caught by unit tests. Most of them would have been caught by integration tests that weren’t written because the team was busy maintaining the unit test suite.
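The encoding item is easy to reproduce in miniature. In this sketch `export_row` is a hypothetical helper that writes a CSV line with an ASCII-only encoder, a mistake that stays invisible as long as every fixture uses names like the first one. The point is less about the test type than about how far test data sits from production data.

```python
def export_row(name: str, email: str) -> bytes:
    # Bug: assumes every name fits in ASCII.
    return f"{name},{email}\n".encode("ascii")


def export_row_fixed(name: str, email: str) -> bytes:
    # UTF-8 handles names outside the ASCII range.
    return f"{name},{email}\n".encode("utf-8")


def test_typical_fixture_passes():
    assert export_row("Alice", "alice@example.com") == b"Alice,alice@example.com\n"


def test_realistic_name_exposes_the_bug():
    try:
        export_row("Zoë", "zoe@example.com")
    except UnicodeEncodeError:
        failed = True
    else:
        failed = False
    assert failed
    expected = "Zoë,zoe@example.com\n".encode("utf-8")
    assert export_row_fixed("Zoë", "zoe@example.com") == expected
```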

This is the trade you make when you prioritize unit testing: you get fast, reliable tests that validate your assumptions, and you skip the slower, harder tests that would challenge them.

What to do instead

I am not arguing for no tests. I’m arguing for tests calibrated to where the real risk is.

Integration tests over unit tests for anything with dependencies.

If a function touches a database, a cache, a message queue, or an external service — test it against the real thing, or as close to the real thing as you can get. Not a mock. Not an in-memory fake that you wrote. A real database with a real schema and real data volumes. A real Redis instance. A real queue.

Yes, these tests are slower. Run them in CI, not on every save. They are dramatically more valuable than unit tests that mock the same dependencies because they test what the code actually does, not what you assumed the code would do.

# This is worth the setup cost. It catches real problems.
@pytest.mark.integration
async def test_process_payment_handles_stripe_card_declined(
    test_db,       # Real PostgreSQL, real schema
    stripe_mock,   # Stripe's own test environment, not our mock
):
    order = await create_test_order(test_db, total=Decimal("49.99"))

    # Stripe's test mode provides special tokens and card numbers that trigger specific behaviors
    result = await process_payment(
        order_id=order.id,
        card_token="tok_chargeDeclined",  # Stripe test token for declines
    )

    assert result.status == "failed"
    assert result.failure_code == "card_declined"

    # Verify the order status was updated correctly in the real database
    updated_order = await test_db.fetch_one(
        "SELECT status FROM orders WHERE id = $1",
        order.id
    )
    assert updated_order["status"] == "payment_failed"

    # Verify no inventory was reserved for a failed payment
    reservation = await test_db.fetch_one(
        "SELECT id FROM inventory_reservations WHERE order_id = $1",
        order.id
    )
    assert reservation is None

This test uses a real database and Stripe’s test environment. It is slower than a mocked unit test. It tests whether the actual system behaves correctly when a real dependency does something unexpected. It is the test you actually need.
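Why a hand-written in-memory fake is not enough can be shown in miniature. `sqlite3` from the standard library stands in for the database here only so the example is self-contained; the argument above is for the engine you actually run in production. `DictUserStore`, `SqlUserStore`, and the case-insensitive uniqueness rule are all hypothetical.

```python
import sqlite3


class DictUserStore:
    """Hand-written in-memory fake: exact-match duplicate check."""

    def __init__(self):
        self.emails = set()

    def add(self, email: str) -> None:
        if email in self.emails:
            raise ValueError("duplicate email")
        self.emails.add(email)


class SqlUserStore:
    """Same interface, but uniqueness is enforced by the schema."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        # Assumed schema rule: emails are unique regardless of letter case.
        conn.execute(
            "CREATE TABLE users (email TEXT NOT NULL COLLATE NOCASE UNIQUE)"
        )

    def add(self, email: str) -> None:
        try:
            with self.conn:  # commits on success, rolls back on error
                self.conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        except sqlite3.IntegrityError as exc:
            raise ValueError("duplicate email") from exc


def add_same_email_in_two_cases(store) -> str:
    store.add("Ada@example.com")
    try:
        store.add("ada@example.com")
    except ValueError:
        return "rejected"
    return "accepted"


def test_fake_and_schema_disagree():
    # The fake happily stores both spellings; the schema refuses the second.
    assert add_same_email_in_two_cases(DictUserStore()) == "accepted"
    real = SqlUserStore(sqlite3.connect(":memory:"))
    assert add_same_email_in_two_cases(real) == "rejected"
```

A suite that runs only against the fake would pass while the behavior users see is decided by the schema.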

Test behavior at the system boundary, not implementation in the middle.

The most valuable tests are the ones that call your API, your message handler, your batch job — the public interface of your system — and assert on the observable output. Not which functions were called, not which mocks were invoked. What came out.

@pytest.mark.integration
async def test_order_api_returns_correct_status_after_payment(
    client,
    test_db,
):
    # Create an order through the API
    create_response = await client.post("/orders", json={
        "items": [{"product_id": "prod_1", "quantity": 2}]
    })
    assert create_response.status_code == 201
    order_id = create_response.json()["id"]

    # Process payment through the API
    payment_response = await client.post(f"/orders/{order_id}/pay", json={
        "card_token": "tok_visa"
    })
    assert payment_response.status_code == 200

    # Verify the order status reflects the payment
    order_response = await client.get(f"/orders/{order_id}")
    assert order_response.json()["status"] == "confirmed"
    assert order_response.json()["payment"]["status"] == "succeeded"

This test goes through the API, through the service layer, through the database, and back. It validates the entire vertical slice. It would catch a bug in the API handler, a bug in the service logic, a bug in the database query, or a bug in the response serializer. A unit test that mocked every other layer could catch only the bug in the one layer it exercises.

Reserve unit tests for pure logic.

Unit tests are excellent for exactly what they’re suited for: pure functions with complex branching logic where the relationship between input and output is the whole point. Discount calculations. Validation rules. Data transformations. Parsing logic. Algorithms.

These are worth unit testing because the test is actually testing the logic. There’s nothing to mock. The test runs in microseconds. Failures tell you exactly what’s wrong.

For everything else — anything that touches infrastructure, anything that coordinates between components, anything that talks to external systems — integration tests are not just better, they’re the only tests that tell you anything true.

The heresy in full

Here is the position I’m staking out, clearly, so it can be clearly disagreed with:

A codebase with forty percent coverage from integration tests that test real behavior against real dependencies is more reliable than a codebase with ninety percent coverage from unit tests that mock every external interaction.

Coverage is not quality. Mocks are not validation. A green test suite is not a guarantee that the software works — it’s a guarantee that the software works according to the assumptions baked into the tests, which may or may not match reality.

The software quality crisis is not a testing crisis. We test more than we ever have. The crisis is a misalignment between what we test and what breaks. We test pure logic obsessively and integration boundaries barely. The bugs live at the integration boundaries. They always have.

The counterargument I hear most often: integration tests are slow. Yes. They are. They are slow because they do real things. Real things take time. The alternative is fast tests that don’t do real things and therefore don’t tell you whether the real things work.

Speed is not a virtue in a test suite. Accuracy is.

I expect this to generate disagreement. That’s fine. The developers most likely to disagree are the ones who have invested the most in unit testing culture, which makes their disagreement somewhat self-interested. The developers most likely to agree quietly are the ones who have been paged at 3am because a perfectly unit-tested function didn’t work the way its mocks said it would.

Those developers know. They’ve always known. This is just someone finally saying it.