Your CI Pipeline Is Lying to You

There’s a signal every engineering team watches: the build status. Green means good. Red means something needs fixing. The entire ritual of CI/CD (the commits, the pipeline runs, the Slack notifications, the deploy gates) is built around the assumption that this signal is telling you the truth.

It usually isn’t.

I’ve worked with enough pipelines in enough organisations to have developed a strong prior: when a team tells me their CI is green, what they mean is that their CI is passing, and what they assume is that the code is safe to ship. Those are not the same thing. A pipeline that passes is a pipeline that has successfully run the checks it was configured to run. Whether those checks are the right checks, whether they cover the failure modes that will actually hurt you, whether the suite is actually preventing bad code from reaching production: these are separate questions, and most teams have never seriously asked them.

This article is about what a pipeline that tells the truth actually looks like, why most pipelines drift away from that over time, and what it costs you when they do.

How pipelines start lying

Pipelines don’t start dishonest. When a team first sets up CI, they add checks that reflect real concerns. Unit tests for the logic they’ve written. A linter to catch obvious style issues. Maybe a build step to verify the application compiles. These checks are honest because they’re connected to things the team is actively thinking about.

Then the codebase grows. The team grows. Time passes.

A test starts flaking — passing sometimes, failing other times, with no code changes in between. Nobody has time to fix it properly, so it gets marked as skipped, or wrapped in a retry, or moved to a separate job that doesn’t gate the merge. The suite passes reliably again. The underlying problem is now invisible.

A new service gets added. The integration tests don’t cover the interaction between the new service and the existing ones because integration tests are slow and the team is moving fast. The unit tests for each service pass. The pipeline is green. The integration boundary is untested.

Coverage thresholds get set at whatever the coverage was when the threshold was introduced, which means they gate on “don’t get worse” rather than “be good.” Over time, the high-coverage paths are the ones that were covered early. The business-critical code added later lives in the uncovered gaps.

A performance regression ships to production. The pipeline had no performance checks. It was never designed to catch this failure mode. The pipeline was green because it wasn’t measuring the thing that broke.

None of these are catastrophic decisions in isolation. Each one is a reasonable response to a real constraint — a flaky test, a time pressure, an evolving architecture. But they compound. The pipeline accumulates technical debt the same way the codebase does, except pipeline debt is harder to see because the artifact it produces — a green build — looks exactly the same whether the checks are meaningful or not.

The four ways a pipeline lies

I’ve found it useful to categorise pipeline dishonesty into four types, because each one has a different cause and a different fix.

Coverage lies. The pipeline reports 80% test coverage. What it doesn’t report: the 20% that’s uncovered is disproportionately the error handling paths, the edge cases, and the code that was added quickly under pressure — exactly the code most likely to fail in production. Coverage as a number tells you how much code has been executed by tests. It tells you almost nothing about whether the tests are testing the right things.
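One way to make the number less misleading is to stop gating on a single global percentage and hold the paths you care about to their own floor. The sketch below assumes the report shape written by coverage.py’s `coverage json` command (a `files` mapping with a `summary.percent_covered` value per file). The path prefixes and floors are hypothetical, and the sample report stands in for a real `coverage.json`.

```python
# Hypothetical per-path floors: the prefixes and percentages are illustrative.
CRITICAL_FLOORS = {
    "app/payments/": 95.0,
    "app/auth/": 90.0,
}


def find_coverage_violations(report, floors):
    """Return (path, percent_covered, floor) for each file below its floor."""
    violations = []
    for path, data in report.get("files", {}).items():
        percent = data["summary"]["percent_covered"]
        for prefix, floor in floors.items():
            if path.startswith(prefix) and percent < floor:
                violations.append((path, percent, floor))
    return sorted(violations)


# A tiny report in the assumed shape, standing in for a loaded coverage.json.
sample_report = {
    "files": {
        "app/payments/refunds.py": {"summary": {"percent_covered": 62.5}},
        "app/auth/tokens.py": {"summary": {"percent_covered": 97.0}},
        "app/utils/strings.py": {"summary": {"percent_covered": 40.0}},
    }
}

violations = find_coverage_violations(sample_report, CRITICAL_FLOORS)
for path, percent, floor in violations:
    print(f"{path}: {percent:.1f}% is below the {floor:.0f}% floor")
```

This still only measures execution, not whether the assertions are meaningful, but it stops a well-covered utility module from masking a thinly covered payments module.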

Speed lies. The pipeline runs in eight minutes. To get there, someone parallelised aggressively and cut the integration test suite because it was too slow. The unit tests run in eight minutes. The integration tests run in forty-five minutes and are therefore scheduled nightly, which means a broken integration goes undetected for up to twenty-four hours. The build feels fast. The feedback loop is slow.

Environment lies. The tests pass in CI. They fail in production. The CI environment has different environment variables, different service versions, different network topology, different hardware characteristics than production. Tests that pass in one environment and fail in another are not testing the software — they’re testing the environment. When CI’s environment diverges significantly from production, CI is testing a fiction.

Flake lies. A test has a ten percent failure rate. The pipeline retries failed jobs automatically. Nine runs in ten pass first time. Of the runs that fail, nine in ten pass on the retry, so the build ends up green roughly ninety-nine percent of the time. On the rare occasion the retry also fails, someone reruns the pipeline manually. The test failure is logged nowhere, tracked nowhere, and nobody owns fixing it. Flaky tests are a signal: they’re telling you something about race conditions, timing dependencies, or environmental assumptions in your code. Automatic retries suppress the signal.
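The arithmetic behind that suppression is worth seeing once. This is a minimal sketch that assumes each attempt fails independently with the same probability, which real flakes often don’t, and that the pipeline allows a single automatic retry. The function names are mine.

```python
def green_probability(failure_rate, retries):
    """Probability that at least one of (1 + retries) independent attempts passes."""
    return 1.0 - failure_rate ** (retries + 1)


def hidden_failures_per_100_builds(failure_rate, retries):
    """Expected builds per 100 where the first attempt failed but a retry went green."""
    first_attempt_fails = failure_rate
    every_attempt_fails = failure_rate ** (retries + 1)
    return 100 * (first_attempt_fails - every_attempt_fails)


p_green = green_probability(0.10, retries=1)
hidden = hidden_failures_per_100_builds(0.10, retries=1)

print(f"Builds that end up green: {p_green:.0%}")
print(f"Failures hidden by the retry, per 100 builds: {hidden:.0f}")
```

Under those assumptions a test that fails one run in ten produces a green build about 99 times in 100, and roughly nine real failures per hundred builds never reach a human.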

What a truthful pipeline looks like

A pipeline that tells the truth has one defining property: when it’s green, you’re confident the code is ready to ship. Not cautiously optimistic. Confident. If you wouldn’t feel comfortable deploying on a green build without additional manual checks, your pipeline is not telling you the truth.

Getting there requires building checks around the failure modes that actually hurt you, not the failure modes that are easy to test.

Test the contract, not just the implementation. Unit tests verify that individual functions behave correctly. They don’t verify that services behave correctly toward each other. Add contract tests — tests that verify the API contract between services — so that when a service changes its output format, the pipeline catches it before the consuming service deploys with a broken assumption.

Pact is a widely used tool for this. The pattern is simple: the consumer defines what it expects from the provider, the provider verifies it can meet those expectations, and the CI pipeline runs both sides.

# In your pipeline
- name: Consumer contract tests
  run: npm run test:pact:consumer

- name: Publish contracts to broker
  run: npm run pact:publish
  env:
    PACT_BROKER_URL: ${{ secrets.PACT_BROKER_URL }}

- name: Provider verification
  run: npm run test:pact:provider
  env:
    PACT_BROKER_URL: ${{ secrets.PACT_BROKER_URL }}
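To make the consumer-driven pattern concrete without leaning on Pact’s own API, here is a hand-rolled sketch of the idea. It is not how Pact is implemented or used. The contract, field names, and sample responses are all hypothetical.

```python
# Hypothetical contract: the consumer records the fields and types it relies on.
ORDER_CONTRACT = {
    "id": str,
    "total_cents": int,
    "currency": str,
}


def contract_violations(contract, response):
    """List every way a provider response breaks the consumer's expectations."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            actual = type(response[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems


# Provider side: check what the handler returns against the consumer's contract.
old_response = {"id": "ord_1", "total_cents": 1999, "currency": "EUR"}
new_response = {"id": "ord_1", "total": "19.99", "currency": "EUR"}  # field renamed

old_problems = contract_violations(ORDER_CONTRACT, old_response)
new_problems = contract_violations(ORDER_CONTRACT, new_response)
print(new_problems)
```

Extra fields in the response are deliberately ignored: the consumer only pins down what it actually reads, so the provider stays free to add things.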

Test production behaviour, not test behaviour. The most dangerous environment gap is the one between how your application behaves when its dependencies are unavailable, slow, or returning errors and how it behaves in a clean test environment where everything works. Add chaos to your test suite.

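from unittest.mock import patch

import httpx

# `client` and `valid_order` are assumed to be fixtures from your conftest.py:
# a test client for the app and a well-formed order payload. The patch target
# assumes the app calls the payment service through httpx.AsyncClient.post.

def test_payment_service_timeout_handled_gracefully(client, valid_order):
    """
    If the payment service takes more than 3s, we should
    return a 503 with a retry-after header, not a 500.
    """
    # Simulate the timeout instead of waiting for a real one.
    with patch("httpx.AsyncClient.post") as mock_post:
        mock_post.side_effect = httpx.TimeoutException("timeout")
        response = client.post("/checkout", json=valid_order)

    assert response.status_code == 503
    assert "retry-after" in response.headers
    assert response.json()["error"] == "payment_service_unavailable"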

This test is not testing whether the payment service works. It’s testing whether your service handles the payment service failing. That’s the failure mode that reaches users.
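For completeness, this is roughly the shape of handler that such a test pushes you towards. It is a framework-free sketch, not the code behind the test above: the built-in `TimeoutError` stands in for `httpx.TimeoutException`, and `checkout`, `charge`, and the retry interval are hypothetical.

```python
RETRY_AFTER_SECONDS = 5  # hypothetical back-off hint for callers


def checkout(order, charge):
    """Return (status, headers, body); `charge` is the call to the payment service."""
    try:
        receipt = charge(order)
    except TimeoutError:
        # The dependency is slow or down: tell the caller to retry later
        # instead of letting the exception surface as a 500.
        return (
            503,
            {"retry-after": str(RETRY_AFTER_SECONDS)},
            {"error": "payment_service_unavailable"},
        )
    return 200, {}, {"receipt": receipt}


def charge_times_out(order):
    raise TimeoutError("payment service did not respond")


def charge_succeeds(order):
    return {"charged_cents": order["total_cents"]}


status, headers, body = checkout({"total_cents": 1999}, charge_times_out)
ok_status, _, ok_body = checkout({"total_cents": 1999}, charge_succeeds)
print(status, headers, body)
```

Passing the payment call in as an argument is a design choice for the sketch: it makes the failure injectable without any patching.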

Make flakes first-class failures. Stop retrying failed tests silently. When a test fails and a retry is triggered, log it. Track flake rate per test over time. Set a threshold — any test with a flake rate above two percent is a failing test that is temporarily passing — and treat crossing that threshold as a build failure.

# Requires the pytest-rerunfailures and pytest-reportlog plugins.
- name: Run tests with flake tracking
  run: |
    pytest \
      --reruns 2 \
      --reruns-delay 1 \
      --report-log=test-results.jsonl \
      -v

- name: Check flake rate
  run: python scripts/check_flake_rate.py test-results.jsonl

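# scripts/check_flake_rate.py
# Reads one or more pytest-reportlog files. Assumes pytest-rerunfailures,
# which records each retried attempt with the outcome "rerun".
import json
import sys

FLAKE_THRESHOLD = 0.02  # 2%

tests = {}
for path in sys.argv[1:]:
    with open(path) as log:
        for line in log:
            r = json.loads(line)
            # Count only the test body, not its setup and teardown reports.
            if r.get("$report_type") != "TestReport" or r.get("when") != "call":
                continue
            stats = tests.setdefault(r["nodeid"], {"runs": 0, "failures": 0})
            stats["runs"] += 1
            if r.get("outcome") in ("failed", "rerun"):
                stats["failures"] += 1

flaky = []
for name, stats in tests.items():
    rate = stats["failures"] / stats["runs"]
    # Flaky means it failed at least once but not every time. With only a
    # few attempts any failure exceeds 2%, so pass in the logs from many
    # runs to get a meaningful rate.
    if stats["failures"] < stats["runs"] and rate > FLAKE_THRESHOLD:
        flaky.append((name, rate))

if flaky:
    print("FLAKY TESTS DETECTED:")
    for name, rate in flaky:
        print(f"  {name}: {rate:.0%} failure rate")
    sys.exit(1)

print("No flaky tests detected.")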

Add performance gates. Not performance tests that run separately and produce a report nobody reads. Performance assertions that fail the build.

import time
import pytest

# `db` and `client` are assumed fixtures, and `seed_products` is assumed
# to be a helper from your own test utilities.

@pytest.mark.performance
def test_product_search_under_200ms(db, client):
    """
    Product search p95 must be under 200ms.
    This is a user-facing SLA, not a guideline.
    """
    seed_products(db, count=10_000)

    latencies = []
    for _ in range(50):
        start = time.perf_counter()
        response = client.get("/products/search?q=laptop")
        latencies.append(time.perf_counter() - start)
        # A fast error response must not count as a fast search.
        assert response.status_code == 200

    latencies.sort()
    p95 = latencies[int(len(latencies) * 0.95)]

    assert p95 < 0.200, (
        f"Search p95 latency is {p95:.3f}s, which exceeds the 200ms SLA. "
        f"Investigate before merging."
    )

When this fails, the build fails. Not a warning. Not a report in a dashboard nobody checks. A red build that blocks the merge.
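It helps to know how sensitive a gate like this is before you trust it. The sketch below computes a nearest-rank percentile, which for 50 samples picks the same element as the indexing in the test above, and runs it on synthetic latencies. The helper name and the sample data are mine.

```python
import math


def nearest_rank_percentile(samples, percentile):
    """Smallest sample with at least `percentile` percent of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(percentile * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic latencies in seconds: mostly fast requests plus a few slow outliers.
three_outliers = [0.050] * 47 + [0.400] * 3
two_outliers = [0.050] * 48 + [0.400] * 2

p95_three = nearest_rank_percentile(three_outliers, 95)
p95_two = nearest_rank_percentile(two_outliers, 95)
print(f"three slow requests: p95 = {p95_three:.3f}s")
print(f"two slow requests:   p95 = {p95_two:.3f}s")
```

With 50 samples, the difference between passing and failing is a single slow request. That is fine for catching a real regression, but on noisy shared CI runners you may want more samples or a dedicated runner before making the gate blocking.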

Closing the environment gap

The environment gap — CI behaves differently from production — is the hardest one to close completely, but you can get most of the way there with containers and infrastructure as code.

The principle: the same container that runs in CI should run in production. If your CI pipeline builds a Docker image, runs tests against that image, and then deploys that image — the same artifact, not a rebuilt one — you’ve eliminated an entire class of “works in CI, breaks in production” failures.

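# .github/workflows/deploy.yml
# Assumes an all-lowercase owner/repo (image names must be lowercase).
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    outputs:
      # Immutable digest of the pushed image, reused by the later jobs
      digest: ${{ steps.build.outputs.digest }}
    steps:
      - uses: actions/checkout@v4

      - name: Log in to registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build image
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}

  test:
    needs: build
    runs-on: ubuntu-latest
    permissions:
      packages: read
    steps:
      - name: Log in to registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Run tests against built image
        run: |
          docker run \
            --rm \
            -e DATABASE_URL="${{ secrets.TEST_DATABASE_URL }}" \
            ghcr.io/${{ github.repository }}@${{ needs.build.outputs.digest }} \
            pytest

  deploy:
    needs: [build, test]
    runs-on: ubuntu-latest
    steps:
      - name: Deploy the image that passed tests
        run: |
          # Deploy the exact digest that was tested
          # Not a new build. Not a rebuilt artifact. This image.
          # (Assumes cluster credentials are already configured.)
          kubectl set image deployment/app \
            app=ghcr.io/${{ github.repository }}@${{ needs.build.outputs.digest }}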

Build once. Test that artifact. Deploy that artifact. The image that runs in production is the image whose tests passed, not an image built from the same source code.
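Sharing the image removes one source of drift. Configuration is another: the environment variables and service versions mentioned earlier can still differ between CI and production. A small parity check can at least make that drift visible. This is a sketch under assumptions: the setting names and values are hypothetical, and how you export each environment’s non-secret settings is up to you.

```python
def config_drift(ci, prod):
    """Compare two environment descriptions and report where they differ."""
    return {
        "only_in_ci": sorted(ci.keys() - prod.keys()),
        "only_in_prod": sorted(prod.keys() - ci.keys()),
        "different_values": sorted(
            key for key in ci.keys() & prod.keys() if ci[key] != prod[key]
        ),
    }


# Hypothetical, non-secret settings: compare names and versions, not credentials.
ci_env = {"POSTGRES_VERSION": "14", "REDIS_VERSION": "7", "FEATURE_NEW_CHECKOUT": "on"}
prod_env = {"POSTGRES_VERSION": "16", "REDIS_VERSION": "7", "PAYMENT_TIMEOUT_S": "3"}

drift = config_drift(ci_env, prod_env)
for kind, keys in drift.items():
    print(f"{kind}: {keys}")
```

Some differences are intentional, such as a test database URL. The point is that each one becomes a decision someone made rather than something nobody noticed.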

The AI angle

I’d be leaving something out if I didn’t talk about what AI is doing to CI pipelines right now, because it’s relevant and mostly being ignored.

The most immediate practical use: AI-assisted flake detection. A model with access to your test history, your git blame, and your test code can identify which tests are likely to be flaky before they flake — flagging tests with timing dependencies, network calls without proper mocking, or shared state between test cases. This is tedious pattern-matching that humans do badly and models do well.

More interesting: AI-generated test cases for edge cases humans miss. Not replacing your test suite — augmenting it. After you write a function, you run a model against it that generates adversarial inputs: the empty string, the integer overflow, the null in the nested object, the unicode character in the field that was only tested with ASCII. This is where I’ve seen the most immediate value from AI in the testing workflow.

# Example prompt to generate edge case tests
EDGE_CASE_PROMPT = """
Given this function:

{function_source}

Generate pytest test cases that cover:
1. Empty/null inputs for each parameter
2. Boundary values (min, max, just over max)
3. Unexpected types
4. Concurrent access if applicable
5. Any domain-specific edge cases you can infer from the logic

Return only the test code, no explanation.
"""

The output needs review — models generate plausible-looking tests that sometimes test the wrong thing. But as a starting point for edge case coverage, it’s faster than starting from scratch and catches things that would otherwise ship.
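Part of that review can be mechanical. The sketch below fills an abbreviated version of the prompt and then checks that whatever comes back at least parses as Python and contains test functions, before a human spends time on it. No model is called here: `fake_model_output` is a stand-in, and the helper names are hypothetical.

```python
import ast

# Abbreviated version of the prompt template, for illustration only.
EDGE_CASE_PROMPT = """Given this function:

{function_source}

Generate pytest test cases for edge cases. Return only the test code."""


def build_prompt(function_source):
    """Fill the template with the source of the function under test."""
    return EDGE_CASE_PROMPT.format(function_source=function_source.strip())


def generated_test_names(code):
    """Parse model output and return the names of its test functions.

    Raises SyntaxError if the output is not valid Python.
    """
    tree = ast.parse(code)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name.startswith("test_")
    ]


source = "def slugify(title):\n    return title.lower().replace(' ', '-')\n"
prompt = build_prompt(source)

# Stand-in for a model response; replace with the output of whatever model you use.
fake_model_output = "def test_empty_title():\n    assert slugify('') == ''\n"
names = generated_test_names(fake_model_output)
print(names)
```

A syntax check says nothing about whether a test asserts the right thing. It only filters out output that isn’t worth reading, so the human review goes to the tests that might actually be wrong in interesting ways.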

What AI is not yet doing well: understanding whether a test suite is covering the right failure modes at a systems level. That requires understanding your production failure history, your architecture, your SLAs, and your users’ behaviour in ways that no model currently has clean access to. That’s still a human judgment call.

The build that’s always green

There’s a temptation, when a pipeline has gone red one too many times, to fix the symptom rather than the cause. Skip the flaky test. Raise the timeout. Add a retry. Get back to green.

This is how pipelines learn to lie.

The build that’s always green because it’s been optimised to pass is not a reliable signal — it’s a cargo cult. The ritual is preserved. The meaning has been lost. You merge with confidence because the status is green, and the status is green because you stopped checking for the things that would make it red.

A build that goes red is giving you information. The correct response to a red build is not to find the fastest path back to green — it’s to understand what the red build is telling you and address it at the root.

That discipline is harder than it sounds. It requires saying, when a test fails at 11pm before a deadline, “this failure means something and we need to understand it” rather than “rerun it and see if it passes.” It requires treating pipeline health as a product with its own standards, its own debt, its own refactoring cycles — not as background infrastructure that exists to turn green.

The teams that do this have pipelines that are slower, more opinionated, and more likely to block a merge. In my experience they also have far fewer production incidents, faster diagnosis when things do go wrong, and the actual confidence to deploy on a Friday.

A green build should mean something. Make yours mean it.