Why Tests Pass but Production Still Breaks

A green test suite can still ship a broken production workflow when the tests prove the expected path but not the production conditions around it.

That is why incidents sometimes start with the most annoying sentence in software delivery:

But the tests passed.

The sentence is usually true. It is also incomplete.

Tests are models of behavior. A passing suite says the system behaved correctly under the assumptions represented by that suite. Production then adds conditions the model may not include: retries, concurrent requests, dirty data, partial failures, mixed deployments, slow dependencies, and users doing the same thing twice.

This article is about that gap. The point is not that tests are useless. It is the opposite: tests become more useful when engineers can name what they do and do not prove.

If you are working through broader engineering habits, this fits into the same family as software engineering fundamentals: judgment comes from asking what reality can still do after the obvious checks are green. For the release path that connects production-shaped tests with contracts, flags, and migrations, see Testing And Software Delivery.


Why The Failure Feels Impossible

Imagine an API endpoint that creates an order.

The unit tests pass. The integration tests pass. The endpoint returns 201. The database row appears. Validation works. The pull request looks responsible.

Then production creates two orders for one customer action.

The team checks the test suite:

PASS creates an order
PASS rejects an invalid payload
PASS returns the existing order for the same idempotency key
PASS records the payment reference

Nothing in that list is fake. Each test may be valuable. The problem is that the tests did not exercise the part of production that actually mattered.

The duplicate order did not happen because the happy path was wrong. It happened because the happy path was correct only while requests were sequential, the payment provider responded cleanly, and the application recorded state before a retry arrived.

Production broke the workflow between those assumptions.


What A Passing Test Suite Actually Proves

A passing test suite is evidence, not a guarantee.

It proves something narrower than teams sometimes feel during a deploy:

Under the conditions this suite modeled, the system behaved as expected.

That is still valuable. Tests catch regressions, document expected behavior, speed up review, and make change safer. But they do not automatically prove reliability across every state the running system can enter.

Google's SRE book makes a useful distinction here: testing reduces uncertainty about reliability, and production tests exist because live systems are not the same as hermetic test environments. The chapter on testing for reliability is a good reminder that passing offline tests is one layer of confidence, not the whole reliability story.

That distinction matters most when a feature crosses a boundary:

  • a request writes to a database
  • a job is retried
  • an external provider may time out after committing a side effect
  • a deployment runs old and new code at the same time
  • a feature flag changes behavior for only part of traffic

Those are the places where tests often pass while production still finds a state the suite never represented.


A Production Failure The Tests Did Not Model

Consider a simplified checkout flow:

  1. Client sends POST /orders with an idempotency key.
  2. Service validates the request.
  3. Service creates an order.
  4. Service charges the payment provider.
  5. Service records the payment result.
  6. Service returns 201.

The integration test covers the normal path:

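// Assumes a supertest-style `request`, the service's `app`, and an `orders`
// repository are in scope, inside an async test function.
const response = await request(app)
  .post('/orders')
  .set('Idempotency-Key', 'checkout_123')
  .send({
    sku: 'pro-plan',
    quantity: 1,
    paymentToken: 'tok_valid',
  })

expect(response.status).toBe(201)
expect(response.body.status).toBe('paid')
expect(await orders.count()).toBe(1)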

It also covers a sequential duplicate:

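// Same request body as the normal-path test.
const payload = { sku: 'pro-plan', quantity: 1, paymentToken: 'tok_valid' }

// The first request runs to completion before the duplicate is sent.
await request(app).post('/orders').set('Idempotency-Key', 'checkout_123').send(payload)

const duplicate = await request(app)
  .post('/orders')
  .set('Idempotency-Key', 'checkout_123')
  .send(payload)

// The duplicate returns the existing order instead of creating a new one.
expect(duplicate.status).toBe(200)
expect(await orders.countByIdempotencyKey('checkout_123')).toBe(1)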

That looks like idempotency coverage. It is not enough.

Here is the production timeline that still breaks:

| Time | Production event | Hidden test gap |
| --- | --- | --- |
| 10:04:12 | Mobile client sends POST /orders | Normal request path is covered |
| 10:04:13 | App instance A creates an order draft | State is not final yet |
| 10:04:14 | Payment provider commits the charge | External side effect already happened |
| 10:04:15 | App instance A times out before saving the provider result | Test only modeled clean success or clean failure |
| 10:04:16 | Client retries with the same idempotency key | Sequential duplicate test assumed settled state |
| 10:04:16 | App instance B sees no completed payment record | Concurrent retry path was not covered |
| 10:04:17 | App instance B charges again or creates another order | The suite never raced two instances through incomplete state |

The suite did not lie. It just answered a smaller question than production asked.

The tests asked:

Does the endpoint work when one request completes before the duplicate request arrives?

Production asked:

Does the workflow remain safe when the first request partially succeeds, times out locally, and the retry arrives before durable state is complete?

From a correctness point of view, those are two different questions, and the suite only answered the first.
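The timeline above can be replayed deterministically without real threads or a real provider. The sketch below is one way to do that, not the service's actual code: `createWorld`, `naiveCheckout`, and `replayIncident` are hypothetical names, and the handler is written as a generator only so the replay can pause it at each I/O boundary and interleave two "instances" by hand.

```javascript
// In-memory stand-ins for the database and the payment provider (hypothetical).
function createWorld() {
  return {
    payments: new Map(), // local record: idempotency key -> charge id
    providerCharges: [], // what the provider actually committed
  }
}

// Naive check-then-act handler. Each `yield` marks an I/O boundary where
// another instance could run.
function* naiveCheckout(world, key) {
  const existing = world.payments.get(key)
  if (existing) return { status: 200, chargeId: existing }

  yield 'about to charge'
  const chargeId = `ch_${world.providerCharges.length + 1}`
  world.providerCharges.push({ chargeId, key }) // provider commits the charge

  yield 'charged, result not saved yet'
  world.payments.set(key, chargeId) // local state becomes durable
  return { status: 201, chargeId }
}

function replayIncident() {
  const world = createWorld()
  const key = 'checkout_123'

  // Instance A runs until the provider has charged, then stalls (the timeout).
  const instanceA = naiveCheckout(world, key)
  instanceA.next() // paused at 'about to charge'
  instanceA.next() // paused at 'charged, result not saved yet'

  // The client retries; instance B runs to completion against incomplete state.
  const instanceB = naiveCheckout(world, key)
  let step = instanceB.next()
  while (!step.done) step = instanceB.next()

  return { world, retryResponse: step.value }
}

const { world, retryResponse } = replayIncident()
console.log(retryResponse.status, world.providerCharges.length) // prints: 201 2
```

The retry gets a clean 201, and the provider holds two charges for one key. A sequential duplicate test cannot reach this state, because it never lets the second request start while the first is paused between the charge and the local write.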


What The Suite Did Not Prove

The most useful post-incident exercise is to translate the incident into assumptions.

| Test-suite assumption | Production reality | Better coverage |
| --- | --- | --- |
| Duplicate requests happen after the first request finishes | Retries can arrive while the first request is still running | Race two requests with the same idempotency key |
| Provider calls either succeed or fail cleanly | A provider can commit the side effect while your service times out | Model ambiguous external outcomes |
| Local state is immediately visible and final | Another app instance may observe incomplete state | Assert durable database invariants |
| Fixtures represent normal data | Real data includes stale, missing, migrated, or unexpected values | Use dirtier fixtures and mixed states |
| Deployment state is clean | Old code, new code, old schema, flags, and queued work overlap | Test backward-compatible rollout states |

This is a better conversation than "we need more tests."

The team may need more tests, but only the right kind. More happy-path tests would not have caught this incident. More sequential duplicate tests would not have caught it either. The missing coverage was about timing, persistence, and ambiguous side effects.


The Common Reasons Tests Pass But Production Breaks

Test data is cleaner than production data

Fixtures often represent the system the team wishes it had: valid relationships, complete fields, stable enum values, recent rows, and simple tenants.

Production data is rarely that tidy.

A handler might be tested with this shape:

const user = {
  id: 'user_123',
  email: 'dev@example.com',
  plan: 'pro',
  lastLoginAt: '2026-04-15T10:00:00Z',
}

Production may include rows where lastLoginAt is null, a legacy import stored the plan as premium, or the account is attached to a tenant that was migrated months ago.

The code can be correct for the fixture and still wrong for the real dataset.

This is one reason broad testing advice can be misleading. "Add coverage" is too vague. The better instruction is "add coverage for the data shapes production has already proven it can contain."
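As a rough illustration, a fixture set can carry the shapes mentioned above next to the clean one. Everything in this sketch is an assumption made for the example: the `normalizeUser` helper, the rule that a legacy `premium` plan maps to `pro`, and the extra user rows are hypothetical, not taken from a real schema.

```javascript
// Assumed mapping from legacy plan names to current ones (hypothetical).
const LEGACY_PLAN_NAMES = new Map([['premium', 'pro']])

// Hypothetical normalizer: tolerates legacy plan names and missing timestamps.
function normalizeUser(row) {
  return {
    id: row.id,
    email: row.email,
    plan: LEGACY_PLAN_NAMES.get(row.plan) ?? row.plan,
    lastLoginAt: row.lastLoginAt ? new Date(row.lastLoginAt) : null,
  }
}

// Fixtures shaped like data production has already shown it can contain.
const productionShapedUsers = [
  // the clean row most tests use
  { id: 'user_123', email: 'dev@example.com', plan: 'pro', lastLoginAt: '2026-04-15T10:00:00Z' },
  // legacy import: old plan name, never logged in
  { id: 'user_124', email: 'legacy@example.com', plan: 'premium', lastLoginAt: null },
  // migrated row: the field is missing entirely
  { id: 'user_125', email: 'migrated@example.com', plan: 'pro' },
]

const normalizedUsers = productionShapedUsers.map(normalizeUser)
console.log(normalizedUsers.map((user) => user.plan)) // every plan is 'pro'
```

The value is less in the normalizer than in the fixture list: each dirty row names a condition the handler has to survive, so a reviewer can see which data shapes the test actually represents.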

Time and ordering are too stable in tests

Tests usually run in a calm order:

  1. arrange state
  2. call function or endpoint
  3. wait for it to finish
  4. assert result

Production rarely waits that politely.

Two requests may hit the same row. A background job may observe state before the request path commits. A webhook may arrive before the page refresh that was expected to create local state. A retry may enter the workflow while the original attempt is still deciding whether it succeeded.

That is why idempotency, webhooks, reservations, payments, inventory updates, and status transitions deserve tests that include concurrency when duplicate side effects would be expensive.

For API boundary examples, How to Write API Integration Tests covers the kind of matrix that keeps these cases visible during review.

Mocks are more cooperative than real dependencies

Mocks are useful. They also make dependencies suspiciously well behaved.

A mock usually returns exactly one response, at exactly the expected time, with exactly the fields the application expects. Real dependencies are slower, noisier, versioned, throttled, retried, and occasionally ambiguous.

The most dangerous mock is not the one that fails. It is the one that makes an external system look atomic when it is not.

A payment provider can commit a charge while your service receives a timeout. A message broker can deliver the same event twice. A cache can return stale state. An HTTP client can retry a request after the server has already performed the write.

If the test double cannot express those outcomes, the suite may pass while the integration boundary remains under-tested.
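One way to give a test double that vocabulary is to script its outcomes and keep a separate ledger of what the "provider" really committed. The following is a minimal sketch under stated assumptions: `createFakePaymentProvider`, `ProviderTimeoutError`, and the outcome names are invented for this example, and `charge` is synchronous only to keep the sketch short (a real client would return a promise).

```javascript
class ProviderTimeoutError extends Error {}

// Hypothetical fake. Each call consumes the next scripted outcome:
//   'ok'                  -> charge is committed and returned
//   'declined'            -> nothing is committed, an error is thrown
//   'commit-then-timeout' -> charge is committed, but the caller sees a timeout
function createFakePaymentProvider(script) {
  const outcomes = [...script]
  const ledger = [] // charges the provider really committed

  return {
    ledger,
    charge(key, amountCents) {
      const outcome = outcomes.shift() ?? 'ok'
      if (outcome === 'declined') throw new Error('card declined')

      const charge = { id: `ch_${ledger.length + 1}`, key, amountCents }
      ledger.push(charge)

      if (outcome === 'commit-then-timeout') {
        throw new ProviderTimeoutError('provider response timed out')
      }
      return charge
    },
  }
}

const provider = createFakePaymentProvider(['commit-then-timeout'])
let seenByCaller
try {
  provider.charge('checkout_123', 4900)
  seenByCaller = 'success'
} catch (error) {
  seenByCaller = error instanceof ProviderTimeoutError ? 'timeout' : 'failure'
}
console.log(seenByCaller, provider.ledger.length) // prints: timeout 1
```

The caller saw a timeout while the ledger holds one charge. A mock that can only return success or throw before committing cannot put the application in that position, which is exactly the state the retry logic needs to handle.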

Rollout state is absent

Many suites test the final intended state. Production spends a lot of time between states.

During a real deploy, old and new application versions can run together. Database schema can be ahead of application code. A feature flag can send only part of traffic through the new path. A queue can contain work produced by the old version but consumed by the new one.

That gap is why safe database migrations are phased and why feature flags can increase system complexity when flag state becomes part of runtime behavior.

The final design may be correct. The transition can still be unsafe.
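A small example of testing the transition rather than the end state: during a deploy, a queue consumer may receive messages written by both releases. The payload shapes below are hypothetical (an old release that wrote `userId`, a new one that writes `accountId` and `currency`), as are `readReceiptJob` and the assumed `USD` default.

```javascript
// Hypothetical consumer that must accept work produced by either release.
function readReceiptJob(message) {
  const accountId = message.accountId ?? message.userId
  if (accountId === undefined) {
    throw new Error('receipt job has no account identifier')
  }
  return {
    accountId,
    orderId: message.orderId,
    currency: message.currency ?? 'USD', // assumed default for old messages
  }
}

// What the queue can contain while old and new code run side by side.
const queuedDuringDeploy = [
  { userId: 'user_123', orderId: 'ord_1' }, // produced by the old release
  { accountId: 'user_456', orderId: 'ord_2', currency: 'EUR' }, // produced by the new release
]

const jobs = queuedDuringDeploy.map(readReceiptJob)
console.log(jobs.map((job) => `${job.accountId}:${job.currency}`))
```

A suite that only feeds the consumer new-shape messages tests the final design. Adding the old-shape row tests the rollout.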


How To Add The Test That Was Actually Missing

After a production-only bug, do not start by adding an assertion near the line that failed. Start by naming the production condition that was missing from the suite.

For the duplicate order incident, a weak follow-up test would only repeat the sequential duplicate case with a different fixture.

A stronger follow-up test races the workflow:

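const key = 'checkout_123'

// Send both requests without waiting, so they overlap inside the service.
const [first, second] = await Promise.all([
  request(app).post('/orders').set('Idempotency-Key', key).send(payload),
  request(app).post('/orders').set('Idempotency-Key', key).send(payload),
])

// Either request may win the race, so both are held to the same allowed statuses.
const statuses = [first.status, second.status]
statuses.forEach((status) => expect([200, 201, 409]).toContain(status))
expect(statuses.some((status) => status === 200 || status === 201)).toBe(true)

// The durable invariant: one order and one charge for the key.
expect(await orders.countByIdempotencyKey(key)).toBe(1)
expect(await payments.countByIdempotencyKey(key)).toBe(1)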

The exact status code policy depends on the API. The invariant matters more than the surface response: one logical operation should create one durable order and one charge.

If the bug involved an ambiguous provider result, add a test for that state too:

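// Assumes `key` and `payload` from the race test, and that `providerLedger`
// and `TimeoutError` are test helpers in scope.
// Ambiguous outcome: the provider commits the charge, then the call times out.
paymentProvider.charge.mockImplementationOnce(async () => {
  await providerLedger.recordCharge({ id: 'ch_123', key })
  throw new TimeoutError('provider response timed out')
})

// The first attempt fails locally even though the charge exists.
await request(app).post('/orders').set('Idempotency-Key', key).send(payload)

// The retry must find the existing charge instead of creating a second one.
const retry = await request(app).post('/orders').set('Idempotency-Key', key).send(payload)

expect(retry.body.chargeId).toBe('ch_123')
expect(await providerLedger.countChargesForKey(key)).toBe(1)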

This is not a complete payment implementation. It is a test shape. It forces the suite to represent the thing production did: the external side effect may have succeeded even though the local request looked failed.

That is also where API idempotency keys become a design mechanism, not just a test case. The test should verify the storage rule that makes the operation safe.
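One possible shape for that storage rule is "claim the key before doing anything irreversible." The sketch below models it with an in-memory `Map`. `createIdempotencyStore`, `decide`, and the state names are hypothetical, and in a real service the claim would need to be atomic across instances, for example through a unique constraint on the key column.

```javascript
// In-memory stand-in for a table with a unique constraint on the key.
function createIdempotencyStore() {
  const records = new Map()
  return {
    // Models "insert if absent": exactly one caller per key gets claimed: true.
    claim(key) {
      const existing = records.get(key)
      if (existing) return { claimed: false, record: existing }
      const record = { key, state: 'in_progress', result: null }
      records.set(key, record)
      return { claimed: true, record }
    },
    complete(key, result) {
      const record = records.get(key)
      if (!record) throw new Error(`no claim for key ${key}`)
      record.state = 'completed'
      record.result = result
    },
  }
}

// Decide what a request may do before it touches the payment provider.
function decide(store, key) {
  const { claimed, record } = store.claim(key)
  if (claimed) return { action: 'charge' }
  if (record.state === 'completed') return { action: 'replay', result: record.result }
  return { action: 'reject_in_progress' } // e.g. respond 409 and let the client retry later
}

const store = createIdempotencyStore()
const firstAttempt = decide(store, 'checkout_123') // only this caller may charge
const earlyRetry = decide(store, 'checkout_123') // first attempt is still running
store.complete('checkout_123', { orderId: 'ord_1', chargeId: 'ch_123' })
const lateRetry = decide(store, 'checkout_123') // gets the stored result back
console.log(firstAttempt.action, earlyRetry.action, lateRetry.action)
// prints: charge reject_in_progress replay
```

Under this rule a retry that arrives during an ambiguous timeout never reaches the provider, because the claim is left in place rather than deleted. Resolving the stuck `in_progress` record (for example by asking the provider what happened to that key) is a separate step this sketch does not cover.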


What Good Incident Follow-Up Looks Like

A good test added after an incident has two jobs.

First, it prevents the specific regression from returning.

Second, it documents a production condition the system must survive.

That second job is the one teams often miss. The test should make the old hidden assumption visible.

Useful incident follow-up usually includes:

  • the production timeline that exposed the gap
  • the assumption the old suite made
  • the invariant the system must preserve
  • the smallest test that reproduces the missing condition
  • a note about where this condition could appear again

For the order example, the invariant is not "the endpoint returns 201." It is:

For one idempotency key and one logical checkout, the system records at most one order and at most one provider charge, even if the first attempt times out locally.
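Written as code, the invariant becomes a check that any test in the suite can reuse after any failure scenario. This is a sketch with assumed names: `findCheckoutViolations` and the `idempotencyKey` field on orders and charges are hypothetical stand-ins for whatever the real repositories expose.

```javascript
// Returns a list of human-readable violations; an empty list means the
// invariant holds for this key.
function findCheckoutViolations({ orders, charges }, key) {
  const orderCount = orders.filter((order) => order.idempotencyKey === key).length
  const chargeCount = charges.filter((charge) => charge.idempotencyKey === key).length

  const violations = []
  if (orderCount > 1) violations.push(`${orderCount} orders for ${key}`)
  if (chargeCount > 1) violations.push(`${chargeCount} charges for ${key}`)
  return violations
}

// Durable state as it looked after the duplicate-order incident.
const afterIncident = {
  orders: [
    { id: 'ord_1', idempotencyKey: 'checkout_123' },
    { id: 'ord_2', idempotencyKey: 'checkout_123' },
  ],
  charges: [
    { id: 'ch_1', idempotencyKey: 'checkout_123' },
    { id: 'ch_2', idempotencyKey: 'checkout_123' },
  ],
}

console.log(findCheckoutViolations(afterIncident, 'checkout_123'))
```

Because the check reads durable state rather than the HTTP response, it stays valid whether the API answers a duplicate with 200, 201, or 409.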

That is a stronger statement. It is also easier to review, because the test now explains what kind of production reality it represents.


Why More Tests Can Still Leave The Same Blind Spot

When tests pass but production breaks, the instinct is often to increase test count.

That can help. It can also create a larger suite with the same missing model.

The question is not only "do we have tests?" It is:

  • Which production condition does this test represent?
  • What important condition does it intentionally ignore?
  • Does the test assert the database state, the response, or both?
  • Does it cover the failure after the side effect, not only before it?
  • Does it include old and new state during rollout?
  • Would this test fail if the real incident happened again?

Those questions matter because some test suites become broad but shallow. They cover many endpoints, many validation paths, and many response shapes, but still avoid concurrency, partial failure, persistence semantics, and migration state.

That kind of suite is useful. It is just not sufficient for every risk.


A Review Checklist For Production-Shaped Tests

Use this checklist when a workflow matters enough that a green build would not be reassuring by itself.

  • What real production condition could make this path behave differently?
  • Can two callers perform the same logical operation at once?
  • What happens if the downstream side effect succeeds but the local request fails?
  • Does the test assert durable state after failure, not just the response?
  • Are duplicate delivery, retries, or replay possible?
  • Does the fixture include old, missing, null, or migrated data where relevant?
  • Could old and new app versions run together during deploy?
  • Does the mock hide behavior that the real dependency is allowed to produce?
  • Is there an invariant stronger than the exact status code?

The goal is not to test every possible universe. The goal is to stop pretending the calmest universe is the only one production will create.


What Junior Developers Should Take From This

For newer developers, this lesson can feel unfair. Testing is often taught as:

write tests, then trust the tests

A better version is:

write tests, then understand what reality they represent

That makes testing more powerful, not less.

When you add or review a test, ask what assumption it protects. Is the request sequential? Is the data clean? Is the dependency atomic? Is the rollout complete? Is the database behavior real or mocked?

Those questions are part of engineering judgment. They help turn tests from a checklist item into a tool for reasoning about systems.


Final Takeaway

Tests pass but production still breaks when the suite models a smaller, calmer version of the system than the one users actually exercise.

The durable response is not "test everything." It is to make the important assumptions visible, then add tests around the production conditions that have real consequence: retries, concurrency, dirty data, ambiguous dependency outcomes, persistence invariants, and rollout state.

A green build should give confidence. It should not end the conversation about what the software will do once production starts adding its own timing, history, and failure modes.