Flaky Integration Tests in CI: Find and Fix Nondeterministic Failures

Flaky integration tests in CI are usually not random. They are tests that depend on state, timing, ordering, or environment behavior the suite does not control.

That is why the same test can pass locally, pass on rerun, and still fail again tomorrow. The code may be correct. The test may also be revealing a real weakness in the way the system is exercised.

Integration tests are valuable because they cross real boundaries: request handling, authentication, database writes, transactions, queues, background workers, and response serialization. Those same boundaries make them sensitive to shared state and timing. If the suite does not isolate those conditions, CI becomes a coin toss instead of a release signal.

This article is part of Testing And Software Delivery. For the broader shape of API integration coverage, start with How to Write API Integration Tests. For the bigger reason green suites still miss production behavior, see Why Tests Pass but Production Still Breaks.

Why Flaky Integration Tests Usually Show Up In CI First

CI changes the conditions around a test.

Locally, a developer often runs one file, one worker, one database, and one machine that already has warm dependencies. CI may run many files in parallel, start fresh containers, use slower disks, share a database between workers, or schedule jobs differently under load.

That difference matters because integration tests often depend on more than one process.

Common hidden assumptions include:

Hidden assumption	CI reality	Typical symptom
Tests run in the same order every time	Workers split files differently	Failure disappears when one file runs alone
The database starts clean	A previous test left rows, sequences, jobs, or locks behind	Unique constraint errors or unexpected counts
Async work finishes immediately	Queues, timers, retries, and transactions complete later	Assertion reads state before the system settles
Clock time is irrelevant	Time zones, date boundaries, and scheduled work differ	Fails near midnight, month end, or daylight changes
External dependencies are ready instantly	Service containers may accept TCP before being app-ready	First test fails, rerun passes
One test owns the fixture	Parallel workers mutate the same account, tenant, or object	Random status, missing row, or already-processed row

The first useful move is not to add retries. It is to identify which assumption the flaky test is borrowing from the environment.

Start With Evidence, Not Reruns

A rerun can make a flaky test green without teaching you anything.

Before changing code, make CI failures leave enough evidence to classify the flake. At minimum, log these values when an integration test fails:

test name and file
CI job id and attempt number
worker id or shard id
database name or schema name
random seed, if the runner supports one
tenant/user/order ids used by the test
current time and configured time zone
last relevant database rows
pending queue jobs or background-worker attempts
dependency stub calls

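A small helper is often enough. This one assumes a Prisma-style db client and a TEST_WORKER_ID variable set by the CI configuration; runners expose their own worker variables (Jest, for example, sets JEST_WORKER_ID):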

async function dumpIntegrationTestState(testName: string) {
  const pendingJobs = await db.job.findMany({
    where: { status: { in: ['queued', 'running', 'retrying'] } },
    take: 20,
    orderBy: { createdAt: 'desc' },
  })

  const recentOrders = await db.order.findMany({
    take: 20,
    orderBy: { createdAt: 'desc' },
    select: {
      id: true,
      testRunId: true,
      status: true,
      updatedAt: true,
    },
  })

  console.error(
    JSON.stringify(
      {
        testName,
        workerId: process.env.TEST_WORKER_ID,
        ciRunId: process.env.GITHUB_RUN_ID,
        timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
        now: new Date().toISOString(),
        pendingJobs,
        recentOrders,
      },
      null,
      2
    )
  )
}

The output should make one question easier to answer:

Did this test fail because the application behavior is wrong, or because the test environment was not controlled?

Both are worth fixing. They just require different fixes.
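One runner-agnostic way to make sure a dump helper runs only when a test fails is to wrap the test body. The sketch below is illustrative: withFailureDump is a hypothetical name, not part of any test runner, and the dump argument stands for a helper such as dumpIntegrationTestState above.

```typescript
type TestBody = () => Promise<void>
type DumpFn = (testName: string) => Promise<void>

// Hypothetical wrapper: runs the test body and, only if it throws,
// collects diagnostic state before rethrowing the original failure.
function withFailureDump(testName: string, body: TestBody, dump: DumpFn): TestBody {
  return async () => {
    try {
      await body()
    } catch (error) {
      try {
        await dump(testName)
      } catch (dumpError) {
        // A broken dump must never mask the real test failure.
        console.error(`state dump failed for ${testName}:`, dumpError)
      }
      throw error
    }
  }
}

// Possible usage with a Jest-style `it` (assumed to exist in the test file):
// it('creates an order', withFailureDump('creates an order', async () => {
//   /* test body */
// }, dumpIntegrationTestState))
```

Because the original error is rethrown unchanged, the runner still reports the real assertion failure; the dump only adds context to the log.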

Classify The Flake Before Fixing It

Most flaky integration tests fit one of five buckets.

Bucket	What to look for	Better fix
Shared state	Reused emails, order ids, tenants, idempotency keys, or queues	Unique test data, per-worker namespaces, deterministic cleanup
Bad cleanup	Rows, locks, messages, files, or sequence values survive a test	Transaction rollback, truncation, schema recreation, isolated DBs
Timing assumption	Sleeps, polling gaps, background jobs, real clocks	Await observable state, freeze time, use bounded polling
Parallel worker conflict	Test passes alone but fails in full CI	Per-worker databases, schema names, ports, and fixture ownership
Environment readiness	First test fails, rerun passes	Health checks, explicit readiness probes, stable service bootstrap

Do not skip this step.

If the cause is shared state, increasing timeouts only makes the suite slower. If the cause is async work, truncating the database harder may hide the wrong thing. If the cause is a real race condition, marking the test flaky throws away useful signal.

Give Every Test Its Own Data

The easiest way to create a flaky integration suite is to reuse realistic-looking fixture values.

This looks harmless:

const user = await createUser({
  email: 'integration@example.com',
})

await request(app).post('/api/orders').set('Authorization', tokenFor(user)).send(payload)

It works until another test creates the same email, a previous run leaves the row behind, or two workers reserve the same account.

Prefer test data that carries a run namespace:

function testId(name: string) {
  return [
    process.env.GITHUB_RUN_ID ?? 'local',
    process.env.TEST_WORKER_ID ?? 'w0',
    name,
    crypto.randomUUID(),
  ].join('_')
}

const runId = testId('order-create')

const user = await createUser({
  email: `${runId}@example.test`,
  testRunId: runId,
})

const response = await request(app)
  .post('/api/orders')
  .set('Authorization', tokenFor(user))
  .set('Idempotency-Key', runId)
  .send(payload)

expect(response.status).toBe(201)
expect(await db.order.count({ where: { testRunId: runId } })).toBe(1)

The key idea is ownership. The test should be able to say, "these rows belong to this run."

That gives you safer assertions and safer cleanup. It also makes failure logs easier to read because the data points back to the test that created it.
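One practical wrinkle with namespaced emails: RFC 5321 limits the local part of an address to 64 octets, and a CI run id plus a worker id, a test name, and a UUID can reach that limit. A minimal sketch, assuming a hypothetical testEmail helper, that keeps the address unique but bounded:

```typescript
import { randomUUID } from 'node:crypto'

const MAX_LOCAL_PART = 64 // RFC 5321 limit for the part before "@"

// Hypothetical helper: builds a unique, length-bounded test email.
// The UUID is kept whole because it carries the uniqueness; the readable
// prefix (run id, worker id, test name) is what gets truncated.
function testEmail(prefixParts: string[], uuid: string = randomUUID()): string {
  const prefix = prefixParts
    .join('_')
    .toLowerCase()
    .replace(/[^a-z0-9_-]/g, '-')
  const room = MAX_LOCAL_PART - uuid.length - 1 // 1 for the "_" separator
  const local = `${prefix.slice(0, Math.max(room, 0))}_${uuid}`
  return `${local}@example.test`
}
```

Whether the application under test enforces the limit is a separate question; the helper only guarantees that the fixture itself is not the reason a validation step rejects the address.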

Choose A Database Cleanup Strategy Deliberately

Database cleanup is not a detail. It is part of the correctness model of an integration suite.

The common options each have trade-offs:

Cleanup strategy	Works well when	Watch out for
Transaction rollback	The app and test share one database connection or test scope	Harder when the app opens its own pool, commits, or runs workers
Table truncation	The schema is moderate and tests can reset between cases	Must include join tables, sequences, and dependent rows
Schema per worker	CI runs tests in parallel against one database server	Requires search path or connection configuration discipline
Database per worker	Strong isolation matters more than startup cost	Slower setup and more infrastructure work
Run-scoped data cleanup	Tests tag rows with a `testRunId`	Leaves risk if code can read across the test namespace by mistake

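For PostgreSQL, TRUNCATE can reset tables quickly, and RESTART IDENTITY resets the sequences owned by columns of the truncated tables. TRUNCATE takes an ACCESS EXCLUSIVE lock on each table it operates on, so it blocks every other concurrent operation on those tables until the transaction ends. The official PostgreSQL documentation covers the behavior and locking implications of TRUNCATE.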

A simple reset may look like this:

TRUNCATE TABLE
  outbox_events,
  payment_attempts,
  order_items,
  orders,
  users
RESTART IDENTITY CASCADE;

That can be fine for serial tests. It is dangerous if two workers share the same database and one worker truncates rows while another worker is asserting behavior.

For parallel CI, isolate by worker:

const workerId = process.env.TEST_WORKER_ID ?? '0'
const databaseUrl = `${process.env.TEST_DATABASE_URL}_${workerId}`

beforeAll(async () => {
  await createDatabaseIfMissing(databaseUrl)
  await runMigrations(databaseUrl)
})

beforeEach(async () => {
  await resetDatabase(databaseUrl)
})

If separate databases are too expensive, use separate schemas or strict testRunId scoping. The important rule is simple:

One worker should not be able to delete, mutate, or assert another worker's data.
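Appending a suffix to the raw connection string, as in the snippet above, assumes the URL ends with the database name. A slightly more defensive sketch parses the URL first so query parameters such as sslmode survive. The function names and the w prefix are illustrative assumptions:

```typescript
// Hypothetical helper: derives a per-worker database URL from a base URL.
function workerDatabaseUrl(baseUrl: string, workerId: string): string {
  const url = new URL(baseUrl)
  const dbName = url.pathname.replace(/^\//, '')
  if (dbName === '') {
    throw new Error('base database URL must include a database name')
  }
  // Keep only characters that are safe in an unquoted identifier.
  const safeWorker = workerId.replace(/[^a-zA-Z0-9_]/g, '_')
  url.pathname = `/${dbName}_w${safeWorker}`
  return url.toString()
}

// Same idea for schema-per-worker setups: one schema name per worker.
function workerSchemaName(workerId: string): string {
  return `test_w${workerId.replace(/[^a-zA-Z0-9_]/g, '_')}`.toLowerCase()
}
```

Failing loudly when the base URL has no database name avoids the quieter failure mode where every worker ends up pointing at the same default database.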

Replace Sleeps With Observable Conditions

Many flaky tests contain a sleep that used to be "long enough."

await request(app).post('/api/orders').send(payload)

await sleep(500)

const order = await db.order.findFirst({
  where: { externalReference: payload.externalReference },
})

expect(order?.status).toBe('confirmed')

This test is not waiting for the system. It is waiting for the clock.

If CI is slow, the background worker may not finish in 500 ms. If CI is fast, the sleep only wastes time. If the worker crashes, the test still waits and then fails with weak evidence.

Prefer bounded polling against the state that matters:

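// Minimal delay helper used by the polling loop below.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

async function eventually<T>(
  read: () => Promise<T>,
  assert: (value: T) => void,
  { timeoutMs = 5000, intervalMs = 100 } = {}
) {
  const deadline = Date.now() + timeoutMs
  let lastError: unknown

  // Always attempt at least once, so a failure never surfaces as `undefined`,
  // then keep retrying until the deadline passes.
  do {
    const value = await read()

    try {
      assert(value)
      return
    } catch (error) {
      lastError = error
      await sleep(intervalMs)
    }
  } while (Date.now() < deadline)

  // Rethrow the most recent assertion failure so the log shows what was observed.
  throw lastError
}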

await request(app).post('/api/orders').send(payload)

await eventually(
  () => db.order.findFirst({ where: { externalReference: payload.externalReference } }),
  (order) => {
    expect(order?.status).toBe('confirmed')
  }
)

This still has a timeout, but the timeout now protects a meaningful condition. The test waits for a durable effect, not for an arbitrary delay.

The same rule applies to queues, outbox relays, emails, webhooks, and cache invalidation. Wait for the observable behavior the system promises.
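One gap in plain polling is a job that fails permanently: the test still burns its full timeout before reporting anything. A sketch of a variant that stops early on a terminal failure state follows. The status names and the waitForJob helper are assumptions for illustration, not part of any queue library.

```typescript
type JobStatus = 'queued' | 'running' | 'retrying' | 'succeeded' | 'failed'

const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Hypothetical helper: polls a job's status until it succeeds, fails for good,
// or the deadline passes. Failing fast on 'failed' gives a clearer error than a timeout.
async function waitForJob(
  readStatus: () => Promise<JobStatus | undefined>,
  { timeoutMs = 5000, intervalMs = 100 } = {}
): Promise<void> {
  const deadline = Date.now() + timeoutMs
  let last: JobStatus | undefined

  do {
    last = await readStatus()
    if (last === 'succeeded') return
    if (last === 'failed') {
      throw new Error('job reached terminal state "failed"; not waiting further')
    }
    await delay(intervalMs)
  } while (Date.now() < deadline)

  throw new Error(
    `job did not finish within ${timeoutMs} ms (last status: ${last ?? 'not found'})`
  )
}
```

The two error messages separate "the system gave up" from "the system was too slow", which are different bugs with different fixes.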

Control Time When Time Is Not The Subject

Some flakes happen because the test accidentally depends on the real clock.

Examples:

an order expires at midnight UTC
a trial calculation crosses a month boundary
a token expires while CI is slow
a scheduled job runs during the test
local time zone differs from the CI runner time zone

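If the behavior under test is not "what happens as time passes," freeze the clock. The clock object below stands in for whatever fake-timer API the runner provides, such as Jest's jest.useFakeTimers() and jest.setSystemTime().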

beforeEach(() => {
  clock.freeze(new Date('2026-05-30T10:00:00.000Z'))
})

afterEach(() => {
  clock.restore()
})

Also make time zone explicit in CI:

env:
  TZ: UTC

Do not freeze time inside every layer blindly. The application, database, and test runner may get time from different places. If the database uses now() and the application uses a fake JavaScript clock, your test may still be inconsistent.

For workflows where database time matters, inject the timestamp as part of the command or assert with tolerances.
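One way to keep the application and the test on the same clock is to pass the time source in rather than reading it globally. A minimal sketch, assuming a hypothetical Clock interface and an order shape with an expiresAt field (neither comes from a specific library):

```typescript
interface Clock {
  now(): Date
}

const systemClock: Clock = { now: () => new Date() }

// Test double that always returns the same instant.
function fixedClock(iso: string): Clock {
  const instant = new Date(iso)
  if (Number.isNaN(instant.getTime())) throw new Error(`invalid timestamp: ${iso}`)
  // Return a fresh Date each time so callers cannot mutate the shared instant.
  return { now: () => new Date(instant.getTime()) }
}

interface Order {
  id: string
  expiresAt: Date
}

// The business rule takes the clock as a parameter instead of calling Date.now().
function isExpired(order: Order, clock: Clock = systemClock): boolean {
  return clock.now().getTime() >= order.expiresAt.getTime()
}
```

The same timestamp can then be sent to the database as a query parameter instead of relying on now(), so both layers agree on what "current time" means during the test.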

Make Service Startup Deterministic

Another common CI flake is the first integration test failing because a dependency was not ready.

A container can be running before the application inside it is ready to accept useful work. For example, PostgreSQL may accept connections before migrations finish. A local HTTP stub may open a port before it has loaded fixtures. A search service may respond before indexes are created.

In GitHub Actions, service containers support health checks through container options, and the platform documents service containers for CI dependencies in its official documentation.

The principle is not GitHub-specific:

services:
  postgres:
    image: postgres:16
    env:
      POSTGRES_PASSWORD: postgres
    options: >-
      --health-cmd pg_isready
      --health-interval 10s
      --health-timeout 5s
      --health-retries 5

Then make the test bootstrap explicit:

beforeAll(async () => {
  await waitForDatabase()
  await runMigrations()
  await resetDatabase()
  await waitForWorker()
})

Readiness should be something the system actually needs. An open port is weaker evidence than a passing health endpoint, and a passing health endpoint is weaker than applied migrations if the test needs the schema.
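Helpers such as waitForDatabase() can be thin wrappers around one bounded retry loop. The sketch below assumes a hypothetical waitUntilReady function and a caller-supplied probe; what the probe checks (a trivial query, a migrations table, a health endpoint) depends on what the tests need.

```typescript
const pause = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Hypothetical helper: retries a readiness probe a bounded number of times.
// The probe should throw (or reject) while the dependency is not ready.
async function waitUntilReady(
  name: string,
  probe: () => Promise<void>,
  { attempts = 30, delayMs = 500 } = {}
): Promise<number> {
  let lastError: unknown
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      await probe()
      return attempt // how many attempts it took, useful in CI logs
    } catch (error) {
      lastError = error
      // No need to wait after the final failed attempt.
      if (attempt < attempts) await pause(delayMs)
    }
  }
  throw new Error(`${name} not ready after ${attempts} attempts: ${String(lastError)}`)
}
```

Naming the dependency in the error keeps "first test fails, rerun passes" from showing up as an unrelated assertion failure further down the suite.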

Reproduce The CI Shape Locally

A flaky CI test should be forced through the same shape that made it fail.

Useful reproduction commands usually combine:

repeated runs
parallel workers
randomized order, if supported
the same time zone as CI
the same database reset mode
the same dependency stubs

For example:

TZ=UTC TEST_WORKER_ID=0 yarn test:integration --runInBand path/to/order.test.ts

Then increase pressure:

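for i in $(seq 1 50); do TZ=UTC yarn test:integration --maxWorkers=4 || break; done

The exact flags depend on the runner. The commands here use Jest's --runInBand and --maxWorkers; Jest has no built-in repeat flag, so a shell loop that stops at the first failure stands in for one. Playwright offers --workers and --repeat-each for the same purpose. Jest documents setup and teardown hooks such as beforeEach and afterEach in its official setup and teardown guide. Playwright's test documentation emphasizes keeping tests isolated in its best practices.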

The runner is less important than the discipline:

reproduce the failure shape
identify the borrowed assumption
remove the assumption
keep a regression test that would fail without the fix
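When the runner cannot randomize order itself, a seeded shuffle over the test file list gives a similar effect and stays reproducible as long as the seed is logged. A minimal sketch; seededShuffle and the small generator inside it are illustrative, and the generator is not suitable for anything beyond test ordering:

```typescript
// Small deterministic 32-bit generator (mulberry32-style); returns values in [0, 1).
function seededRandom(seed: number): () => number {
  let state = seed >>> 0
  return () => {
    state = (state + 0x6d2b79f5) >>> 0
    let t = state
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Fisher-Yates shuffle driven by the seeded generator; does not mutate the input.
function seededShuffle<T>(items: readonly T[], seed: number): T[] {
  const random = seededRandom(seed)
  const result = [...items]
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1))
    const tmp = result[i] as T
    result[i] = result[j] as T
    result[j] = tmp
  }
  return result
}
```

Rerunning with the seed from a failed CI job replays the same file order, which is what turns an order-dependent flake into a repeatable failure.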

Be Careful With Quarantine

Sometimes a team quarantines a flaky test so the release can continue. That can be reasonable during an incident. It should not become the normal fix.

If you quarantine, require:

owner
issue link
failure evidence
first observed date
reason it is safe to ignore temporarily
deadline for removal

Without those fields, quarantine becomes a quiet way to delete release signal.
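Those fields are easy to enforce mechanically. A sketch, assuming a hypothetical QuarantineEntry record kept next to the test suite and a check that CI could run; the field names mirror the list above and are not from any existing tool:

```typescript
interface QuarantineEntry {
  test: string
  owner: string
  issueUrl: string
  failureEvidence: string
  firstObserved: string // ISO date, e.g. "2026-05-30"
  safeToIgnoreBecause: string
  removeBy: string // ISO date after which the quarantine is no longer accepted
}

const REQUIRED_FIELDS: (keyof QuarantineEntry)[] = [
  'test', 'owner', 'issueUrl', 'failureEvidence',
  'firstObserved', 'safeToIgnoreBecause', 'removeBy',
]

// Returns a list of problems; an empty list means the entry is acceptable today.
function validateQuarantine(entry: QuarantineEntry, today: Date): string[] {
  const problems: string[] = []
  for (const field of REQUIRED_FIELDS) {
    if (entry[field].trim() === '') problems.push(`${field} is empty`)
  }
  const deadline = new Date(entry.removeBy)
  if (Number.isNaN(deadline.getTime())) {
    problems.push('removeBy is not a valid date')
  } else if (deadline.getTime() < today.getTime()) {
    problems.push(`quarantine for "${entry.test}" expired on ${entry.removeBy}`)
  }
  return problems
}
```

Failing the build when the list is non-empty makes an expired quarantine visible instead of letting it age silently.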

A flaky integration test is especially important because it often sits near a real boundary. It may be pointing at unclear transaction behavior, missing idempotency, unsafe shared fixtures, or a background job that cannot be observed reliably.

That overlaps with the problems covered in Database Transaction Boundaries in Backend APIs and How to Prevent Race Conditions in Backend Systems. When the flake touches those boundaries, treat it as design feedback, not only test maintenance.

A Flaky Integration Test Checklist

When an integration test flakes in CI, check these in order:

Can the failure log identify the test, worker, run id, database/schema, and relevant fixture ids?
Does the test use globally reused emails, ids, tenants, idempotency keys, files, ports, or queue names?
Can two workers mutate or clean the same rows?
Does cleanup include dependent tables, join tables, queues, outbox rows, and sequence state?
Does the test wait for durable state instead of sleeping?
Does it depend on real clock time, local time zone, or date boundaries?
Are service containers actually ready before tests start?
Does the failure reproduce only under parallelism?
Does the test assert behavior through the real boundary it claims to protect?
If quarantined, does it have an owner, reason, and deadline?

The fix should make the suite more deterministic, not just quieter.

Takeaway

A flaky integration test is a test with an uncontrolled dependency.

Sometimes that dependency is test data. Sometimes it is the database reset strategy. Sometimes it is the clock, a background worker, a service container, or a parallel CI worker. Sometimes it is a real product race condition that the test happened to expose.

Do not start by asking how to make the failure disappear. Ask what the test borrowed from the environment. Then make that dependency explicit, isolated, observable, or removed.