
Flaky Integration Tests in CI: Find and Fix Nondeterministic Failures
Flaky integration tests in CI are usually not random. They are tests that depend on state, timing, ordering, or environment behavior the suite does not control.
That is why the same test can pass locally, pass on rerun, and still fail again tomorrow. The code may be correct. The test may also be revealing a real weakness in the way the system is exercised.
Integration tests are valuable because they cross real boundaries: request handling, authentication, database writes, transactions, queues, background workers, and response serialization. Those same boundaries make them sensitive to shared state and timing. If the suite does not isolate those conditions, CI becomes a coin toss instead of a release signal.
This article is part of Testing And Software Delivery. For the broader shape of API integration coverage, start with How to Write API Integration Tests. For the bigger reason green suites still miss production behavior, see Why Tests Pass but Production Still Breaks.
Why Flaky Integration Tests Usually Show Up In CI First
CI changes the conditions around a test.
Locally, a developer often runs one file, one worker, one database, and one machine that already has warm dependencies. CI may run many files in parallel, start fresh containers, use slower disks, share a database between workers, or schedule jobs differently under load.
That difference matters because integration tests often depend on more than one process.
Common hidden assumptions include:
| Hidden assumption | CI reality | Typical symptom |
|---|---|---|
| Tests run in the same order every time | Workers split files differently | Failure disappears when one file runs alone |
| The database starts clean | A previous test left rows, sequences, jobs, or locks behind | Unique constraint errors or unexpected counts |
| Async work finishes immediately | Queues, timers, retries, and transactions complete later | Assertion reads state before the system settles |
| Clock time is irrelevant | Time zones, date boundaries, and scheduled work differ | Fails near midnight, month end, or daylight changes |
| External dependencies are ready instantly | Service containers may accept TCP before being app-ready | First test fails, rerun passes |
| One test owns the fixture | Parallel workers mutate the same account, tenant, or object | Random status, missing row, or already-processed row |
The first useful move is not to add retries. It is to identify which assumption the flaky test is borrowing from the environment.
Start With Evidence, Not Reruns
A rerun can make a flaky test green without teaching you anything.
Before changing code, make CI failures leave enough evidence to classify the flake. At minimum, log these values when an integration test fails:
- test name and file
- CI job id and attempt number
- worker id or shard id
- database name or schema name
- random seed, if the runner supports one
- tenant/user/order ids used by the test
- current time and configured time zone
- last relevant database rows
- pending queue jobs or background-worker attempts
- dependency stub calls
A small helper is often enough:
async function dumpIntegrationTestState(testName: string) {
  const pendingJobs = await db.job.findMany({
    where: { status: { in: ['queued', 'running', 'retrying'] } },
    take: 20,
    orderBy: { createdAt: 'desc' },
  })
  const recentOrders = await db.order.findMany({
    take: 20,
    orderBy: { createdAt: 'desc' },
    select: {
      id: true,
      testRunId: true,
      status: true,
      updatedAt: true,
    },
  })
  console.error(
    JSON.stringify(
      {
        testName,
        workerId: process.env.TEST_WORKER_ID,
        ciRunId: process.env.GITHUB_RUN_ID,
        timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
        now: new Date().toISOString(),
        pendingJobs,
        recentOrders,
      },
      null,
      2
    )
  )
}
The output should make one question easier to answer:
Did this test fail because the application behavior is wrong, or because the test environment was not controlled?
Both are worth fixing. They just require different fixes.
Classify The Flake Before Fixing It
Most flaky integration tests fit one of five buckets.
| Bucket | What to look for | Better fix |
|---|---|---|
| Shared state | Reused emails, order ids, tenants, idempotency keys, or queues | Unique test data, per-worker namespaces, deterministic cleanup |
| Bad cleanup | Rows, locks, messages, files, or sequence values survive a test | Transaction rollback, truncation, schema recreation, isolated DBs |
| Timing assumption | Sleeps, polling gaps, background jobs, real clocks | Await observable state, freeze time, use bounded polling |
| Parallel worker conflict | Test passes alone but fails in full CI | Per-worker databases, schema names, ports, and fixture ownership |
| Environment readiness | First test fails, rerun passes | Health checks, explicit readiness probes, stable service bootstrap |
Do not skip this step.
If the cause is shared state, increasing timeouts only makes the suite slower. If the cause is async work, truncating the database harder may hide the wrong thing. If the cause is a real race condition, marking the test flaky throws away useful signal.
Give Every Test Its Own Data
The easiest way to create a flaky integration suite is to reuse realistic-looking fixture values.
This looks harmless:
const user = await createUser({
  email: 'integration@example.com',
})
await request(app)
  .post('/api/orders')
  .set('Authorization', tokenFor(user))
  .send(payload)
It works until another test creates the same email, a previous run leaves the row behind, or two workers reserve the same account.
Prefer test data that carries a run namespace:
import { randomUUID } from 'node:crypto'

// Builds an identifier that is unique per CI run, worker, test, and invocation.
function testId(name: string) {
  return [
    process.env.GITHUB_RUN_ID ?? 'local',
    process.env.TEST_WORKER_ID ?? 'w0',
    name,
    randomUUID(),
  ].join('_')
}

const runId = testId('order-create')
const user = await createUser({
  email: `${runId}@example.test`,
  testRunId: runId,
})
const response = await request(app)
  .post('/api/orders')
  .set('Authorization', tokenFor(user))
  .set('Idempotency-Key', runId)
  .send(payload)
expect(response.status).toBe(201)
// Count only the rows this test owns, so other workers cannot change the result.
expect(await db.order.count({ where: { testRunId: runId } })).toBe(1)
The key idea is ownership. The test should be able to say, "these rows belong to this run."
That gives you safer assertions and safer cleanup. It also makes failure logs easier to read because the data points back to the test that created it.
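The ownership idea can be sketched as a small registry that remembers what a test created and removes only those things afterwards. This is an illustration under assumptions: RunScope, track, and cleanup are hypothetical names, and each registered step stands in for whatever delete your data layer offers (for example, a delete filtered by testRunId).

```typescript
// Hypothetical helper: records cleanup steps for the rows a single test owns.
type CleanupStep = () => Promise<void>

class RunScope {
  readonly runId: string
  private steps: CleanupStep[] = []

  constructor(runId: string) {
    this.runId = runId
  }

  // Register how to remove something this test created.
  track(step: CleanupStep): void {
    this.steps.push(step)
  }

  // Undo in reverse creation order so dependent rows go before their parents.
  // Every step runs even if an earlier one fails; failures are reported together.
  async cleanup(): Promise<void> {
    const errors: unknown[] = []
    for (const step of [...this.steps].reverse()) {
      try {
        await step()
      } catch (error) {
        errors.push(error)
      }
    }
    this.steps = []
    if (errors.length > 0) {
      throw new Error(`cleanup for ${this.runId} had ${errors.length} failing step(s)`)
    }
  }
}
```

A test would create one scope per run id, call track right after each insert, and call cleanup from an afterEach hook. Because every step is scoped to the run id, the cleanup cannot remove rows that belong to another worker.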
Choose A Database Cleanup Strategy Deliberately
Database cleanup is not a detail. It is part of the correctness model of an integration suite.
The common options each have trade-offs:
| Cleanup strategy | Works well when | Watch out for |
|---|---|---|
| Transaction rollback | The app and test share one database connection or test scope | Harder when the app opens its own pool, commits, or runs workers |
| Table truncation | The schema is moderate and tests can reset between cases | Must include join tables, sequences, and dependent rows |
| Schema per worker | CI runs tests in parallel against one database server | Requires search path or connection configuration discipline |
| Database per worker | Strong isolation matters more than startup cost | Slower setup and more infrastructure work |
| Run-scoped data cleanup | Tests tag rows with a testRunId | Leaves risk if code can read across the test namespace by mistake |
For PostgreSQL, TRUNCATE can reset tables quickly, and RESTART IDENTITY resets associated sequences. The official PostgreSQL documentation covers the behavior and locking implications of TRUNCATE.
A simple reset may look like this:
TRUNCATE TABLE
  outbox_events,
  payment_attempts,
  order_items,
  orders,
  users
RESTART IDENTITY CASCADE;
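That can be fine for serial tests. It is dangerous if two workers share the same database: one worker can truncate rows while another is asserting on them, and because TRUNCATE takes an ACCESS EXCLUSIVE lock on each table, it also blocks the other worker's reads and writes until the truncating transaction finishes.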
For parallel CI, isolate by worker:
const workerId = process.env.TEST_WORKER_ID ?? '0'
const databaseUrl = `${process.env.TEST_DATABASE_URL}_${workerId}`

beforeAll(async () => {
  await createDatabaseIfMissing(databaseUrl)
  await runMigrations(databaseUrl)
})

beforeEach(async () => {
  await resetDatabase(databaseUrl)
})
If separate databases are too expensive, use separate schemas or strict testRunId scoping. The important rule is simple:
One worker should not be able to delete, mutate, or assert another worker's data.
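Appending the worker id to the whole connection string only works when the string ends with the database name. The following is a minimal sketch of a more careful derivation, assuming a standard URL-shaped connection string; workerDatabaseUrl is a hypothetical helper, not part of any driver.

```typescript
// Hypothetical helper: derive an isolated database URL per CI worker.
function workerDatabaseUrl(baseUrl: string, workerId: string): string {
  const url = new URL(baseUrl)
  // Keep only characters that are safe in an unquoted database name.
  const suffix = workerId.replace(/[^a-zA-Z0-9_]/g, '_')
  // Append to the path (the database name), not to the whole string,
  // so query parameters such as sslmode stay intact.
  url.pathname = `${url.pathname}_${suffix}`
  return url.toString()
}
```

In the bootstrap shown earlier, the result of this function would replace the template string, after checking that TEST_DATABASE_URL is actually set. The same shape works for schema-per-worker setups if the suffix is applied to a schema name instead of the path.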
Replace Sleeps With Observable Conditions
Many flaky tests contain a sleep that used to be "long enough."
await request(app).post('/api/orders').send(payload)
await sleep(500)
const order = await db.order.findFirst({
  where: { externalReference: payload.externalReference },
})
expect(order?.status).toBe('confirmed')
This test is not waiting for the system. It is waiting for the clock.
If CI is slow, the background worker may not finish in 500 ms. If CI is fast, the sleep only wastes time. If the worker crashes, the test still waits and then fails with weak evidence.
Prefer bounded polling against the state that matters:
async function eventually<T>(
  read: () => Promise<T>,
  assert: (value: T) => void,
  { timeoutMs = 5000, intervalMs = 100 } = {}
) {
  const deadline = Date.now() + timeoutMs
  let lastError: unknown
  // do...while guarantees at least one attempt, so lastError is always set
  // before the final throw, even when timeoutMs is 0.
  do {
    const value = await read()
    try {
      assert(value)
      return
    } catch (error) {
      lastError = error
      await sleep(intervalMs)
    }
  } while (Date.now() < deadline)
  // Rethrow the last assertion failure so the output shows the final observed state.
  throw lastError
}

await request(app).post('/api/orders').send(payload)
await eventually(
  () => db.order.findFirst({ where: { externalReference: payload.externalReference } }),
  (order) => {
    expect(order?.status).toBe('confirmed')
  }
)
This still has a timeout, but the timeout now protects a meaningful condition. The test waits for a durable effect, not for an arbitrary delay.
The same rule applies to queues, outbox relays, emails, webhooks, and cache invalidation. Wait for the observable behavior the system promises.
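Bounded polling can also fail fast. If the observed state is already terminal (a job marked failed, for example), waiting for the full timeout only delays a weaker error message. The sketch below is one possible shape, not an established API: pollUntil, Verdict, and the judge callback are hypothetical names.

```typescript
// Hypothetical variant of bounded polling that stops early on a terminal state.
type Verdict = 'done' | 'pending' | 'failed'

interface PollOptions {
  timeoutMs?: number
  intervalMs?: number
}

const pause = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

async function pollUntil<T>(
  read: () => Promise<T>,
  judge: (value: T) => Verdict,
  { timeoutMs = 5000, intervalMs = 100 }: PollOptions = {}
): Promise<T> {
  const deadline = Date.now() + timeoutMs
  for (;;) {
    const value = await read()
    const verdict = judge(value)
    if (verdict === 'done') return value
    // A terminal failure will never become 'done', so stop waiting now.
    if (verdict === 'failed') {
      throw new Error(`terminal state reached: ${JSON.stringify(value)}`)
    }
    // Give up if another interval would pass the deadline.
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`timed out after ${timeoutMs} ms, last value: ${JSON.stringify(value)}`)
    }
    await pause(intervalMs)
  }
}
```

Both error messages include the last value read, so the CI log shows what the system looked like when the test gave up.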
Control Time When Time Is Not The Subject
Some flakes happen because the test accidentally depends on the real clock.
Examples:
- an order expires at midnight UTC
- a trial calculation crosses a month boundary
- a token expires while CI is slow
- a scheduled job runs during the test
- local time zone differs from the CI runner time zone
If the behavior under test is not "what happens as time passes," freeze the clock.
beforeEach(() => {
  clock.freeze(new Date('2026-05-30T10:00:00.000Z'))
})

afterEach(() => {
  clock.restore()
})
Also make time zone explicit in CI:
env:
  TZ: UTC
Do not freeze time inside every layer blindly. The application, database, and test runner may get time from different places. If the database uses now() and the application uses a fake JavaScript clock, your test may still be inconsistent.
For workflows where database time matters, inject the timestamp as part of the command or assert with tolerances.
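One way to inject time is to pass a clock into the code that needs it, and to compare database-written timestamps with a tolerance rather than strict equality. The following is a minimal sketch; Clock, fixedClock, isExpired, and withinTolerance are hypothetical names chosen for illustration.

```typescript
// Hypothetical clock abstraction: the application receives it instead of calling new Date().
interface Clock {
  now(): Date
}

// Production wiring would use the real clock.
const systemClock: Clock = { now: () => new Date() }

function fixedClock(iso: string): Clock {
  const instant = new Date(iso)
  // Return a copy each time so callers cannot mutate the frozen instant.
  return { now: () => new Date(instant.getTime()) }
}

// Example domain rule that takes time as an input rather than reading it.
function isExpired(expiresAt: Date, clock: Clock): boolean {
  return clock.now().getTime() >= expiresAt.getTime()
}

// For timestamps written by the database, compare within a tolerance.
function withinTolerance(actual: Date, expected: Date, toleranceMs: number): boolean {
  return Math.abs(actual.getTime() - expected.getTime()) <= toleranceMs
}
```

With this shape, the test passes a fixed clock and the assertion no longer depends on how slow the CI runner is. The tolerance check is for values the test cannot control, such as a column filled by now() in the database.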
Make Service Startup Deterministic
Another common CI flake is the first integration test failing because a dependency was not ready.
A container can be running before the application inside it is ready to accept useful work. For example, PostgreSQL may accept connections before migrations finish. A local HTTP stub may open a port before it has loaded fixtures. A search service may respond before indexes are created.
In GitHub Actions, service containers support health checks through container options, and the platform documents service containers for CI dependencies in its official documentation.
The principle is not GitHub-specific:
services:
  postgres:
    image: postgres:16
    env:
      POSTGRES_PASSWORD: postgres
    options: >-
      --health-cmd pg_isready
      --health-interval 10s
      --health-timeout 5s
      --health-retries 5
Then make the test bootstrap explicit:
beforeAll(async () => {
  await waitForDatabase()
  await runMigrations()
  await resetDatabase()
  await waitForWorker()
})
Readiness should be defined by what the tests actually need. An open port is weaker than a passing health endpoint, and a passing health endpoint is weaker than applied migrations if the test needs the schema.
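The layering of readiness checks can be sketched as an ordered list of probes, each stronger than the one before it. This is an assumption-laden illustration: waitForReady and Probe are hypothetical, and the probe bodies (a TCP connect, a health request, a query against a migrations table) are left to the caller.

```typescript
// Hypothetical layered readiness check: probes run in order, weakest first.
interface Probe {
  name: string
  check: () => Promise<boolean>
}

const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

async function waitForReady(
  probes: Probe[],
  { attempts = 30, intervalMs = 500 } = {}
): Promise<void> {
  for (const probe of probes) {
    let ready = false
    for (let attempt = 1; attempt <= attempts && !ready; attempt++) {
      // A probe that rejects (connection refused, for example) counts as not ready.
      ready = await probe.check().catch(() => false)
      if (!ready && attempt < attempts) await delay(intervalMs)
    }
    // Naming the failed layer makes the CI log say what was missing.
    if (!ready) throw new Error(`not ready: ${probe.name} after ${attempts} attempts`)
  }
}
```

Because the error names the probe that never passed, a failed bootstrap reads as "migrations were not applied" rather than as a failure in the first test.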
Reproduce The CI Shape Locally
A flaky CI test should be forced through the same shape that made it fail.
Useful reproduction commands usually combine:
- repeated runs
- parallel workers
- randomized order, if supported
- the same time zone as CI
- the same database reset mode
- the same dependency stubs
For example:
TZ=UTC TEST_WORKER_ID=0 yarn test:integration --runInBand path/to/order.test.ts
Then increase pressure:
TZ=UTC yarn test:integration --maxWorkers=4 --repeat=50
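The exact flags depend on the runner. In these commands, --runInBand and --maxWorkers are Jest options, while --repeat=50 stands in for whatever repetition mechanism your runner provides (Playwright, for example, has --repeat-each); a shell loop works where the runner has none. Jest documents setup and teardown hooks such as beforeEach and afterEach in its official setup and teardown guide. Playwright's test documentation emphasizes keeping tests isolated in its best practices.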
The runner is less important than the discipline:
- reproduce the failure shape
- identify the borrowed assumption
- remove the assumption
- keep a regression test that would fail without the fix
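Reproducing the failure shape is easier when a randomized order can be replayed, which is why the random seed belongs in the failure log. The sketch below shows the idea independently of any runner: mulberry32 is a small, widely used seeded generator, and shuffleWithSeed is a hypothetical helper. Real runners implement their own ordering, so treat this as an illustration of the principle only.

```typescript
// Small seeded PRNG (mulberry32). Any deterministic generator would do here.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0
  return () => {
    state = (state + 0x6d2b79f5) >>> 0
    let t = state
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Fisher-Yates shuffle driven by the seeded generator; the input is not mutated.
function shuffleWithSeed<T>(items: readonly T[], seed: number): T[] {
  const random = mulberry32(seed)
  const result = [...items]
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1))
    const tmp = result[i]
    result[i] = result[j]
    result[j] = tmp
  }
  return result
}
```

The same seed always yields the same order, so a failure that depends on one test running before another can be replayed locally instead of waiting for it to recur.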
Be Careful With Quarantine
Sometimes a team quarantines a flaky test so the release can continue. That can be reasonable during an incident. It should not become the normal fix.
If you quarantine, require:
- owner
- issue link
- failure evidence
- first observed date
- reason it is safe to ignore temporarily
- deadline for removal
Without those fields, quarantine becomes a quiet way to delete release signal.
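The required fields can be enforced mechanically, for example by a CI step that reads a quarantine list and fails when an entry is incomplete or past its deadline. The following is a sketch under assumptions: the QuarantineEntry shape and the quarantineProblems function are hypothetical, and how the list is stored is left open.

```typescript
// Hypothetical quarantine record; the fields mirror the list above.
type QuarantineEntry = {
  testName: string
  owner: string
  issueLink: string
  failureEvidence: string
  firstObserved: string // ISO date
  safeToIgnoreBecause: string
  removeBy: string // ISO date
}

// Returns a list of problems; an empty list means the entry is acceptable today.
function quarantineProblems(entry: QuarantineEntry, today: Date): string[] {
  const problems: string[] = []
  for (const [field, value] of Object.entries(entry)) {
    if (value.trim() === '') problems.push(`${field} is empty`)
  }
  const deadline = new Date(entry.removeBy)
  if (Number.isNaN(deadline.getTime())) {
    problems.push('removeBy is not a valid date')
  } else if (deadline.getTime() < today.getTime()) {
    problems.push(`quarantine expired on ${entry.removeBy}`)
  }
  return problems
}
```

Failing the build on an expired entry turns the deadline from a promise into a constraint: the test has to be fixed, deleted on purpose, or re-justified with a new date.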
A flaky integration test is especially important because it often sits near a real boundary. It may be pointing at unclear transaction behavior, missing idempotency, unsafe shared fixtures, or a background job that cannot be observed reliably.
That overlaps with the problems covered in Database Transaction Boundaries in Backend APIs and How to Prevent Race Conditions in Backend Systems. When the flake touches those boundaries, treat it as design feedback, not only test maintenance.
A Flaky Integration Test Checklist
When an integration test flakes in CI, check these in order:
- Can the failure log identify the test, worker, run id, database/schema, and relevant fixture ids?
- Does the test use globally reused emails, ids, tenants, idempotency keys, files, ports, or queue names?
- Can two workers mutate or clean the same rows?
- Does cleanup include dependent tables, join tables, queues, outbox rows, and sequence state?
- Does the test wait for durable state instead of sleeping?
- Does it depend on real clock time, local time zone, or date boundaries?
- Are service containers actually ready before tests start?
- Does the failure reproduce only under parallelism?
- Does the test assert behavior through the real boundary it claims to protect?
- If quarantined, does it have an owner, reason, and deadline?
The fix should make the suite more deterministic, not just quieter.
Takeaway
A flaky integration test is a test with an uncontrolled dependency.
Sometimes that dependency is test data. Sometimes it is the database reset strategy. Sometimes it is the clock, a background worker, a service container, or a parallel CI worker. Sometimes it is a real product race condition that the test happened to expose.
Do not start by asking how to make the failure disappear. Ask what the test borrowed from the environment. Then make that dependency explicit, isolated, observable, or removed.