How to Prevent Race Conditions in Backend Systems

Race conditions are one of the most common reasons backend systems look correct in development and break under real traffic.

The code path seems reasonable. Each request seems valid on its own. The bug appears only when timing changes: two requests overlap, a retry arrives early, or a worker processes the same logical action twice.

That is why race conditions are not mainly about "bad code." They are about correctness depending on event order that the system does not actually control.

For API-specific paths where timing affects retries, integration tests, webhooks, and versioning, see the API Correctness hub.


What A Race Condition Really Is

A race condition happens when the outcome of a workflow depends on which concurrent operation reaches a shared boundary first.

That boundary might be a row update, an insert that must stay unique, a side effect like sending an email or charging a card, a worker claiming a job, or a state transition on a shared record.

The key point is not just that two things happen at once. It is that the final result changes depending on the interleaving.

That is why race conditions are so common in backend systems: multiple requests mutate the same resource, retries replay the same logical action, background workers pick up related work, external callbacks arrive more than once, and two services react to the same state from different directions.


The Short Answer: How To Prevent Them

The most reliable tools are:

  • enforce invariants in the database
  • use optimistic locking when conflicts should fail fast
  • use pessimistic locking when only one actor may proceed
  • add idempotency around retried writes and duplicate delivery
  • make background jobs safe to run more than once
  • test concurrency explicitly

The right choice depends on the invariant you are protecting.

That is the first important shift in thinking: do not start with the tool. Start with the invariant.


A Concrete Example: Inventory Oversell

Suppose two users try to buy the last item at the same time.

The handler looks like this:

async function purchaseProduct(productId: string) {
  // read current stock
  const product = await db.product.findUnique({
    where: { id: productId },
  });

  // check: true at the moment of the read, not at the moment of the write
  if (!product || product.stock <= 0) {
    throw new Error('Out of stock');
  }

  // write based on the value read above; another request
  // may have changed stock in between
  await db.product.update({
    where: { id: productId },
    data: { stock: product.stock - 1 },
  });
}

This looks correct if you read it sequentially.

Under concurrency:

  1. request A reads stock = 1
  2. request B reads stock = 1
  3. request A updates stock to 0
  4. request B also updates stock to 0

Now two purchases succeeded for one remaining item.

The subtraction was not the problem. The read-check-write sequence was.


A Timeline That Makes The Bug Obvious

Race conditions often feel abstract until you write the actual order of events down.

Request A reads stock = 1 and pauses briefly before updating. Request B reads the same value a few milliseconds later. Both handlers independently conclude that the purchase is valid because each observed a true statement at the time it looked. By the time they write, the invariant has already been broken.

That is why these bugs are so easy to miss in code review. Each line looks reasonable in isolation. The failure appears only when two locally reasonable sequences interleave in a way the code never explicitly defended against.

This is also why reproducing race conditions often requires concurrency, retries, or timing pressure rather than just the "right input."


Why Backend Code Produces These Bugs So Easily

Most backend code follows a familiar pattern:

  1. read current state
  2. decide what should happen
  3. write new state

That pattern feels atomic when reading code. In production it usually is not.

Between the read and the write, another actor may update the same row, insert a conflicting row, retry the same business action, claim the same job, or perform the same state transition first.

That gap is where race conditions live.


The Most Common Race Condition Patterns

Duplicate writes after retries

Examples include duplicate orders, duplicate charges, and duplicate subscription changes.

If retries are part of the story, idempotency is often the right starting point. See API Idempotency Keys.

Double processing in background jobs

Examples include a worker crashing after performing a side effect but before acknowledgment, two workers claiming work that should be exclusive, or a retried job repeating an external call.

That side of the problem connects directly to Background Jobs in Production and PostgreSQL Job Queues with SKIP LOCKED.

Shared-state conflicts

Examples include overselling stock, double-booking a slot, lost account-balance updates, or stale permission and role transitions.

Event and webhook duplication

Examples include the same event being processed twice, callbacks arriving out of order, or one state transition being applied more than once.

For webhook-specific delivery behavior, see Webhook Idempotency and Retries in Production.


Why Transactions Alone Often Do Not Solve It

A very common assumption is:

If I wrap it in a transaction, the race condition is fixed.

Sometimes. Often not.

Transactions make the statements inside them atomic: they commit or roll back together. They do not automatically guarantee that your business invariant is protected from every competing transaction.

That depends on isolation level, lock behavior, unique constraints, query shape, retry behavior, and where side effects happen relative to commit.
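To make that concrete, here is the earlier purchase handler wrapped in a transaction, sketched against the same Prisma-style db client. Under the common READ COMMITTED default, both transactions can read stock = 1 before either commits, so the oversell survives the transaction:

async function purchaseProductTx(productId: string) {
  // A transaction makes these statements commit together; it does not
  // serialize them against other transactions under READ COMMITTED.
  await db.$transaction(async (tx) => {
    const product = await tx.product.findUnique({
      where: { id: productId },
    });

    if (!product || product.stock <= 0) {
      throw new Error('Out of stock');
    }

    // Two concurrent transactions can both reach this line having
    // read stock = 1, and both commit successfully.
    await tx.product.update({
      where: { id: productId },
      data: { stock: product.stock - 1 },
    });
  });
}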

The practical question is not:

Am I using a transaction?

It is:

Which concurrent interleavings can still violate the invariant I care about?

If you want the database-level background for that, see SQL Isolation Levels Explained.


The False Fixes Teams Reach For First

Race-condition fixes often go wrong because the first patch removes the visible symptom without protecting the invariant.

Common examples are adding an in-memory lock in one application instance, adding a retry without making the operation idempotent, or moving the code into a transaction without checking whether the database is actually enforcing the business rule.

These fixes can make the incident harder to reproduce while leaving the core failure mode intact. That is one reason race conditions sometimes "go quiet" after a patch and then return weeks later under slightly different timing.

The durable fix is almost always the one that moves correctness into a boundary the system can actually enforce: a constraint, a lock, an atomic update, an idempotency layer, or a state machine that rejects invalid transitions.


The Main Ways To Prevent Race Conditions

1. Enforce invariants in the database

This is usually the strongest default.

Examples include unique constraints, foreign keys, check constraints, and conditional updates.

For stock control, this is safer than "read then subtract later":

UPDATE products
SET stock = stock - 1
WHERE id = $1
  AND stock > 0;

If the update affects 0 rows, the invariant blocked the oversell.

That is much safer than trusting application code to observe one stable moment in time.
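In application code, the same idea becomes a single conditional update whose affected-row count tells you whether the invariant held. A minimal sketch, again assuming a Prisma-style client (updateMany reports how many rows matched):

async function purchaseProductAtomic(productId: string) {
  // One atomic statement: the database checks stock > 0 and
  // decrements it in the same operation, so there is no read-write gap.
  const result = await db.product.updateMany({
    where: { id: productId, stock: { gt: 0 } },
    data: { stock: { decrement: 1 } },
  });

  if (result.count === 0) {
    // Zero rows matched: the invariant blocked the oversell.
    throw new Error('Out of stock');
  }
}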

2. Use optimistic locking when conflicts should fail fast

Optimistic locking works well when contention exists but is not constant, blocking would be expensive, and retries are acceptable.

The typical pattern is to read the row with a version, update it with WHERE version = ?, and retry or surface a conflict if no row changed.

Example:

UPDATE accounts
SET balance = balance - 100,
    version = version + 1
WHERE id = 42
  AND version = 7;

If zero rows update, another writer got there first.
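In application code this usually becomes a small retry loop around the version-checked update. A sketch, assuming a Prisma-style client and an account model that carries a version column:

async function debitAccount(accountId: number, amount: number, maxAttempts = 3) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const account = await db.account.findUnique({ where: { id: accountId } });
    if (!account || account.balance < amount) {
      throw new Error('Insufficient funds');
    }

    // Succeeds only if nobody changed the row since we read it.
    const result = await db.account.updateMany({
      where: { id: accountId, version: account.version },
      data: {
        balance: account.balance - amount,
        version: account.version + 1,
      },
    });

    if (result.count === 1) return; // our write won
    // Another writer got there first: re-read and try again.
  }
  throw new Error('Too much contention, giving up');
}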

For the full trade-off, see Optimistic vs Pessimistic Locking in SQL.

3. Use pessimistic locking when only one actor may proceed

Sometimes the simplest correct model is: one transaction gets exclusive access, the others wait or fail.

Typical example:

SELECT *
FROM jobs
WHERE id = $1
FOR UPDATE;

This is useful when duplicate success would be expensive, the invariant is strict, and contention is common.

The downside is obvious: more blocking, more contention, and more care needed to avoid deadlocks.
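The lock only protects you if the locking read and the dependent write happen inside the same transaction. A sketch, assuming a Prisma-style client that supports raw queries inside an interactive transaction:

async function claimJob(jobId: string) {
  await db.$transaction(async (tx) => {
    // Take an exclusive row lock; competing transactions block here
    // until this transaction commits or rolls back.
    const rows = await tx.$queryRaw<{ id: string; status: string }[]>`
      SELECT id, status FROM jobs WHERE id = ${jobId} FOR UPDATE
    `;

    if (!rows[0] || rows[0].status !== 'pending') {
      throw new Error('Job already claimed');
    }

    await tx.job.update({
      where: { id: jobId },
      data: { status: 'running' },
    });
  });
}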

4. Add idempotency for retried actions

Many race-condition bugs are really retry bugs.

Examples include client retries after timeout, worker redelivery, double-submit from the UI, or proxy replay.

In those cases, the right protection is often not a lock. It is making repeated attempts map to one logical action.

This is especially important for payments, orders, subscriptions, webhook handlers, and async commands.
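One common shape is an idempotency-key table guarded by a unique constraint: the key insert and the real write commit together, and a retry with the same key is turned into a read. A sketch, assuming a Prisma-style client, a hypothetical idempotencyKey model with a unique key column, and an order model that stores the key ('P2002' is Prisma's unique-violation error code):

type OrderInput = { productId: string; quantity: number }; // hypothetical shape

async function createOrderIdempotent(key: string, input: OrderInput) {
  try {
    // Key insert and order creation commit together; a retry with the
    // same key hits the unique constraint instead of creating a duplicate.
    const [, order] = await db.$transaction([
      db.idempotencyKey.create({ data: { key } }),
      db.order.create({ data: { ...input, idempotencyKey: key } }),
    ]);
    return order;
  } catch (err) {
    if ((err as { code?: string }).code === 'P2002') {
      // Duplicate attempt: return what the first attempt created.
      return db.order.findUniqueOrThrow({ where: { idempotencyKey: key } });
    }
    throw err;
  }
}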

5. Make background handlers replay-safe

If a job might run twice, the handler must survive being run twice.

That usually means deduplication keys, explicit state transitions, idempotent side effects, and safe claim semantics.
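A sketch of one way to get there, assuming the same Prisma-style client and a job model with a status column: a conditional state transition claims the work, so a replayed delivery that matches zero rows exits without repeating the side effect:

async function handleEmailJob(jobId: string) {
  // Claim by state transition: only one run can move pending -> sending.
  const claimed = await db.job.updateMany({
    where: { id: jobId, status: 'pending' },
    data: { status: 'sending' },
  });

  if (claimed.count === 0) {
    return; // already handled or in progress; the replay is a no-op
  }

  await sendEmail(jobId); // placeholder side effect
  await db.job.update({ where: { id: jobId }, data: { status: 'sent' } });
}

// placeholder for the real side effect
async function sendEmail(jobId: string): Promise<void> {}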

If your queue or workflow crosses a write-and-publish boundary, the outbox pattern is also relevant. See Transactional Outbox Pattern in Microservices.


Pick The Boundary Before You Pick The Tool

The most useful practical move is to name the exact boundary where double success becomes unacceptable.

Maybe the boundary is "two subscriptions must never be created for one billing cycle." Maybe it is "the same inventory unit must never be sold twice." Maybe it is "the same webhook event must never produce two side effects." Once that sentence is clear, the tool choice gets easier because you are no longer solving "concurrency" in the abstract.

This also makes review and testing better. Reviewers can ask whether the chosen mechanism really protects that boundary, and tests can assert the invariant directly instead of only asserting that one happy-path request succeeds.


How To Choose The Right Protection

Use the invariant to pick the tool:

  • if two actors must never both succeed, use a constraint or strict locking
  • if conflicts are acceptable but must be detected, use optimistic locking
  • if duplicates are caused by retries, add idempotency
  • if async handlers may repeat work, make the handler replay-safe
  • if correctness depends on durable event publication, add an outbox-style boundary

Useful questions include what must never happen twice, what state must remain unique, whether two actors can safely succeed at the same time, whether waiting is acceptable or one side should fail fast, and whether retries will happen even when the first attempt already succeeded.

Once those answers are clear, the protection is usually much easier to choose.


How To Test For Race Conditions

Race conditions are easy to miss because most test suites run too sequentially.

Useful testing patterns include sending concurrent requests against the real endpoint, running the same workflow many times in parallel, verifying database state after all requests finish, exercising duplicate-delivery or retry scenarios, and testing both success and conflict outcomes.

For example, if an endpoint creates an order, do not only assert that one request succeeds. Also assert that two concurrent requests representing the same logical action do not create two orders.
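A sketch of that kind of test, using a hypothetical api client and Jest-style test and expect helpers: fire the same logical request twice at once, then assert the invariant on the final database state rather than on either response alone:

test('duplicate submits create only one order', async () => {
  const key = 'order-test-123'; // same logical action for both requests

  // Fire both requests concurrently; either may win the race.
  const results = await Promise.allSettled([
    api.post('/orders', { idempotencyKey: key, productId: 'p1' }),
    api.post('/orders', { idempotencyKey: key, productId: 'p1' }),
  ]);

  // At least one attempt should succeed...
  expect(results.some((r) => r.status === 'fulfilled')).toBe(true);

  // ...but the invariant lives in the final state: exactly one order.
  const orders = await db.order.findMany({ where: { idempotencyKey: key } });
  expect(orders).toHaveLength(1);
});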

This is one reason API integration tests matter so much for correctness-sensitive systems. See How to Write API Integration Tests.


Warning Signs You Already Have One

Common production symptoms include duplicate records that "should be impossible," occasional oversells or double-bookings, counters that drift under load, jobs processed twice after retries, bugs that disappear when stepping through slowly, or incidents that happen only at higher concurrency.

If the failure is intermittent, load-sensitive, and hard to reproduce locally, race conditions should be high on the suspect list.


What To Capture During A Race-Condition Incident

When you suspect a race, timestamps and identifiers matter more than raw log volume.

Useful evidence usually includes request IDs, actor or worker IDs, the shared business key involved, the observed order of reads and writes, retry attempts, and the final persistent state after the overlap.

That evidence helps you answer the only question that really matters: which two actors both believed they were allowed to proceed, and what system boundary failed to stop them?


A Practical Debugging Checklist

When you suspect a race condition:

  1. identify the invariant that failed
  2. find the exact read-check-write or side-effect boundary involved
  3. identify which actor can overlap with it
  4. check whether correctness currently depends only on application logic
  5. verify whether the database enforces the invariant directly
  6. review retries, duplicate delivery, and background processing paths
  7. decide whether the fix is a constraint, lock, idempotency layer, or workflow redesign

This matters because race conditions are rarely solved by "being more careful" in code review. They are solved by making the system correct even when timing is unfavorable.


Final Thoughts

Race conditions are not weird edge cases in backend systems. They are a normal consequence of shared state, retries, concurrency, and distributed failure boundaries.

The goal is not to make everything happen in perfect order. The goal is to make correctness independent of lucky timing.