Webhook Idempotency and Retries in Production

Webhooks look simple until they meet production timing.

One service sends an event. Your endpoint receives it. You update local state and trigger side effects.

Then the sender retries the same event because your server responded too slowly, the network dropped the response, or their delivery system uses at-least-once semantics by default.

Now the same webhook may arrive twice, or three times, or out of order.

This is why webhook handling is not just an HTTP problem. It is a distributed-systems problem with real correctness risks:

  • duplicate writes
  • repeated side effects
  • stale updates
  • race conditions between deliveries
  • confusing production incidents that only appear under load

If you treat every webhook delivery as brand-new work, retries will eventually create incorrect state.

Why Webhook Retries Are Normal

Many teams talk about duplicate webhooks as if they are rare edge cases. They are not.

Webhook providers retry for sensible reasons:

  • your endpoint timed out
  • your server returned 5xx
  • the network failed after your app already processed the event
  • the provider intentionally guarantees at-least-once delivery
  • the provider could not confirm whether delivery succeeded

From the sender's perspective, retrying is the reliable thing to do. From your perspective, that means duplicates are expected behavior.

This is the same underlying failure mode that appears in APIs when clients retry uncertain writes. I covered that pattern in more depth in Idempotency Keys for Duplicate API Requests.


What Makes Webhooks Tricky Compared to Regular API Requests

With a normal API request, your own client usually knows the intent and can attach an idempotency key.

With webhooks, you are the receiver. You do not control:

  • the retry policy
  • the delivery timing
  • whether events arrive out of order
  • how long the sender waits before retrying
  • whether multiple deliveries race each other

That means your endpoint has to enforce correctness on its own.

The key mindset shift is this:

a webhook delivery is not proof that new work should happen

It is only evidence that an external system wants you to evaluate an event.

Your job is to determine:

  1. whether the event is authentic
  2. whether you have already processed it
  3. whether it is still relevant to current state
  4. whether downstream side effects can run safely

The Core Design Rule

Treat webhook processing as a pipeline:

  1. verify authenticity
  2. persist receipt
  3. deduplicate by event identity
  4. acknowledge quickly
  5. process asynchronously when possible
  6. make downstream side effects idempotent too

If any of those stages are weak, retries can still create duplicate work.


Step 1: Verify the Webhook Before You Trust It

The first job is not business logic. The first job is authentication.

Most serious webhook providers include a signature header derived from the raw request payload and a shared secret. Verify that signature before doing anything meaningful.

At minimum:

  • use the provider's signing mechanism exactly as documented
  • verify against the raw body, not a mutated JSON representation
  • reject old timestamps when replay windows matter
  • store enough metadata for incident debugging

If signature verification is wrong, two bad things happen:

  • forged events may be accepted
  • legitimate retries may look inconsistent because the payload was transformed before validation

This step protects security, but it also protects data quality. You should not deduplicate or process an event you cannot authenticate.


Step 2: Deduplicate Using a Stable Event Identifier

Most webhook providers send some event ID such as:

  • event_id
  • id
  • delivery_id
  • X-Request-Id

Use the provider's stable event identifier as your first deduplication key.

A common schema looks like this:

CREATE TABLE webhook_receipts (
  id BIGSERIAL PRIMARY KEY,
  provider TEXT NOT NULL,
  event_id TEXT NOT NULL,
  event_type TEXT NOT NULL,
  received_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  processing_status TEXT NOT NULL,
  payload JSONB NOT NULL,
  UNIQUE (provider, event_id)
);

That uniqueness constraint is the important part. Without an atomic uniqueness check, two concurrent deliveries can both decide they are the first one.

This is the same pattern that protects API writes from duplicate retries: correctness depends on a single atomic reservation point, not an application-level if statement.

If you want the deeper reasoning behind this concurrency model, see How to Prevent Race Conditions in Backend Systems.
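The atomic reservation against the receipts table above can be sketched with `INSERT ... ON CONFLICT DO NOTHING`. This assumes a node-postgres-style client whose `query(text, values)` resolves to an object with a `rowCount`; the `insertIfNew` name and receipt shape are illustrative.

```javascript
// Returns true exactly once per (provider, event_id), relying on the
// UNIQUE constraint rather than a check-then-insert race.
async function insertIfNew(db, receipt) {
  const result = await db.query(
    `INSERT INTO webhook_receipts
       (provider, event_id, event_type, processing_status, payload)
     VALUES ($1, $2, $3, $4, $5)
     ON CONFLICT (provider, event_id) DO NOTHING`,
    [receipt.provider, receipt.eventId, receipt.eventType,
     receipt.processingStatus, JSON.stringify(receipt.payload)],
  )
  // rowCount is 1 for the first delivery, 0 for any duplicate.
  return result.rowCount === 1
}
```

Because the database enforces uniqueness, two concurrent deliveries can both call this function and exactly one will see `true`.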


Step 3: Acknowledge Fast, Then Process Safely

One of the most common webhook mistakes is doing too much work inside the request handler.

Example failure pattern:

  1. webhook arrives
  2. endpoint verifies it
  3. endpoint performs multiple database writes
  4. endpoint calls other services
  5. endpoint sends email
  6. endpoint responds too slowly
  7. sender retries
  8. side effects happen twice

The safer pattern is usually:

  1. verify signature
  2. atomically record the event
  3. enqueue internal work
  4. return 2xx quickly

That reduces the chance that retries happen because your own processing path was slow.

It also isolates concerns:

  • the webhook receiver handles authenticity and deduplication
  • internal workers handle business processing

This model fits well with the same production discipline used for async jobs generally. If you want the broader version of that design, see Background Jobs in Production. If you need to trace one webhook delivery across receivers, queues, and workers later, Correlation IDs in Microservices is a useful complement.


Step 4: Make Business Processing Idempotent Too

Recording a webhook only solves ingress deduplication. It does not automatically make downstream work safe.

Suppose an event says:

invoice.paid

Your system might:

  • mark an invoice as paid
  • unlock account features
  • send a receipt email
  • publish an internal event

Even if the webhook event is stored once, retries or worker restarts can still cause repeated side effects unless the processing stage is also idempotent.

This is where teams often stop too early. They deduplicate the HTTP delivery but forget to deduplicate the state transition.

Safer examples:

  • update invoice state only if current status is not already paid
  • send email only if no receipt record exists for that invoice
  • publish internal events through an outbox rather than inline dual writes

If your processing step must both mutate the database and publish an event, Transactional Outbox Pattern in Microservices is the safer next layer.
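The first two safer examples above can be sketched as a state-guarded planner: decide which side effects to run from current state, so a replayed event becomes a no-op. The `invoice` shape and `receiptSent` flag are hypothetical names for this illustration.

```javascript
// Plan the side effects for an invoice.paid event, guarding each one
// independently so retries and worker restarts cannot repeat them.
function planInvoicePaid(invoice) {
  const actions = []
  // Transition only if not already paid: a second delivery of the
  // same event finds status === 'paid' and skips the write.
  if (invoice.status !== 'paid') {
    actions.push({ type: 'mark_paid', invoiceId: invoice.id })
  }
  // Guard the email separately; a crash between the two steps must
  // not cause a second receipt on reprocessing.
  if (!invoice.receiptSent) {
    actions.push({ type: 'send_receipt', invoiceId: invoice.id })
  }
  return actions
}
```

In a real system each guard would be enforced in the database (a conditional UPDATE, a unique receipt row), not just in memory; the point is that every side effect gets its own idempotence check.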


Step 5: Prepare for Out-of-Order Delivery

Not every webhook problem is a duplicate-delivery problem. Sometimes the issue is ordering.

Example:

  1. subscription.updated
  2. subscription.created

If those arrive out of order and your code assumes sequence, you may overwrite newer state with older state.

This is why event processing should not rely only on arrival order.

Safer options include:

  • compare event timestamps or provider version numbers
  • reject stale updates when current state is newer
  • fetch fresh canonical state from the provider for high-risk transitions
  • model state transitions explicitly instead of blindly overwriting rows

This matters especially when webhooks feed shared database rows under concurrency. If two deliveries can update the same row at nearly the same time, row-conflict handling still matters. In those cases, Optimistic vs Pessimistic Locking in SQL becomes relevant too.
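The stale-update rejection above can be sketched as a pure comparison, assuming the provider attaches a monotonically increasing version (or creation timestamp) to each event and you store the last applied version with the row. Both names are assumptions for the example.

```javascript
// Apply only strictly newer events: equal versions are duplicates,
// lower versions are late arrivals that would overwrite newer state.
function shouldApply(currentVersion, eventVersion) {
  return eventVersion > currentVersion
}

function applyEvent(row, event) {
  if (!shouldApply(row.version, event.version)) {
    return row // stale or duplicate: keep current state unchanged
  }
  return { ...row, state: event.state, version: event.version }
}
```

With this guard, receiving subscription.created after subscription.updated leaves the newer state in place instead of regressing it.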


Step 6: Distinguish Event Identity from Business Identity

Some providers retry the same event with the same event ID. Others may emit different event IDs that still refer to the same business action.

That means event-level deduplication is necessary but not always sufficient.

You may need both:

  • event identity deduplication: "have we seen this delivery before?"
  • business identity deduplication: "have we already applied this real-world action?"

Examples:

  • the same order should not be created twice
  • the same payment should not be captured twice
  • the same account should not be provisioned twice

In practice, business-level uniqueness often lives in your own domain tables:

  • unique external payment ID
  • unique provider subscription ID
  • unique (provider, external_object_id) constraint

This extra layer prevents mistakes when upstream systems emit semantically overlapping events.
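The business-identity layer can be sketched as a second uniqueness check keyed on the external object rather than the event. The in-memory set below stands in for a UNIQUE constraint on a domain table; `captureOnce` and the payment ID are illustrative names.

```javascript
// Business-level dedupe: even if two events carry different event IDs,
// they map to the same external payment, and the capture applies once.
function paymentCaptures() {
  const captured = new Set() // stand-in for a UNIQUE external_payment_id column
  return {
    captureOnce(externalPaymentId) {
      if (captured.has(externalPaymentId)) return false // already applied
      captured.add(externalPaymentId)
      return true
    },
  }
}
```

In production this check must live in the database so that all instances share it, for the same reason the event-level dedupe does.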


A Practical Webhook Handler Flow

A safe high-level flow might look like this:

export async function handleWebhook(req, res) {
  // Read the raw bytes: the signature must be verified against these,
  // never against a parsed-and-reserialized body.
  const rawBody = await readRawBody(req)
  const signature = req.headers['provider-signature']

  if (!verifySignature(rawBody, signature, process.env.WEBHOOK_SECRET)) {
    res.statusCode = 401
    res.end('invalid signature')
    return
  }

  const event = JSON.parse(rawBody)

  // Atomic reservation: only the first delivery of (provider, event_id)
  // wins, backed by the UNIQUE constraint on the receipts table.
  const inserted = await webhookReceipts.insertIfNew({
    provider: 'stripe',
    eventId: event.id,
    eventType: event.type,
    payload: event,
    processingStatus: 'received',
  })

  if (!inserted) {
    // Duplicate delivery: acknowledge so the provider stops retrying.
    res.statusCode = 200
    res.end('already processed')
    return
  }

  // Defer business processing to a worker; keep the request path fast.
  await jobs.enqueue('process-webhook-event', {
    provider: 'stripe',
    eventId: event.id,
  })

  res.statusCode = 200
  res.end('ok')
}

The key property is not the language or framework. The key property is that the deduplication write is atomic and happens before downstream work starts.


What Status Code Should You Return?

In general:

  • return 2xx only when the event is verified and safely recorded
  • return 4xx for permanently invalid requests such as bad signatures
  • return 5xx only when you want the provider to retry because processing could not be safely accepted

This is an important distinction.

If you already recorded the event durably, returning 5xx may just create noisy retries without improving correctness.

If you could not verify or persist the event safely, retrying may be appropriate.

The point is not to make retries disappear. The point is to make them harmless.
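The decision table above can be written as a pure function, which makes the retry contract explicit and easy to test. Here "recorded" means the receipt insert succeeded durably (or the event was a known duplicate); the function name is illustrative.

```javascript
// Map the handler's outcome to a status code that tells the provider
// whether to retry.
function webhookStatus({ verified, recorded }) {
  if (!verified) return 401 // permanently invalid: do not retry
  if (recorded) return 200  // safely stored (or duplicate): stop retrying
  return 500                // could not persist: ask the sender to retry
}
```

Keeping this logic in one place prevents the common drift where some code paths return 500 for events that were already durably recorded.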


How to Test Webhooks Without Fooling Yourself

Webhook logic often looks correct in unit tests and still fails in production.

That happens because the hard bugs usually depend on:

  • duplicate delivery
  • concurrent delivery
  • worker restarts
  • out-of-order events
  • partially completed side effects

Useful integration tests include:

  1. send the same event twice and assert one business effect
  2. send two concurrent deliveries with the same event ID
  3. simulate a worker crash after partial processing
  4. send newer and older events in reverse order
  5. verify signature rejection on altered payloads

This is exactly the kind of boundary where end-to-end behavior matters more than isolated unit logic. I covered the broader testing approach in How to Write API Integration Tests.
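Test 1 above ("send the same event twice, assert one business effect") can be sketched with in-memory fakes for the receipts store and job queue. The handler shape mirrors the flow in this article; `makeApp` and the fakes are illustrative scaffolding, not a real test framework.

```javascript
// A tiny harness: a handler with a fake receipts set and job list.
function makeApp() {
  const receipts = new Set()
  const jobs = []
  async function handle(event) {
    const key = `stripe:${event.id}`
    if (receipts.has(key)) return 200 // duplicate: acknowledge, no new work
    receipts.add(key)
    jobs.push({ name: 'process-webhook-event', eventId: event.id })
    return 200
  }
  return { handle, jobs }
}

async function testDuplicateDelivery() {
  const app = makeApp()
  const event = { id: 'evt_123', type: 'invoice.paid' }
  await app.handle(event)
  await app.handle(event) // simulated provider retry
  if (app.jobs.length !== 1) throw new Error('duplicate delivery created extra work')
  return app.jobs.length
}
```

The concurrent-delivery variant (test 2) replaces the two sequential calls with `Promise.all` and needs a store that is atomic under concurrency, which is exactly what the database constraint provides in production.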


Common Webhook Mistakes

These mistakes appear often in production systems:

Processing inside the request path

This increases timeout risk and invites retries.

Using no durable deduplication store

An in-memory cache is usually not enough if multiple instances can receive the same event.

Trusting arrival order

Event delivery order is often weaker than teams assume.

Making only the receiver idempotent

Downstream workers and side effects still need protection.

Using payload hashes instead of provider event IDs as the only dedupe key

Hashes can help, but provider identity is usually the cleaner first key when available.

Returning 200 before the event is safely recorded

If you acknowledge before durable receipt, you can lose events silently.


When You Need More Than Simple Deduplication

Some webhook flows are simple enough for a single receipts table and a worker queue.

Others need stronger coordination:

  • multiple event types update the same aggregate
  • processing fans out to several services
  • side effects must be published reliably
  • reprocessing and replay tooling is required

At that point, webhook handling becomes part of your broader event-processing architecture. You may need:

  • inbox/outbox tables
  • replay tools
  • explicit state machines
  • dead-letter handling
  • operational dashboards for stuck events

That is normal. As systems grow, webhook correctness stops being a tiny integration detail and becomes production infrastructure.


Final Principle

The most important webhook idea is simple:

retries are not bugs, but duplicate side effects are

Do not design for a world where each webhook arrives once. Design for the real one:

  • delivery may repeat
  • delivery may race
  • delivery may arrive late
  • delivery may arrive out of order

If your system verifies authenticity, deduplicates atomically, acknowledges quickly, and makes business processing idempotent, retries become routine instead of dangerous.

That is what "safe webhook handling" really means in production.