Webhook Idempotency and Retries in Production

Webhook idempotency matters because production webhook delivery is not a clean "one event, one request, one side effect" story.

A provider can retry after a timeout, redeliver after a network failure, send the same event while your first handler is still running, or deliver events in an order your code did not expect.

The receiver has to make that normal behavior safe.

If a duplicate invoice.paid webhook sends two receipt emails, unlocks an account twice, publishes two internal events, or overwrites newer subscription state with older data, the bug is not that the provider retried. The bug is that the receiver treated delivery as proof that new work should happen.

This article is part of the API Correctness hub because webhook retries are the receiver-side version of the same duplicate-processing problem covered in API Idempotency Keys: Prevent Duplicate Requests Safely. The difference is control: with API idempotency keys, your client sends the key. With webhooks, the external provider controls delivery and the receiver must protect itself.


Why Webhook Retries Are Normal

Webhook providers retry because they cannot always know whether your system safely accepted an event.

Your handler might commit a database transaction and then time out before the response reaches the provider. The provider sees an uncertain delivery and tries again. From its side, retrying is the reliable choice.

Provider behavior also differs. Stripe documents automatic webhook retry behavior, duplicate event handling, and the fact that events are not guaranteed to arrive in creation order. Its docs recommend using event IDs and, in some cases, the object ID plus event type to detect duplicates. See Stripe's webhook documentation.

GitHub's webhook docs recommend responding with a 2xx status within 10 seconds and processing longer work asynchronously. GitHub also documents the X-GitHub-Delivery header as a stable delivery identifier for redelivery protection. See GitHub's webhook best practices.

Those details matter because "retry" is not one behavior. Some providers automatically retry. Some rely on manual redelivery. Some provide event IDs. Some provide delivery IDs. Some can emit two different events for the same business object.

The receiver design should not depend on the optimistic case.


The Failure Timeline

A duplicate side effect usually starts with a reasonable handler:

1. Provider sends invoice.paid
2. Handler verifies the signature
3. Handler marks invoice paid
4. Handler sends a receipt email
5. Handler publishes an internal invoice-paid event
6. Handler responds too slowly or the response is lost
7. Provider retries invoice.paid
8. Handler repeats the same side effects

The code may look correct in a local test because the local test sends one request and receives one response.

Production adds the missing states:

Production state                 | Why it breaks naive handlers
---------------------------------|----------------------------------------------------
response lost after local commit | provider retries work that already changed state
concurrent duplicate delivery    | both handlers pass an application-level check
worker crash after partial work  | replay can repeat side effects
older event arrives later        | stale data can overwrite newer local state
provider emits related events    | event-level dedupe may not dedupe business effects

Webhook correctness starts when the handler assumes these states will happen.


Use A Receiver Pipeline, Not One Big Handler

A production webhook receiver should usually be split into two parts:

  1. a fast HTTP receiver that verifies, records, deduplicates, enqueues, and acknowledges
  2. an internal processor that applies business changes with idempotent state transitions

The receiver answers a narrow question:

Is this authentic event safely recorded for internal processing?

The processor answers the business question:

Has this real-world effect already been applied, and is it still valid against current state?

Keeping those questions separate avoids the most common mistake: doing slow business work before the provider has received a clear acknowledgement.


Step 1: Verify The Raw Webhook First

Do not parse, trust, enqueue, or deduplicate a webhook before verifying that it came from the provider.

Most serious providers sign the raw request body with a shared secret. Stripe's signature docs are explicit that the raw body must be used for verification, because framework middleware can change whitespace, encoding, key order, or JSON structure before your code sees it. See Stripe's webhook signature documentation.

A safe receiver usually needs:

Requirement                | Why it matters
---------------------------|------------------------------------------------------
raw body access            | signature verification often depends on exact bytes
provider signature header  | rejects forged events
endpoint secret isolation  | avoids mixing test, CLI, staging, and production
timestamp or replay window | limits accepted old deliveries when provider allows
event type allowlist       | avoids processing noisy events the system does not use

The first durable record should be written only after the request has passed verification.
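As a rough illustration of verifying before parsing, here is a minimal sketch of a helper like verifyAndParseWebhook. It assumes a simplified scheme in which the signature header carries a hex-encoded HMAC-SHA256 of the raw body. That scheme, the extra secret parameter, and the result shape are assumptions for illustration; real providers use their own header formats (often with timestamps), so follow the provider's documentation or SDK.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical result shape: a discriminated union so callers must check `ok`.
type VerifiedEvent =
  | { ok: true; id: string; type: string; payload: Record<string, unknown> }
  | { ok: false; reason: string };

// Assumed scheme: signatureHex = hex(HMAC-SHA256(secret, rawBody)).
function verifyAndParseWebhook(
  rawBody: string,
  signatureHex: string | null,
  secret: string,
): VerifiedEvent {
  if (!signatureHex) return { ok: false, reason: "missing signature" };

  // Compute the MAC over the exact bytes received, not over re-serialized JSON.
  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(signatureHex, "hex");

  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (received.length !== expected.length || !timingSafeEqual(received, expected)) {
    return { ok: false, reason: "signature mismatch" };
  }

  // Parse only after the body has been authenticated.
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return { ok: false, reason: "invalid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, reason: "unexpected payload shape" };
  }
  const body = parsed as Record<string, unknown>;
  if (typeof body.id !== "string" || typeof body.type !== "string") {
    return { ok: false, reason: "missing id or type" };
  }
  return { ok: true, id: body.id, type: body.type, payload: body };
}
```

Even a single added space in the body changes the expected MAC, which is why the raw body has to reach the verifier untouched.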


Step 2: Store A Durable Receipt Before Side Effects

The receiver needs one durable reservation point for the provider event.

For many systems, that is a webhook_receipts table:

CREATE TABLE webhook_receipts (
  id BIGSERIAL PRIMARY KEY,
  provider TEXT NOT NULL,
  event_id TEXT NOT NULL,
  event_type TEXT NOT NULL,
  object_id TEXT,
  received_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  status TEXT NOT NULL CHECK (
    status IN ('received', 'processing', 'processed', 'retrying', 'failed')
  ),
  attempts INTEGER NOT NULL DEFAULT 0,
  next_attempt_at TIMESTAMPTZ,
  processed_at TIMESTAMPTZ,
  last_error TEXT,
  payload JSONB NOT NULL,
  UNIQUE (provider, event_id)
);

The UNIQUE (provider, event_id) constraint is the correctness boundary.

An application-level check like "look up the event, then insert if missing" is not enough under concurrency. Two duplicate deliveries can both observe "not found" before either inserts. The database constraint makes the reservation atomic.
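The race is easy to reproduce without a database. The sketch below uses a hypothetical in-memory ReceiptStore with simulated latency: the check-then-insert path lets both duplicate deliveries believe they were first, while an atomic insert-if-absent (the behavior a unique constraint provides) admits only one. All names here are illustrative.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// In-memory stand-in for a receipts table, keyed by "provider:eventId".
class ReceiptStore {
  private rows = new Set<string>();

  // Simulates a SELECT that takes a few milliseconds.
  async exists(key: string): Promise<boolean> {
    await sleep(5);
    return this.rows.has(key);
  }

  // Simulates a plain INSERT with no uniqueness protection.
  async insertUnchecked(key: string): Promise<void> {
    await sleep(5);
    this.rows.add(key);
  }

  // Simulates an atomic insert-if-absent: check and write happen in one step.
  async insertIfAbsent(key: string): Promise<boolean> {
    await sleep(5);
    if (this.rows.has(key)) return false;
    this.rows.add(key);
    return true;
  }
}

// Application-level dedupe: returns true when the caller thinks it was first.
async function checkThenInsert(store: ReceiptStore, key: string): Promise<boolean> {
  if (await store.exists(key)) return false;
  await store.insertUnchecked(key);
  return true;
}

async function demo() {
  const key = "stripe:evt_123";

  const racy = new ReceiptStore();
  const racyWinners = await Promise.all([
    checkThenInsert(racy, key),
    checkThenInsert(racy, key),
  ]);

  const atomic = new ReceiptStore();
  const atomicWinners = await Promise.all([
    atomic.insertIfAbsent(key),
    atomic.insertIfAbsent(key),
  ]);

  return { racyWinners, atomicWinners };
}
```

In this simulation both racy callers observe "not found" before either writes, so both proceed as the first delivery. The atomic path has exactly one winner.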

In PostgreSQL, INSERT ... ON CONFLICT is the usual tool for this shape. The PostgreSQL docs describe ON CONFLICT DO NOTHING and ON CONFLICT DO UPDATE as conflict-handling options for unique or exclusion constraint violations. See the PostgreSQL INSERT documentation.

The insert should be boring and decisive:

INSERT INTO webhook_receipts (
  provider,
  event_id,
  event_type,
  object_id,
  status,
  payload
)
VALUES (
  $1,
  $2,
  $3,
  $4,
  'received',
  $5
)
ON CONFLICT (provider, event_id) DO NOTHING
RETURNING id;

If the insert returns an ID, this request is the first accepted delivery of the event.

If it returns nothing, the event was already recorded. That duplicate should usually receive a 2xx response, assuming the original record is handled by your internal processing and retry system.

For the broader race-condition reasoning behind this pattern, see How to Prevent Race Conditions in Backend Systems.


Step 3: Acknowledge After Receipt, Not After All Work

The provider does not need your full business workflow to finish before it receives an acknowledgement.

It needs to know whether the event was accepted.

A practical HTTP handler often looks like this:

webhook-handler.ts
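export async function handleWebhook(req: Request) {
  // Read the exact bytes the provider signed; do not parse JSON first.
  const rawBody = await req.text()
  const signature = req.headers.get('provider-signature')

  // Verify the signature before trusting or storing anything.
  const event = verifyAndParseWebhook(rawBody, signature)

  if (!event.ok) {
    return new Response('invalid signature', { status: 401 })
  }

  // Reserve the event atomically; the unique constraint settles concurrent duplicates.
  const receipt = await db.oneOrNone(
    `
      INSERT INTO webhook_receipts (
        provider,
        event_id,
        event_type,
        object_id,
        status,
        payload
      )
      VALUES ($1, $2, $3, $4, 'received', $5)
      ON CONFLICT (provider, event_id) DO NOTHING
      RETURNING id
    `,
    [
      event.provider,
      event.id,
      event.type,
      event.objectId,
      event.payload,
    ]
  )

  // No row returned means an earlier delivery already owns this event.
  if (!receipt) {
    return new Response('already accepted', { status: 200 })
  }

  // Hand off to internal processing. If this enqueue fails, the receipt is
  // already stored, so a provider retry will be treated as a duplicate.
  // Something internal must pick up receipts still in 'received'.
  await webhookQueue.enqueue({
    receiptId: receipt.id,
    provider: event.provider,
    eventId: event.id,
  })

  return new Response('accepted', { status: 202 })
}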

This handler does not send emails, unlock features, call other services, or publish business events inline.

It verifies the request, stores the event once, schedules internal work, and returns quickly.

That is also why the internal queue or job system matters. If you return 2xx after writing the receipt, your own system now owns processing. The event must be retried internally when workers crash, dependencies fail, or business logic throws.
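One gap worth covering: if the process dies after the receipt is stored but before the job is enqueued, the provider's retry will be answered as a duplicate and nothing will process the event. A periodic sweep over the receipts is one way to close that gap. The sketch below is a hypothetical selection function over in-memory rows; the grace period and field names are assumptions, and a real version would be a query against webhook_receipts.

```typescript
type ReceiptStatus = "received" | "processing" | "processed" | "retrying" | "failed";

type ReceiptRow = {
  id: number;
  status: ReceiptStatus;
  receivedAt: Date;
  nextAttemptAt: Date | null;
};

// Returns receipts that should be enqueued again:
// - 'received' rows no worker picked up within the grace period
// - 'retrying' rows whose scheduled attempt time has arrived
function dueReceipts(rows: ReceiptRow[], now: Date, pickupGraceMs = 60_000): ReceiptRow[] {
  const nowMs = now.getTime();
  return rows.filter((row) => {
    if (row.status === "received") {
      return nowMs - row.receivedAt.getTime() > pickupGraceMs;
    }
    if (row.status === "retrying") {
      return row.nextAttemptAt !== null && row.nextAttemptAt.getTime() <= nowMs;
    }
    return false;
  });
}
```

Because the worker is expected to be idempotent, enqueueing a receipt twice by accident is safe; failing to enqueue it at all is not.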

For the wider operational model, see Background Jobs in Production.


Step 4: Make The Worker Idempotent Too

The receipt table prevents the same provider event from being accepted twice.

It does not automatically make business processing safe.

Suppose the event is invoice.paid. The worker might:

  • mark the invoice paid
  • unlock account features
  • send a receipt email
  • publish an internal event
  • update analytics

Any of those steps can be repeated by an internal retry unless the worker is also idempotent.

The business transition should be conditional:

UPDATE invoices
SET
  status = 'paid',
  paid_at = COALESCE(paid_at, $3)
WHERE provider = $1
  AND provider_invoice_id = $2
  AND status <> 'paid'
RETURNING id;

The receipt email should also have a durable "send once" boundary:

CREATE TABLE invoice_receipt_emails (
  id BIGSERIAL PRIMARY KEY,
  invoice_id BIGINT NOT NULL REFERENCES invoices (id),
  provider_event_id TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  sent_at TIMESTAMPTZ,
  UNIQUE (invoice_id)
);

Now a retried worker can try to create the email record again without sending a second receipt.
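To make the worker-side idea concrete, here is a minimal in-memory sketch of the two boundaries above: a conditional state transition and a send-once record keyed by invoice. The maps stand in for the invoices and invoice_receipt_emails tables, and every function name is hypothetical.

```typescript
type Invoice = { id: number; status: "open" | "paid"; paidAt: Date | null };

const invoices = new Map<number, Invoice>([[1, { id: 1, status: "open", paidAt: null }]]);
// Keyed by invoice ID, mirroring UNIQUE (invoice_id).
const receiptEmails = new Map<number, { sentAt: Date | null }>();
const sentEmails: string[] = [];

// Mirrors the conditional UPDATE: only an unpaid invoice transitions.
function markPaid(invoiceId: number, paidAt: Date): boolean {
  const invoice = invoices.get(invoiceId);
  if (!invoice || invoice.status === "paid") return false;
  invoice.status = "paid";
  invoice.paidAt = invoice.paidAt ?? paidAt;
  return true;
}

// Reserves the effect row, then sends only if no send has been recorded.
function sendReceiptOnce(invoiceId: number): boolean {
  let row = receiptEmails.get(invoiceId);
  if (!row) {
    row = { sentAt: null };
    receiptEmails.set(invoiceId, row);
  }
  if (row.sentAt !== null) return false;
  sentEmails.push(`receipt for invoice ${invoiceId}`);
  row.sentAt = new Date();
  return true;
}

// The email step does not depend on markPaid returning true, so a retry after
// a crash between the two steps still sends the receipt exactly once here.
function processInvoicePaid(invoiceId: number, paidAt: Date): void {
  markPaid(invoiceId, paidAt);
  sendReceiptOnce(invoiceId);
}
```

With a real mail service, a crash between sending and recording sent_at can still produce a second email on retry. Passing the effect key to the mail provider as its own idempotency key, where supported, is one way to narrow that window.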

If the worker must both update local state and publish a message, do not hide that dual-write risk inside the webhook handler. Use a durable outbox row created in the same database transaction as the business update, then let an outbox relay publish it. That design is covered in Transactional Outbox Pattern in Microservices.


Step 5: Separate Event Identity From Business Identity

Event-level deduplication answers this question:

Have we accepted this provider event ID before?

Business-level idempotency answers a different question:

Have we already applied this real-world effect?

You often need both.

Identity                     | Example                      | What it protects
-----------------------------|------------------------------|-----------------------------------------------
provider event ID            | evt_123                      | duplicate delivery of the same event
provider delivery ID         | X-GitHub-Delivery value      | redelivery of the same delivery record
provider object ID plus type | invoice_123 and invoice.paid | separate events that describe the same effect
local domain ID              | internal invoice.id          | repeated local processing and worker retries
effect ID                    | receipt-email:invoice_123    | repeated side effects

Stripe's docs explicitly call out this distinction: logging event IDs handles duplicated event objects, but some duplicates require combining the object ID in data.object with the event.type.

That is why a single webhook_receipts table is necessary but not always sufficient.

For payment events, the domain table might enforce uniqueness on provider payment IDs. For subscription events, state transitions might compare provider timestamps or versions. For emails, a unique effect key might be the safest boundary.

The durable rule should live where the side effect happens.
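One way to keep these identities apart in code is to derive each key explicitly from the event. The sketch below is illustrative only: the ProviderEvent shape and the key formats are assumptions, not a provider API.

```typescript
type ProviderEvent = { provider: string; id: string; type: string; objectId: string };

type DedupeKeys = { eventKey: string; businessKey: string; effectKey: string };

// Derives one key per layer so each uniqueness rule can live where it is enforced.
function dedupeKeys(event: ProviderEvent, effect: string): DedupeKeys {
  return {
    // Same provider event delivered again.
    eventKey: `${event.provider}:${event.id}`,
    // Different events that describe the same change to the same object.
    businessKey: `${event.provider}:${event.type}:${event.objectId}`,
    // The concrete side effect, independent of which event triggered it.
    effectKey: `${effect}:${event.objectId}`,
  };
}

// Two distinct events about the same paid invoice.
const first = dedupeKeys(
  { provider: "stripe", id: "evt_123", type: "invoice.paid", objectId: "in_456" },
  "receipt-email",
);
const second = dedupeKeys(
  { provider: "stripe", id: "evt_124", type: "invoice.paid", objectId: "in_456" },
  "receipt-email",
);
```

Here the two event keys differ, so event-level dedupe would accept both, while the business and effect keys match and can stop the second receipt email.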


Step 6: Do Not Trust Arrival Order

Some webhook handlers fail because they assume the newest request is the newest fact.

That is not guaranteed.

Stripe documents that events are not guaranteed to arrive in the order they were generated. Other providers have similar caveats or leave ordering unspecified across retries, partitions, and manual redelivery.

The receiver should treat arrival time as evidence, not truth.

Event scenario                     | Unsafe behavior                   | Safer behavior
-----------------------------------|-----------------------------------|----------------------------------------------------
subscription.updated before create | fail because local row is missing | fetch canonical state or create a pending record
older status after newer status    | overwrite current state blindly   | compare provider timestamp, version, or transition
invoice.paid before local invoice  | discard the event                 | store receipt and retry after local sync completes
manual redelivery after recovery   | repeat side effects               | reuse receipt and business-effect uniqueness

For high-risk transitions, it can be safer to fetch the canonical object from the provider before applying local state.

That adds an API call, latency, and dependency on provider availability. It is not free. But for account access, billing status, entitlement changes, and compliance-sensitive workflows, confirming current provider state can be safer than trusting an isolated event that arrived late.
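When fetching canonical state is too expensive, a stale-event guard is a cheaper partial defense. The sketch below assumes the provider supplies a comparable timestamp or version for the object (called providerUpdatedAt here, a hypothetical field); not every provider does, and equal timestamps remain ambiguous.

```typescript
type SubscriptionSnapshot = {
  status: string;
  providerUpdatedAt: number; // assumed: epoch seconds supplied by the provider
};

// Applies the incoming snapshot only if it is strictly newer than local state.
function applyIfNewer(
  current: SubscriptionSnapshot | undefined,
  incoming: SubscriptionSnapshot,
): { state: SubscriptionSnapshot; applied: boolean } {
  if (current !== undefined && incoming.providerUpdatedAt <= current.providerUpdatedAt) {
    // Stale or same-age event: keep current state, and count the rejection.
    return { state: current, applied: false };
  }
  return { state: incoming, applied: true };
}

// A newer "canceled" arrives first, then an older "active" arrives late.
const afterNewer = applyIfNewer(undefined, { status: "canceled", providerUpdatedAt: 200 });
const afterOlder = applyIfNewer(afterNewer.state, { status: "active", providerUpdatedAt: 100 });
```

In SQL the same guard is usually a WHERE clause on the update, so the comparison and the write happen atomically.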


Step 7: Decide Which Status Codes Mean Retry

Webhook status codes are part of the retry contract.

Use them deliberately:

Response   | Use when
-----------|-------------------------------------------------------------
2xx        | event is verified and durably accepted, or known duplicate
400 or 401 | signature, payload, or authorization is permanently invalid
404        | endpoint path is wrong or no longer valid
409        | rarely useful for providers unless documented
5xx        | receiver cannot safely verify, record, or accept the event

The dangerous response is 200 before the event is durably recorded.

If the process crashes after returning success but before writing the receipt, the provider may stop retrying and your system has no event to process.

The other dangerous response is 5xx after the event is already recorded. That can create noisy provider retries without adding safety, especially if your internal worker retry loop already owns processing.

The receiver should be honest:

  • "I could not safely accept this" means return a retryable failure.
  • "I accepted this and will process it internally" means return success.
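Making that mapping explicit in one place keeps handlers from improvising status codes. A small sketch follows; the outcome names are hypothetical, and the specific codes are one reasonable reading of the contract described above, so check what your provider treats as retryable.

```typescript
type ReceiverOutcome =
  | "accepted"            // verified and durably recorded for the first time
  | "duplicate"           // receipt already exists
  | "invalid_signature"   // verification failed
  | "malformed_payload"   // authentic but unusable body
  | "storage_unavailable"; // could not durably record the event

// Maps each receiver outcome to the HTTP status returned to the provider.
function statusFor(outcome: ReceiverOutcome): number {
  switch (outcome) {
    case "accepted":
      return 202;
    case "duplicate":
      return 200;
    case "invalid_signature":
      return 401;
    case "malformed_payload":
      return 400;
    case "storage_unavailable":
      return 503; // ask the provider to retry; nothing was recorded
  }
}
```

The union type makes the switch exhaustive, so adding a new outcome without deciding its status code becomes a compile error under strict settings.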

Step 8: Build Replay And Dead-Letter Handling

Webhook processing needs a way to recover from bad payloads, temporary dependency failures, and code bugs.

The receipt table should support internal retry and replay:

UPDATE webhook_receipts
SET
  status = 'retrying',
  attempts = attempts + 1,
  next_attempt_at = now() + interval '5 minutes',
  last_error = $2
WHERE id = $1;

A minimal worker lifecycle might be:

received -> processing -> processed
received -> processing -> retrying -> processing
received -> processing -> failed
failed   -> received   -> processing   (manual replay)

Do not rely only on provider redelivery after the receipt has been accepted.

Once your receiver returns 2xx, your system needs its own operational tools:

Tool                  | Why it matters
----------------------|-----------------------------------------------
retry schedule        | handles temporary downstream failures
max attempts          | stops poison events from looping forever
dead-letter state     | preserves failed payloads for investigation
manual replay         | lets fixed code reprocess known failed events
processing lag metric | shows accepted events stuck behind workers
duplicate counter     | reveals provider retries or network instability

Without this layer, "ack fast" becomes "drop work faster."
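The retry schedule and max-attempts rule can be sketched as a single decision function. The fixed five-minute interval shown earlier is one option; the version below swaps in capped exponential backoff, a common alternative. The constants and names are assumptions, and jitter is omitted to keep the example deterministic.

```typescript
type RetryDecision =
  | { status: "retrying"; nextAttemptAt: Date }
  | { status: "failed" };

const MAX_ATTEMPTS = 8;          // assumed limit before dead-lettering
const BASE_DELAY_MS = 60_000;    // first retry after one minute
const MAX_DELAY_MS = 3_600_000;  // never wait longer than one hour

// `attempts` counts attempts made so far, including the one that just failed.
function nextRetry(attempts: number, now: Date): RetryDecision {
  if (attempts >= MAX_ATTEMPTS) {
    // Poison event: stop looping and keep the payload for investigation.
    return { status: "failed" };
  }
  const delayMs = Math.min(BASE_DELAY_MS * 2 ** (attempts - 1), MAX_DELAY_MS);
  return { status: "retrying", nextAttemptAt: new Date(now.getTime() + delayMs) };
}
```

The result maps directly onto the status and next_attempt_at columns of the receipt row. In practice a small random jitter is often added to the delay so that many failed receipts do not retry at the same instant.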


Step 9: Test The States That Production Adds

Webhook tests should not only check the happy path.

They should prove that duplicate delivery, concurrency, partial failure, and ordering do not create duplicate effects.

Useful tests include:

Test                                  | What it proves
--------------------------------------|---------------------------------------------------------
same event sent twice                 | one receipt and one business effect
same event sent concurrently          | unique constraint handles the race
worker crashes after state update     | retry does not repeat side effects
older event arrives after newer event | local state is not downgraded
invalid signature with valid JSON body | unauthenticated payload is rejected before work
receipt exists but worker failed      | provider duplicate returns 2xx, internal retry continues
manual replay of failed receipt       | replay is deliberate and observable

The test should assert durable state, not only HTTP status.

For example, after sending a duplicate payment event, assert that there is one receipt row, one paid invoice transition, one receipt-email record, and one outbox event.

That is the difference between testing the route and testing the correctness boundary.
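As a sketch of what asserting durable state can look like, the scenario below runs against a hypothetical in-memory system: the first worker attempt crashes after the state update, the internal retry completes, and the provider then delivers a duplicate. A real test would query the database tables instead of these sets.

```typescript
// Hypothetical in-memory system under test.
function createSystem() {
  const receipts = new Set<string>();     // like UNIQUE (provider, event_id)
  const transitions: string[] = [];       // recorded paid transitions
  const emailRecords = new Set<string>(); // like UNIQUE (invoice_id)
  const outbox = new Set<string>();       // one row per effect key
  let paid = false;

  // Receiver: true only for the first delivery of an event ID.
  function receive(eventId: string): boolean {
    if (receipts.has(eventId)) return false;
    receipts.add(eventId);
    return true;
  }

  // Worker: every step is conditional, so it can rerun after a crash.
  function work(invoiceId: string, crashAfterUpdate = false): void {
    if (!paid) {
      paid = true;
      transitions.push(invoiceId);
    }
    if (crashAfterUpdate) throw new Error("simulated crash after state update");
    emailRecords.add(invoiceId);
    outbox.add(`invoice-paid:${invoiceId}`);
  }

  const counts = () => ({
    receipts: receipts.size,
    transitions: transitions.length,
    emailRecords: emailRecords.size,
    outbox: outbox.size,
  });

  return { receive, work, counts };
}

function runScenario() {
  const system = createSystem();
  const firstAccepted = system.receive("evt_123");
  try {
    system.work("in_456", true); // first attempt crashes mid-way
  } catch {
    // the internal retry loop would reschedule the receipt here
  }
  system.work("in_456");                                // internal retry completes
  const duplicateAccepted = system.receive("evt_123");  // provider duplicate
  system.work("in_456");                                // accidental reprocessing is harmless
  return { firstAccepted, duplicateAccepted, counts: system.counts() };
}
```

The assertions that matter are the counts: one receipt, one transition, one email record, and one outbox row, regardless of how many times the receiver and worker ran.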


Operational Signals To Watch

Webhook systems usually fail quietly before they fail loudly.

The receiver might still return 202 while workers are stuck, duplicates are rising, or stale events are being ignored.

Useful metrics include:

Metric                       | Why it matters
-----------------------------|-------------------------------------------------
received events by provider  | traffic shape and provider-side changes
duplicate event count        | retry spikes, network issues, or slow receiver
signature failure count      | bad secrets, attacks, or provider config issues
accept latency               | risk of provider timeout
processing lag               | workers falling behind accepted events
retrying receipt count       | dependency or handler instability
failed receipt count         | poison payloads or unsupported event versions
stale event rejection count  | ordering and replay behavior
side-effect conflict count   | business idempotency doing real work

Log the identifiers engineers need during an incident:

provider=stripe
event_id=evt_123
event_type=invoice.paid
object_id=in_456
receipt_id=99182
status=processed
correlation_id=checkout-abc

Do not put secrets or full sensitive payloads into logs. Store full payloads only where access control, retention, and privacy rules are intentional.
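One simple way to enforce that rule is to build log lines from an explicit list of fields, so the payload never reaches the logger by accident. A minimal sketch follows, with a hypothetical receipt shape and function name:

```typescript
type LoggableReceipt = {
  id: number;
  provider: string;
  eventId: string;
  eventType: string;
  objectId: string | null;
  status: string;
  payload: unknown; // deliberately never logged
};

// Formats only the allowlisted identifiers as key=value pairs.
function receiptLogLine(receipt: LoggableReceipt, correlationId?: string): string {
  const fields: Array<[string, string | number | null | undefined]> = [
    ["provider", receipt.provider],
    ["event_id", receipt.eventId],
    ["event_type", receipt.eventType],
    ["object_id", receipt.objectId],
    ["receipt_id", receipt.id],
    ["status", receipt.status],
    ["correlation_id", correlationId],
  ];
  return fields
    .filter(([, value]) => value !== null && value !== undefined)
    .map(([key, value]) => `${key}=${value}`)
    .join(" ");
}

const line = receiptLogLine(
  {
    id: 99182,
    provider: "stripe",
    eventId: "evt_123",
    eventType: "invoice.paid",
    objectId: "in_456",
    status: "processed",
    payload: { customer_email: "person@example.com" },
  },
  "checkout-abc",
);
```

An allowlist fails safe: a new sensitive field added to the payload later stays out of the logs unless someone deliberately adds it.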


A Practical Checklist

Before trusting a webhook receiver in production, check:

  • signatures are verified against the raw body before parsing or storing
  • accepted events are recorded in a durable table
  • provider event IDs are protected by a database uniqueness constraint
  • duplicates receive 2xx only after the original event is safely owned internally
  • provider acknowledgement happens before slow business work
  • internal workers can retry accepted events without provider help
  • business state transitions are conditional and replay-safe
  • repeated side effects have their own unique effect keys
  • out-of-order events cannot blindly overwrite newer state
  • failed events can be inspected, retried, and replayed deliberately
  • tests cover duplicate, concurrent, partial-failure, and stale-event cases
  • metrics distinguish provider delivery, receiver acceptance, and business processing

The important idea is not that every webhook system needs a huge framework.

The important idea is that each boundary has a clear owner:

provider retry -> receiver receipt -> internal processing -> business effect

Each handoff needs its own idempotency rule.


Final Takeaway

Webhook retries are not bugs.

Duplicate side effects are bugs.

A reliable webhook receiver verifies the raw request, records the event once, acknowledges only after durable acceptance, processes business changes asynchronously, and makes every meaningful side effect replay-safe.

Once that design is in place, provider retries become ordinary delivery behavior instead of a production incident waiting for the next timeout.