
Webhook Idempotency and Retries in Production
Webhooks look simple until they meet production timing.
One service sends an event. Your endpoint receives it. You update local state and trigger side effects.
Then the sender retries the same event because your server responded too slowly, the network dropped the response, or their delivery system uses at-least-once semantics by default.
Now the same webhook may arrive twice, or three times, or out of order.
This is why webhook handling is not just an HTTP problem. It is a distributed-systems problem with real correctness risks:
- duplicate writes
- repeated side effects
- stale updates
- race conditions between deliveries
- confusing production incidents that only appear under load
If you treat every webhook delivery as brand-new work, retries will eventually create incorrect state.
Why Webhook Retries Are Normal
Many teams talk about duplicate webhooks as if they are rare edge cases. They are not.
Webhook providers retry for sensible reasons:
- your endpoint timed out
- your server returned 5xx
- the network failed after your app already processed the event
- the provider intentionally guarantees at-least-once delivery
- the provider could not confirm whether delivery succeeded
From the sender's perspective, retrying is the reliable thing to do. From your perspective, that means duplicates are expected behavior.
This is the same underlying failure mode that appears in APIs when clients retry uncertain writes. I covered that pattern in more depth in Idempotency Keys for Duplicate API Requests.
What Makes Webhooks Tricky Compared to Regular API Requests
With a normal API request, your own client usually knows the intent and can attach an idempotency key.
With webhooks, you are the receiver. You do not control:
- the retry policy
- the delivery timing
- whether events arrive out of order
- how long the sender waits before retrying
- whether multiple deliveries race each other
That means your endpoint has to enforce correctness on its own.
The key mindset shift is this:
a webhook delivery is not proof that new work should happen
It is only evidence that an external system wants you to evaluate an event.
Your job is to determine:
- whether the event is authentic
- whether you have already processed it
- whether it is still relevant to current state
- whether downstream side effects can run safely
The Core Design Rule
Treat webhook processing as a pipeline:
- verify authenticity
- persist receipt
- deduplicate by event identity
- acknowledge quickly
- process asynchronously when possible
- make downstream side effects idempotent too
If any of those stages are weak, retries can still create duplicate work.
Step 1: Verify the Webhook Before You Trust It
The first job is not business logic. The first job is authentication.
Most serious webhook providers include a signature header derived from the raw request payload and a shared secret. Verify that signature before doing anything meaningful.
At minimum:
- use the provider's signing mechanism exactly as documented
- verify against the raw body, not a mutated JSON representation
- reject old timestamps when replay windows matter
- store enough metadata for incident debugging
If signature verification is wrong, two bad things happen:
- forged events may be accepted
- legitimate retries may look inconsistent because the payload was transformed before validation
This step protects security, but it also protects data quality. You should not deduplicate or process an event you cannot authenticate.
Step 2: Deduplicate Using a Stable Event Identifier
Most webhook providers send some event ID such as:
event_id, id, delivery_id, or X-Request-Id
Use the provider's stable event identifier as your first deduplication key.
A common schema looks like this:
CREATE TABLE webhook_receipts (
  id                BIGSERIAL PRIMARY KEY,
  provider          TEXT NOT NULL,
  event_id          TEXT NOT NULL,
  event_type        TEXT NOT NULL,
  received_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
  processing_status TEXT NOT NULL,
  payload           JSONB NOT NULL,
  UNIQUE (provider, event_id)
);
That uniqueness constraint is the important part. Without an atomic uniqueness check, two concurrent deliveries can both decide they are the first one.
This is the same pattern that protects API writes from duplicate retries: correctness depends on a single atomic reservation point, not an application-level if statement.
If you want the deeper reasoning behind this concurrency model, see How to Prevent Race Conditions in Backend Systems.
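A minimal insertIfNew built on that constraint might look like the following sketch. It assumes a node-postgres-style client where query() resolves to an object with a rowCount property; ON CONFLICT DO NOTHING lets the database itself perform the atomic reservation:

```javascript
// Atomic dedupe: the UNIQUE (provider, event_id) constraint decides who
// is first, not an application-level "if". Client shape is an assumption.
async function insertIfNew(db, receipt) {
  const result = await db.query(
    `INSERT INTO webhook_receipts
       (provider, event_id, event_type, payload, processing_status)
     VALUES ($1, $2, $3, $4, $5)
     ON CONFLICT (provider, event_id) DO NOTHING`,
    [receipt.provider, receipt.eventId, receipt.eventType,
     JSON.stringify(receipt.payload), 'received']
  );
  // rowCount is 1 for the first delivery, 0 for every duplicate.
  return result.rowCount === 1;
}
```

Two concurrent deliveries can both reach this code at the same time; only one insert succeeds, and the other sees rowCount 0.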
Step 3: Acknowledge Fast, Then Process Safely
One of the most common webhook mistakes is doing too much work inside the request handler.
Example failure pattern:
- webhook arrives
- endpoint verifies it
- endpoint performs multiple database writes
- endpoint calls other services
- endpoint sends email
- endpoint responds too slowly
- sender retries
- side effects happen twice
The safer pattern is usually:
- verify signature
- atomically record the event
- enqueue internal work
- return 2xx quickly
That reduces the chance that retries happen because your own processing path was slow.
It also isolates concerns:
- the webhook receiver handles authenticity and deduplication
- internal workers handle business processing
This model fits well with the same production discipline used for async jobs generally. If you want the broader version of that design, see Background Jobs in Production. If you need to trace one webhook delivery across receivers, queues, and workers later, Correlation IDs in Microservices is a useful complement.
Step 4: Make Business Processing Idempotent Too
Recording a webhook only solves ingress deduplication. It does not automatically make downstream work safe.
Suppose an event says:
invoice.paid
Your system might:
- mark an invoice as paid
- unlock account features
- send a receipt email
- publish an internal event
Even if the webhook event is stored once, retries or worker restarts can still cause repeated side effects unless the processing stage is also idempotent.
This is where teams often stop too early. They deduplicate the HTTP delivery but forget to deduplicate the state transition.
Safer examples:
- update invoice state only if current status is not already paid
- send email only if no receipt record exists for that invoice
- publish internal events through an outbox rather than inline dual writes
If your processing step must both mutate the database and publish an event, Transactional Outbox Pattern in Microservices is the safer next layer.
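The conditional invoice update can be sketched as a guarded UPDATE, where the WHERE clause makes re-running the worker safe. The table, column names, and node-postgres-style client are illustrative assumptions:

```javascript
// Idempotent state transition: the status guard means a retry or a
// restarted worker cannot apply the transition twice.
async function markInvoicePaid(db, invoiceId) {
  const result = await db.query(
    `UPDATE invoices
        SET status = 'paid', paid_at = now()
      WHERE id = $1
        AND status <> 'paid'`,
    [invoiceId]
  );
  // true only on the run that actually changed state, so side effects
  // like the receipt email can key off this flag.
  return result.rowCount === 1;
}
```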
Step 5: Prepare for Out-of-Order Delivery
Not every webhook problem is a duplicate-delivery problem. Sometimes the issue is ordering.
Example:
subscription.updated arriving before subscription.created
If those arrive out of order and your code assumes sequence, you may overwrite newer state with older state.
This is why event processing should not rely only on arrival order.
Safer options include:
- compare event timestamps or provider version numbers
- reject stale updates when current state is newer
- fetch fresh canonical state from the provider for high-risk transitions
- model state transitions explicitly instead of blindly overwriting rows
This matters especially when webhooks feed shared database rows under concurrency. If two deliveries can update the same row at nearly the same time, row-conflict handling still matters. In those cases, Optimistic vs Pessimistic Locking in SQL becomes relevant too.
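The stale-update rejection above can be a simple pure comparison against a provider version number; an event timestamp works the same way. The version field is an assumption about what your provider sends:

```javascript
// Keep whichever state is newer; a stale or duplicate event is ignored
// instead of overwriting fresher data.
function applyIfNewer(current, incoming) {
  if (current && current.version >= incoming.version) {
    return current; // incoming event is older (or a duplicate): drop it
  }
  return incoming;
}
```

Because the decision depends on versions rather than arrival order, the result is the same no matter which delivery lands first.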
Step 6: Distinguish Event Identity from Business Identity
Some providers retry the same event with the same event ID. Others may emit different event IDs that still refer to the same business action.
That means event-level deduplication is necessary but not always sufficient.
You may need both:
- event identity deduplication: "have we seen this delivery before?"
- business identity deduplication: "have we already applied this real-world action?"
Examples:
- the same order should not be created twice
- the same payment should not be captured twice
- the same account should not be provisioned twice
In practice, business-level uniqueness often lives in your own domain tables:
- unique external payment ID
- unique provider subscription ID
- unique (provider, external_object_id) constraint
This extra layer prevents mistakes when upstream systems emit semantically overlapping events.
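A sketch of business-identity deduplication under stated assumptions: PostgreSQL (where error code 23505 is unique_violation), a node-postgres-style client, and an illustrative payments table with a unique external_payment_id column:

```javascript
// Business-level dedupe: the unique index on the provider's payment id
// means a second attempt fails at the database, even if it arrived under
// a different webhook event id.
async function recordPaymentOnce(db, externalPaymentId, amountCents) {
  try {
    await db.query(
      `INSERT INTO payments (external_payment_id, amount_cents)
       VALUES ($1, $2)`,
      [externalPaymentId, amountCents]
    );
    return true; // first time this real-world payment was recorded
  } catch (err) {
    if (err.code === '23505') return false; // already applied: safe no-op
    throw err; // anything else is a real failure
  }
}
```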
A Practical Webhook Handler Flow
A safe high-level flow might look like this:
export async function handleWebhook(req, res) {
  const rawBody = await readRawBody(req)
  const signature = req.headers['provider-signature']

  if (!verifySignature(rawBody, signature, process.env.WEBHOOK_SECRET)) {
    res.statusCode = 401
    res.end('invalid signature')
    return
  }

  const event = JSON.parse(rawBody)

  const inserted = await webhookReceipts.insertIfNew({
    provider: 'stripe',
    eventId: event.id,
    eventType: event.type,
    payload: event,
    processingStatus: 'received',
  })

  if (!inserted) {
    res.statusCode = 200
    res.end('already processed')
    return
  }

  await jobs.enqueue('process-webhook-event', {
    provider: 'stripe',
    eventId: event.id,
  })

  res.statusCode = 200
  res.end('ok')
}
The key property is not the language or framework. The key property is that the deduplication write is atomic and happens before downstream work starts.
What Status Code Should You Return?
In general:
- return 2xx only when the event is verified and safely recorded
- return 4xx for permanently invalid requests such as bad signatures
- return 5xx only when you want the provider to retry because processing could not be safely accepted
This is an important distinction.
If you already recorded the event durably, returning 5xx may just create noisy retries without improving correctness.
If you could not verify or persist the event safely, retrying may be appropriate.
The point is not to make retries disappear. The point is to make them harmless.
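These rules can be made explicit as a small mapping from receiver outcomes to status codes. The outcome names are illustrative, not a standard:

```javascript
// Encode the retry contract in one place: 2xx stops retries,
// 4xx marks permanent rejection, 5xx asks the provider to retry.
function statusForOutcome(outcome) {
  switch (outcome) {
    case 'recorded':       // verified and durably stored
    case 'duplicate':      // already seen; nothing more to do
      return 200;
    case 'bad_signature':  // permanently invalid; retrying will not help
      return 401;
    case 'storage_failed': // could not persist safely; a retry may succeed
    default:
      return 500;
  }
}
```

Note that a duplicate maps to 200, not an error: from the provider's point of view, the delivery succeeded.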
How to Test Webhooks Without Fooling Yourself
Webhook logic often looks correct in unit tests and still fails in production.
That happens because the hard bugs usually depend on:
- duplicate delivery
- concurrent delivery
- worker restarts
- out-of-order events
- partially completed side effects
Useful integration tests include:
- send the same event twice and assert one business effect
- send two concurrent deliveries with the same event ID
- simulate a worker crash after partial processing
- send newer and older events in reverse order
- verify signature rejection on altered payloads
This is exactly the kind of boundary where end-to-end behavior matters more than isolated unit logic. I covered the broader testing approach in How to Write API Integration Tests.
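The first test in that list can be sketched as a self-contained simulation. Here an in-memory set stands in for the receipts table and a counter stands in for the business effect; in a real integration test those would be the actual database and the actual side effect:

```javascript
// Simulated receiver for a "same event twice, one effect" test.
function makeReceiver() {
  const receipts = new Set();
  let effects = 0;
  return {
    deliver(event) {
      const key = event.provider + ':' + event.id;
      if (receipts.has(key)) return 'duplicate';
      receipts.add(key);
      effects += 1; // stands in for the business side effect
      return 'processed';
    },
    effectCount: () => effects,
  };
}

// The test: deliver the identical event twice, assert exactly one effect.
const receiver = makeReceiver();
const evt = { provider: 'stripe', id: 'evt_1' };
receiver.deliver(evt);
receiver.deliver(evt);
```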
Common Webhook Mistakes
These mistakes appear often in production systems:
Processing inside the request path
This increases timeout risk and invites retries.
Using no durable deduplication store
An in-memory cache is usually not enough if multiple instances can receive the same event.
Trusting arrival order
Event delivery order is often weaker than teams assume.
Making only the receiver idempotent
Downstream workers and side effects still need protection.
Using payload hashes instead of provider event IDs as the only dedupe key
Hashes can help, but the provider's event ID is usually the cleaner first key when available.
Returning 200 before the event is safely recorded
If you acknowledge before durable receipt, you can lose events silently.
When You Need More Than Simple Deduplication
Some webhook flows are simple enough for a single receipts table and a worker queue.
Others need stronger coordination:
- multiple event types update the same aggregate
- processing fans out to several services
- side effects must be published reliably
- reprocessing and replay tooling is required
At that point, webhook handling becomes part of your broader event-processing architecture. You may need:
- inbox/outbox tables
- replay tools
- explicit state machines
- dead-letter handling
- operational dashboards for stuck events
That is normal. As systems grow, webhook correctness stops being a tiny integration detail and becomes production infrastructure.
Final Principle
The most important webhook idea is simple:
retries are not bugs, but duplicate side effects are
Do not design for a world where each webhook arrives once. Design for the real one:
- delivery may repeat
- delivery may race
- delivery may arrive late
- delivery may arrive out of order
If your system verifies authenticity, deduplicates atomically, acknowledges quickly, and makes business processing idempotent, retries become routine instead of dangerous.
That is what "safe webhook handling" really means in production.