
Background Jobs in Production
Background jobs in production are not just request handlers that run later.
Moving work to a queue can reduce user-facing latency, absorb bursts, and isolate slow dependencies. It also changes the failure model. A job can run twice. It can run late. It can partially succeed. It can be retried against state that changed after the original request. It can sit in a dead-letter queue while the main dashboard looks green.
The safe design question is not:
Did the job run?
It is:
Did the business workflow reach the correct durable state, even if the job was delayed, retried, duplicated, or replayed?
For the broader reliability cluster around retries, queues, cascading failures, and recovery patterns, see the Backend Reliability hub.
What Changes When Work Leaves The Request Path
Background processing usually starts with a reasonable move:
- send receipt emails after the response
- generate invoices asynchronously
- resize uploaded images in workers
- process webhook events outside the provider request
- publish analytics and search-index updates after commit
- run billing, reconciliation, or export jobs on a schedule
That lowers request latency. It does not remove correctness work.
Once work leaves the request path, production adds new questions:
- What if the job runs twice?
- What if it runs after the user changed the underlying record?
- What if the side effect succeeds but the worker crashes before recording completion?
- What if retries hammer a dependency that is already degraded?
- What if the queue is empty, but the workflow is still wrong?
- What if old jobs execute after a deploy changed the payload shape?
This is why background jobs are a backend reliability topic, not just an infrastructure convenience.
Assume At-Least-Once Delivery
Most production job systems prefer redelivery over silent loss.
Amazon SQS documents this explicitly for standard queues: messages can be delivered more than once, and applications should be idempotent when processing repeated messages. See the AWS documentation on SQS at-least-once delivery.
That model is usually the right trade-off. Losing work is often worse than repeating work.
But at-least-once delivery does not mean exactly-once side effects.
It does not guarantee:
- one email sent
- one external charge
- one invoice generated
- one webhook effect applied
- one search-index update
- one state transition
It means the worker may see the message again.
So the handler must be safe when it does.
A Concrete Failure: Duplicate Receipt Emails
Suppose a worker sends a receipt email after payment.
The handler looks straightforward:
    export async function handleReceiptJob(job: ReceiptJob) {
      const payment = await db.payment.findUnique({
        where: { id: job.paymentId },
      })
      if (!payment) {
        throw new Error('Payment not found')
      }
      await emailClient.send({
        to: payment.customerEmail,
        template: 'receipt',
        data: { paymentId: payment.id },
      })
      await db.payment.update({
        where: { id: payment.id },
        data: { receiptSentAt: new Date() },
      })
    }
This code is reasonable when read sequentially.
It fails when the side effect succeeds and local completion does not.
10:14:02 payment succeeds
10:14:03 API enqueues send-receipt(payment_123)
10:14:04 worker A receives the job
10:14:05 email provider accepts the message
10:14:05 worker A crashes before writing receiptSentAt
10:14:35 job becomes visible again
10:14:36 worker B receives the same job
10:14:37 worker B sends the receipt again
From the queue's point of view, this is recovery. The message was not acknowledged, so it came back.
From the customer's point of view, the workflow is wrong.
The bug is not "the queue duplicated the job." The queue did what it was allowed to do. The bug is that the handler treated a retried command as a one-time action.
Design Jobs As Replayable Commands
A production-safe job should be designed as a replayable command.
That means the handler can answer:
- What business key identifies this logical work?
- Has the effect already happened?
- Is the previous attempt still in progress?
- If the previous attempt partially succeeded, what can the next attempt safely do?
- Which failures should retry?
- Which failures should stop and require review?
For the receipt job, the queue message ID is not the best business key. The business key is closer to:
payment_123 + receipt_email
That key can be stored in a delivery table:
    CREATE TABLE email_deliveries (
      payment_id uuid NOT NULL,
      email_type text NOT NULL,
      status text NOT NULL,
      provider_message_id text,
      attempts integer NOT NULL DEFAULT 0,
      last_error text,
      sent_at timestamptz,
      updated_at timestamptz NOT NULL DEFAULT now(),
      PRIMARY KEY (payment_id, email_type)
    );
Now the handler has durable memory outside the queue:
- payment_id and email_type define the logical effect
- status explains whether the work is pending, sent, failed, or under review
- attempts supports retry policy and triage
- provider_message_id supports reconciliation with the email provider
The queue is the delivery mechanism. The delivery table is the business-state memory.
That distinction is what keeps retries from becoming duplicate side effects.
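As a rough illustration of the business key acting as durable memory, the sketch below uses an in-memory map as a stand-in for the email_deliveries table. The DeliveryTable class and its ensure method are hypothetical names chosen for this example; in a real system the insert-if-absent step would be enforced by the table's primary key rather than by application code.

```typescript
type DeliveryStatus =
  | "pending"
  | "sending"
  | "sent"
  | "failed_retryable"
  | "needs_review";

interface EmailDelivery {
  paymentId: string;
  emailType: string;
  status: DeliveryStatus;
  attempts: number;
}

// Hypothetical in-memory stand-in for the email_deliveries table.
class DeliveryTable {
  private rows = new Map<string, EmailDelivery>();

  // Mirrors PRIMARY KEY (payment_id, email_type): one row per logical effect.
  private key(paymentId: string, emailType: string): string {
    return JSON.stringify([paymentId, emailType]);
  }

  // Insert-if-absent: a redelivered message finds the existing row
  // instead of creating a second logical effect.
  ensure(paymentId: string, emailType: string): EmailDelivery {
    const k = this.key(paymentId, emailType);
    let row = this.rows.get(k);
    if (row === undefined) {
      row = { paymentId, emailType, status: "pending", attempts: 0 };
      this.rows.set(k, row);
    }
    return row;
  }

  count(): number {
    return this.rows.size;
  }
}

// Two queue deliveries of the same logical work, possibly with different
// queue message IDs, resolve to the same delivery row.
const table = new DeliveryTable();
const firstAttempt = table.ensure("payment_123", "receipt");
firstAttempt.status = "sent";
const redelivery = table.ensure("payment_123", "receipt");
```

The second lookup sees the state left by the first attempt, which is the property the queue message ID alone cannot provide.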
Use Explicit State Transitions
Background jobs become easier to reason about when the state machine is visible.
For a receipt email:
| State | Meaning | Next safe action |
|---|---|---|
| pending | Work has been requested but not attempted | Worker may claim and send |
| sending | Worker is currently attempting the side effect | Another worker should not send blindly |
| sent | Provider accepted the message and local state recorded it | Retry should no-op or return success |
| failed_retryable | Temporary failure occurred | Retry later with backoff |
| needs_review | Outcome is ambiguous or repeatedly failing | Stop automatic retries |
This is stronger than checking receiptSentAt at the end of the handler.
The state machine lets a retry inspect the workflow before repeating the side effect. It also gives operators a language for manual repair. A job in needs_review is different from a job that is still waiting its turn.
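One way the state table could drive a handler is sketched below. It is a minimal illustration under assumptions: the EmailClient interface and the in-memory deliveries map are hypothetical, and a real implementation would make the pending-to-sending transition an atomic conditional update in the database.

```typescript
type Status = "pending" | "sending" | "sent" | "failed_retryable" | "needs_review";

interface Delivery {
  status: Status;
  attempts: number;
  providerMessageId?: string;
}

// Hypothetical provider client; a real one would make a network call.
interface EmailClient {
  send(paymentId: string): Promise<{ providerMessageId: string }>;
}

// In-memory stand-in for the delivery table, keyed by payment ID.
const deliveries = new Map<string, Delivery>();

async function handleReceiptJob(paymentId: string, email: EmailClient): Promise<Status> {
  const delivery: Delivery = deliveries.get(paymentId) ?? { status: "pending", attempts: 0 };
  deliveries.set(paymentId, delivery);

  switch (delivery.status) {
    case "sent":
    case "needs_review":
      // Already done, or parked for a human: a retry must not send again.
      return delivery.status;
    case "sending":
      // A previous attempt may have reached the provider. The outcome is
      // ambiguous, so stop automatic retries instead of sending blindly.
      delivery.status = "needs_review";
      return delivery.status;
    case "pending":
    case "failed_retryable":
      break;
  }

  // Record intent before the side effect so a crash leaves a visible trace.
  delivery.status = "sending";
  delivery.attempts += 1;
  try {
    const result = await email.send(paymentId);
    delivery.providerMessageId = result.providerMessageId;
    delivery.status = "sent";
  } catch {
    // Simplification: every thrown error is treated as a clean failure.
    // An ambiguous timeout would belong in needs_review instead.
    delivery.status = "failed_retryable";
  }
  return delivery.status;
}
```

The important design choice is that the handler reads the state before the side effect, so a second delivery of the same job returns early instead of reaching the provider.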
For workflows that combine database writes with published messages, the Transactional Outbox Pattern solves a nearby boundary: making sure durable state and outgoing work do not drift apart.
Claim Work Safely
If multiple workers can process the same pool of work, claiming must be atomic.
A weak database-backed queue often does this:
    SELECT id, payload
    FROM jobs
    WHERE status = 'queued'
    ORDER BY run_at ASC
    LIMIT 1;
Then the worker updates the selected row.
That can race. Two workers can select the same row before either marks it claimed.
A safer shape claims and returns work in a single atomic statement:
    WITH next_job AS (
      SELECT id
      FROM jobs
      WHERE status = 'queued'
        AND run_at <= now()
      ORDER BY priority DESC, run_at ASC, id ASC
      FOR UPDATE SKIP LOCKED
      LIMIT 1
    )
    UPDATE jobs
    SET status = 'running',
        locked_by = $1,
        locked_at = now(),
        attempts = attempts + 1
    WHERE id = (SELECT id FROM next_job)
    RETURNING id, payload, attempts;
SKIP LOCKED is not the whole queue design, but it is a practical primitive for avoiding duplicate row claims under worker concurrency. The broader database-backed queue design is covered in PostgreSQL Job Queues with SKIP LOCKED.
For broker-backed queues, the mechanism may be a visibility timeout, lease, acknowledgment, or consumer group assignment. The principle is the same:
one worker should own an attempt, and the system should know when that ownership expired.
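The ownership principle can be sketched independently of any broker. The LeaseTable class below is a hypothetical in-memory model, not a real queue API: a worker owns an attempt only while its lease is live, and a worker whose lease has expired cannot record completion.

```typescript
interface Lease {
  owner: string;
  expiresAt: number; // epoch milliseconds
}

// Hypothetical in-memory lease table; a broker or database would hold this state.
class LeaseTable {
  private leases = new Map<string, Lease>();
  private leaseMs: number;

  constructor(leaseMs: number) {
    this.leaseMs = leaseMs;
  }

  // A claim succeeds only if nobody holds a live lease on the job.
  tryClaim(jobId: string, workerId: string, now: number): boolean {
    const current = this.leases.get(jobId);
    if (current !== undefined && current.expiresAt > now) {
      return false;
    }
    this.leases.set(jobId, { owner: workerId, expiresAt: now + this.leaseMs });
    return true;
  }

  // Completion counts only while the caller still owns a live lease.
  // A real system would also mark the job as done here.
  complete(jobId: string, workerId: string, now: number): boolean {
    const current = this.leases.get(jobId);
    if (current === undefined || current.owner !== workerId || current.expiresAt <= now) {
      return false;
    }
    this.leases.delete(jobId);
    return true;
  }
}

const leases = new LeaseTable(30_000);
const claimedByA = leases.tryClaim("job_1", "worker-a", 0);      // no lease yet
const claimedByB = leases.tryClaim("job_1", "worker-b", 10_000); // lease still live
const reclaimed = leases.tryClaim("job_1", "worker-b", 31_000);  // lease expired
const lateFinish = leases.complete("job_1", "worker-a", 32_000); // A lost ownership
```

Rejecting the late completion from worker-a matters as much as allowing the reclaim: it is how the system notices that two attempts overlapped.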
Set Visibility Timeouts From Real Handler Behavior
Visibility timeouts are one of the easiest queue settings to get wrong.
AWS describes the SQS visibility timeout as the period after a consumer receives a message when the message remains temporarily invisible to other consumers. If the consumer does not process and delete the message before the timeout expires, the message becomes visible again and can be processed by another consumer. See the AWS guide to SQS visibility timeout.
The production failure modes are direct:
| Timeout choice | Failure mode |
|---|---|
| Too short | Long-running jobs are picked up by another worker before the first attempt finishes |
| Too long | Crashed jobs take too long to retry |
| No extension/heartbeat | Legitimate long jobs become duplicate work |
| No maximum bound | Stuck work can remain hidden too long |
Set the timeout from measured handler duration, not hope.
For long-running work, prefer splitting the job into smaller steps or extending the lease deliberately while the worker is still healthy. A job that needs 40 minutes to finish may really be a workflow made of several shorter jobs.
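A minimal sketch of deriving a timeout from measured durations follows. The nearest-rank p99, the safety factor of 2, and the bounds are assumptions chosen for illustration, not recommended values; suggestVisibilityTimeoutMs is a hypothetical helper.

```typescript
interface TimeoutBounds {
  minMs: number;
  maxMs: number;
  safetyFactor: number;
}

// Nearest-rank percentile over observed handler durations.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) {
    throw new Error("no duration samples");
  }
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  const value = sorted[index];
  if (value === undefined) {
    throw new Error("percentile index out of range");
  }
  return value;
}

// Pad the slow tail, then clamp so the timeout is never tiny or unbounded.
function suggestVisibilityTimeoutMs(samplesMs: number[], bounds: TimeoutBounds): number {
  const p99 = percentile(samplesMs, 99);
  const padded = Math.ceil(p99 * bounds.safetyFactor);
  return Math.min(bounds.maxMs, Math.max(bounds.minMs, padded));
}

// Assumed bounds: at least 30 seconds, at most 15 minutes.
const bounds: TimeoutBounds = { minMs: 30_000, maxMs: 15 * 60_000, safetyFactor: 2 };
const suggested = suggestVisibilityTimeoutMs([1_000, 2_000, 3_000, 4_000, 40_000], bounds);
```

With these samples the slow 40-second run dominates the result. When the padded value hits the upper bound, that is a hint the job may need to be split or to extend its lease while running.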
Retry Policy Is Part Of Correctness
Retries are not just an error-handling detail.
They decide how much repeated work the system will generate under failure.
Use a retry table like this:
| Failure type | Example | Retry? | Policy |
|---|---|---|---|
| Transient dependency error | 503, timeout, connection reset | Yes | exponential backoff with jitter |
| Rate limit | 429 or quota error | Yes, carefully | respect provider retry hint and cap concurrency |
| Validation error | malformed payload | No | mark failed or send to review |
| Missing permanent record | deleted user, unknown payment | Usually no | stop and surface data issue |
| Ambiguous side effect | provider timed out after possible commit | Not blindly | reconcile before repeating |
| Code bug | handler throws same error every attempt | Not after threshold | DLQ with triage |
The dangerous retry is the one that repeats a non-idempotent side effect because the local process did not receive a clean response.
This is the same overload pattern described in Adding Retries Can Make Outages Worse: when the dependency is degraded, retries can turn recovery into traffic amplification.
For background jobs, the retry policy should include:
- maximum attempts
- backoff and jitter
- retryable error classes
- permanent failure classes
- concurrency caps per dependency
- dead-letter threshold
- manual replay rules
Do not let "try again later" be the entire design.
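The policy items above might be combined along these lines. This is a sketch under assumptions: the failure kinds, the RetryPolicy shape, and the decideRetry function are hypothetical, and the backoff uses the "full jitter" variant, where the delay is drawn uniformly between zero and an exponentially growing, capped ceiling.

```typescript
type FailureKind = "transient" | "rate_limited" | "permanent" | "ambiguous";

type RetryDecision =
  | { action: "retry"; delayMs: number }
  | { action: "dead_letter" }
  | { action: "fail" }
  | { action: "reconcile" };

interface RetryPolicy {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

// Full jitter: uniform delay in [0, min(cap, base * 2^(attempt - 1))).
function backoffWithJitterMs(
  attempt: number,
  policy: RetryPolicy,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(policy.maxDelayMs, policy.baseDelayMs * 2 ** (attempt - 1));
  return Math.floor(random() * ceiling);
}

// attempt is the number of attempts already made, starting at 1.
function decideRetry(
  kind: FailureKind,
  attempt: number,
  policy: RetryPolicy,
  retryAfterMs: number = 0,
  random: () => number = Math.random,
): RetryDecision {
  if (kind === "permanent") return { action: "fail" };
  // Ambiguous side effects are reconciled, never blindly repeated.
  if (kind === "ambiguous") return { action: "reconcile" };
  if (attempt >= policy.maxAttempts) return { action: "dead_letter" };

  const backoff = backoffWithJitterMs(attempt, policy, random);
  // For rate limits, never retry sooner than the provider's hint.
  const delayMs = kind === "rate_limited" ? Math.max(retryAfterMs, backoff) : backoff;
  return { action: "retry", delayMs };
}

const policy: RetryPolicy = { maxAttempts: 5, baseDelayMs: 1_000, maxDelayMs: 60_000 };
const decision = decideRetry("transient", 3, policy);
```

Concurrency caps per dependency are not modeled here; they usually live in the worker pool rather than in the per-job decision. The random source is injectable so the policy can be tested deterministically.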
Dead-Letter Queues Need A Triage Plan
A dead-letter queue is not a trash folder.
AWS describes DLQs as a way to isolate messages that were not processed successfully so teams can inspect logs, analyze message contents, and move messages out through redrive. See the AWS docs on SQS dead-letter queues.
That only helps if the team has an actual plan.
For each DLQ, define:
| Question | Good answer |
|---|---|
| Who owns this DLQ? | A named team or on-call rotation |
| What triggers an alert? | Count, age, or business priority threshold |
| What is safe to replay? | Only messages whose handler is replay-safe |
| What requires code change first? | Poison messages with deterministic failure |
| What context is preserved? | original payload, attempts, error class, timestamps, correlation ID |
| How is replay audited? | operator, time, reason, affected business keys |
The worst DLQ outcome is quiet accumulation. The second worst is blind redrive, where the same poison messages are pushed back into the main queue and cause the same failure again.
DLQ triage should decide whether to fix data, fix code, discard, replay, or escalate.
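To make the replay rules concrete, here is a small sketch of a redrive guard. The DeadLetter fields and the planRedrive function are hypothetical; the idea is only that a redrive replays messages that are replay-safe and not deterministically failing, holds back the rest, and leaves an audit record.

```typescript
interface DeadLetter {
  businessKey: string;
  errorClass: string;
  attempts: number;
  deterministicFailure: boolean; // same error on every attempt
  handlerReplaySafe: boolean;
}

interface ReplayAudit {
  operator: string;
  reason: string;
  at: string; // ISO 8601 timestamp
  replayedKeys: string[];
  heldBackKeys: string[];
}

function planRedrive(
  messages: DeadLetter[],
  operator: string,
  reason: string,
  now: Date,
): { toReplay: DeadLetter[]; audit: ReplayAudit } {
  const toReplay: DeadLetter[] = [];
  const heldBackKeys: string[] = [];

  for (const message of messages) {
    // Poison messages need a code or data fix first; unsafe handlers
    // would duplicate side effects if replayed.
    if (message.handlerReplaySafe && !message.deterministicFailure) {
      toReplay.push(message);
    } else {
      heldBackKeys.push(message.businessKey);
    }
  }

  const audit: ReplayAudit = {
    operator,
    reason,
    at: now.toISOString(),
    replayedKeys: toReplay.map((m) => m.businessKey),
    heldBackKeys,
  };
  return { toReplay, audit };
}

const plan = planRedrive(
  [
    { businessKey: "payment_1:receipt", errorClass: "timeout", attempts: 5, deterministicFailure: false, handlerReplaySafe: true },
    { businessKey: "payment_2:receipt", errorClass: "validation", attempts: 5, deterministicFailure: true, handlerReplaySafe: true },
  ],
  "alice",
  "provider outage resolved",
  new Date(0),
);
```

Held-back messages still need a decision (fix data, fix code, discard, or escalate); the guard only prevents them from being pushed back into the main queue unchanged.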
Reconciliation Is Not Optional For Important Workflows
Retries help with many transient failures. They do not catch every ambiguous outcome.
Important workflows need reconciliation.
Examples:
- payments marked paid without receipt delivery
- invoices generated but not sent
- shipments marked ready without carrier labels
- users charged but subscription state not activated
- exports completed but no notification sent
- webhook events stored but downstream work missing
Reconciliation asks:
What should exist if the workflow completed correctly, and what actually exists?
A simple reconciliation query might find paid payments without a sent receipt:
    SELECT p.id
    FROM payments p
    LEFT JOIN email_deliveries d
      ON d.payment_id = p.id
      AND d.email_type = 'receipt'
      AND d.status = 'sent'
    WHERE p.status = 'paid'
      AND d.payment_id IS NULL
      AND p.paid_at < now() - interval '10 minutes';
That query is not a queue feature. It is a business-correctness feature.
It gives the system a second chance to notice work that fell between a queue retry, a worker crash, a deployment, or an ambiguous provider response.
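A reconciliation job built around that query could look roughly like the sketch below. It mirrors the query's logic over in-memory arrays so the example is self-contained; the record shapes and the enqueue callback are assumptions, and the repair path should go through the same replay-safe handler as the original job.

```typescript
interface Payment {
  id: string;
  status: "pending" | "paid" | "refunded";
  paidAt: number | null; // epoch milliseconds
}

interface EmailDelivery {
  paymentId: string;
  emailType: string;
  status: string;
}

// Same rule as the query: paid, older than the grace period,
// and no receipt delivery in status 'sent'.
function findPaidWithoutReceipt(
  payments: Payment[],
  deliveries: EmailDelivery[],
  now: number,
  graceMs: number = 10 * 60_000,
): string[] {
  const sent = new Set<string>();
  for (const d of deliveries) {
    if (d.emailType === "receipt" && d.status === "sent") {
      sent.add(d.paymentId);
    }
  }
  return payments
    .filter((p) => p.status === "paid" && p.paidAt !== null && p.paidAt < now - graceMs)
    .filter((p) => !sent.has(p.id))
    .map((p) => p.id);
}

// Hypothetical repair step: hand each mismatch back to the normal job path.
function reconcileReceipts(
  payments: Payment[],
  deliveries: EmailDelivery[],
  now: number,
  enqueue: (paymentId: string) => void,
): number {
  const missing = findPaidWithoutReceipt(payments, deliveries, now);
  for (const id of missing) {
    enqueue(id);
  }
  return missing.length;
}
```

The returned count is worth exporting as a metric: a mismatch count that stays above zero means the queue is moving without producing the expected outcome.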
Monitor Workflow Health, Not Just Queue Health
Queue dashboards are necessary but incomplete.
Infrastructure metrics tell you whether the worker system is moving:
| Metric | What it tells you |
|---|---|
| Queue depth | How much work is waiting |
| Oldest job age | Whether work is falling behind |
| Throughput | How many jobs complete per minute |
| Retry rate | Whether failures are rising |
| DLQ count and age | Whether failures are being isolated |
| Worker concurrency | Whether enough workers are active |
| Handler duration | Whether jobs are getting slower |
Business metrics tell you whether the workflow is correct:
| Metric | What it tells you |
|---|---|
| Duplicate side effects | Idempotency or replay safety is failing |
| Reconciliation mismatches | Queue motion did not produce expected outcome |
| Jobs completed without downstream proof | Completion state is too optimistic |
| Manual replay count | Automation is not recovering cleanly |
| Deduplication hits | Duplicate delivery is happening and being handled |
| Age of business workflow | User-visible work is stale even if workers are busy |
A queue can be empty because all work succeeded.
It can also be empty because jobs were dropped, dead-lettered, acknowledged too early, or marked complete without the side effect the business actually needed.
Measure both.
Test The Failure Modes
Background-job tests should not only prove the happy path.
They should prove replay safety.
Useful test cases:
- handler runs twice for the same business key
- worker crashes after side effect but before completion update
- dependency returns timeout after committing the side effect
- job is retried after a visibility timeout expires
- poison message moves to DLQ after the configured threshold
- delayed job runs after the underlying record changes
- manual replay does not duplicate the side effect
For the receipt email example:
    emailClient.send.mockResolvedValueOnce({ providerMessageId: 'msg_123' })
    await handleReceiptJob({ paymentId: 'pay_123' })
    await handleReceiptJob({ paymentId: 'pay_123' })
    expect(emailClient.send).toHaveBeenCalledTimes(1)
    expect(await emailDeliveries.find('pay_123', 'receipt')).toMatchObject({
      status: 'sent',
      providerMessageId: 'msg_123',
    })
The invariant is more important than the exact implementation:
one paid payment should produce one receipt delivery record and at most one provider send for that receipt.
For the broader shared-state risk, pair this with How to Prevent Race Conditions in Backend Systems.
Rollout Checklist For A New Background Job
Before shipping a production background job, check:
- What business key deduplicates the work?
- Can the handler run twice without duplicate side effects?
- Which state transition proves the job is complete?
- What happens if the side effect succeeds but local completion fails?
- Which failures are retryable, permanent, or ambiguous?
- What is the maximum attempt count?
- What backoff and jitter are used?
- Does the visibility timeout match measured handler duration?
- What sends a message to DLQ?
- Who reviews the DLQ?
- How are messages replayed safely?
- What business metric proves the workflow is correct?
- Is there a reconciliation job for important outcomes?
- Is the payload versioned enough to survive deploys?
- Are correlation IDs or business IDs preserved across the queue boundary?
This checklist is deliberately practical. Background jobs fail in production through small missing decisions, not because teams forgot what a queue is.
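One checklist item, payload versioning, is easy to sketch. The example assumes a hypothetical version 2 payload that adds version and emailType fields to the original { paymentId } shape, and a parseReceiptJob helper that upgrades old messages still sitting in the queue after a deploy.

```typescript
// Hypothetical current payload shape. Jobs enqueued before the deploy
// carry only { paymentId } and no version field.
interface ReceiptJobV2 {
  version: 2;
  paymentId: string;
  emailType: string;
}

function parseReceiptJob(raw: unknown): ReceiptJobV2 {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("payload is not an object");
  }
  const fields = raw as Record<string, unknown>;

  const paymentId = fields["paymentId"];
  if (typeof paymentId !== "string") {
    throw new Error("payload is missing paymentId");
  }

  const version = fields["version"];
  if (version === undefined) {
    // Legacy unversioned job: upgrade with the default it implied.
    return { version: 2, paymentId, emailType: "receipt" };
  }

  const emailType = fields["emailType"];
  if (version === 2 && typeof emailType === "string") {
    return { version: 2, paymentId, emailType };
  }

  // Unknown versions are a permanent failure: retrying will not help,
  // so the job should go to review rather than back to the queue.
  throw new Error(`unsupported payload version: ${String(version)}`);
}

const upgraded = parseReceiptJob({ paymentId: "pay_123" });
const current = parseReceiptJob({ version: 2, paymentId: "pay_123", emailType: "receipt" });
```

Parsing at the queue boundary keeps the rest of the handler working with a single shape, and it turns an unexpected payload into an explicit failure instead of a confusing error deep inside the handler.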
Final Takeaway
Background jobs are useful because they decouple slow or unreliable work from the request path.
They are dangerous when that decoupling is mistaken for correctness.
Production-safe background jobs need replay-safe handlers, durable business keys, explicit state transitions, bounded retries, visibility timeouts based on real execution, dead-letter triage, reconciliation, and dashboards that measure workflow correctness.
The queue can deliver work. The application still has to make the work safe.