Background Jobs in Production

Background jobs in production are not just request handlers that run later.

Moving work to a queue can reduce user-facing latency, absorb bursts, and isolate slow dependencies. It also changes the failure model. A job can run twice. It can run late. It can partially succeed. It can be retried against state that changed after the original request. It can sit in a dead-letter queue while the main dashboard looks green.

The safe design question is not:

Did the job run?

It is:

Did the business workflow reach the correct durable state, even if the job was delayed, retried, duplicated, or replayed?

For the broader reliability cluster around retries, queues, cascading failures, and recovery patterns, see the Backend Reliability hub.

What Changes When Work Leaves The Request Path

Background processing usually starts with a reasonable move:

send receipt emails after the response
generate invoices asynchronously
resize uploaded images in workers
process webhook events outside the provider request
publish analytics and search-index updates after commit
run billing, reconciliation, or export jobs on a schedule

That lowers request latency. It does not remove correctness work.

Once work leaves the request path, production adds new questions:

What if the job runs twice?
What if it runs after the user changed the underlying record?
What if the side effect succeeds but the worker crashes before recording completion?
What if retries hammer a dependency that is already degraded?
What if the queue is empty, but the workflow is still wrong?
What if old jobs execute after a deploy changed the payload shape?

This is why background jobs are a backend reliability topic, not just an infrastructure convenience.

Assume At-Least-Once Delivery

Most production job systems prefer redelivery over silent loss.

Amazon SQS documents this explicitly for standard queues: messages can be delivered more than once, and applications should be idempotent when processing repeated messages. See the AWS documentation on SQS at-least-once delivery.

That model is usually the right trade-off. Losing work is often worse than repeating work.

But at-least-once delivery does not mean exactly-once side effects.

It does not guarantee:

one email sent
one external charge
one invoice generated
one webhook effect applied
one search-index update
one state transition

It means the worker may see the message again.

So the handler must be safe when it does.

A Concrete Failure: Duplicate Receipt Emails

Suppose a worker sends a receipt email after payment.

The handler looks straightforward:

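export async function handleReceiptJob(job: ReceiptJob) {
  const payment = await db.payment.findUnique({
    where: { id: job.paymentId },
  })

  if (!payment) {
    throw new Error('Payment not found')
  }

  // Side effect happens first: once the provider accepts, the email is out.
  await emailClient.send({
    to: payment.customerEmail,
    template: 'receipt',
    data: { paymentId: payment.id },
  })

  // Completion is recorded second. A crash between these two calls leaves
  // no durable record that the receipt was already sent.
  await db.payment.update({
    where: { id: payment.id },
    data: { receiptSentAt: new Date() },
  })
}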

This code is reasonable when read sequentially.

It fails when the side effect succeeds and local completion does not.

10:14:02 payment succeeds
10:14:03 API enqueues send-receipt(payment_123)
10:14:04 worker A receives the job
10:14:05 email provider accepts the message
10:14:05 worker A crashes before writing receiptSentAt
10:14:35 job becomes visible again
10:14:36 worker B receives the same job
10:14:37 worker B sends the receipt again

From the queue's point of view, this is recovery. The message was not acknowledged, so it came back.

From the customer's point of view, the workflow is wrong.

The bug is not "the queue duplicated the job." The queue did what it was allowed to do. The bug is that the handler treated a retried command as a one-time action.

Design Jobs As Replayable Commands

A production-safe job should be designed as a replayable command.

That means the handler can answer:

What business key identifies this logical work?
Has the effect already happened?
Is the previous attempt still in progress?
If the previous attempt partially succeeded, what can the next attempt safely do?
Which failures should retry?
Which failures should stop and require review?

For the receipt job, the queue message ID is not the best business key, because the same logical work can be enqueued more than once under different message IDs. The business key is closer to:

payment_123 + receipt_email

That key can be stored in a delivery table:

CREATE TABLE email_deliveries (
  payment_id uuid NOT NULL,
  email_type text NOT NULL,
  status text NOT NULL,
  provider_message_id text,
  attempts integer NOT NULL DEFAULT 0,
  last_error text,
  sent_at timestamptz,
  updated_at timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (payment_id, email_type)
);

Now the handler has durable memory outside the queue:

payment_id and email_type define the logical effect
status explains whether the work is pending, sent, failed, or under review
attempts supports retry policy and triage
provider_message_id supports reconciliation with the email provider

The queue is the delivery mechanism. The delivery table is the business-state memory.

That distinction is what keeps retries from becoming duplicate side effects.
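As a minimal sketch of that distinction, the handler below looks up the delivery record by business key before it sends anything. `DeliveryRecord`, `EmailClient`, `sendReceiptOnce`, and the in-memory `Map` are hypothetical stand-ins for the email_deliveries table and the email provider, not a real library API. It covers replays that arrive after completion was recorded. It does not by itself close the crash window between the send and the write.

```typescript
// Hypothetical shapes for illustration; the Map stands in for the email_deliveries table.
type DeliveryStatus = "pending" | "sent";

interface DeliveryRecord {
  status: DeliveryStatus;
  providerMessageId?: string;
  attempts: number;
}

interface EmailClient {
  send(input: { to: string; template: string }): Promise<{ providerMessageId: string }>;
}

// Business key: one record per (payment, email type), mirroring the composite primary key.
const deliveryKey = (paymentId: string, emailType: string): string =>
  `${paymentId}:${emailType}`;

async function sendReceiptOnce(
  store: Map<string, DeliveryRecord>,
  email: EmailClient,
  paymentId: string,
  customerEmail: string,
): Promise<"sent" | "skipped"> {
  const key = deliveryKey(paymentId, "receipt");
  const record: DeliveryRecord = store.get(key) ?? { status: "pending", attempts: 0 };

  // A replayed job finds the durable record and does nothing.
  if (record.status === "sent") return "skipped";

  // Record the attempt before the side effect so triage can see it.
  record.attempts += 1;
  store.set(key, record);

  const result = await email.send({ to: customerEmail, template: "receipt" });

  store.set(key, {
    ...record,
    status: "sent",
    providerMessageId: result.providerMessageId,
  });
  return "sent";
}
```

Calling `sendReceiptOnce` twice with the same payment ID sends once and skips once, regardless of which queue message triggered each call.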

Use Explicit State Transitions

Background jobs become easier to reason about when the state machine is visible.

For a receipt email:

State	Meaning	Next safe action
`pending`	Work has been requested but not attempted	Worker may claim and send
`sending`	Worker is currently attempting the side effect	Another worker should not send blindly
`sent`	Provider accepted the message and local state recorded it	Retry should no-op or return success
`failed_retryable`	Temporary failure occurred	Retry later with backoff
`needs_review`	Outcome is ambiguous or repeatedly failing	Stop automatic retries

This is stronger than writing receiptSentAt at the end of the handler.

The state machine lets a retry inspect the workflow before repeating the side effect. It also gives operators a language for manual repair. A job in needs_review is different from a job that is still waiting its turn.
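The table above can be sketched as a pure function that a worker calls before touching the side effect. The names `nextSafeAction`, `DeliveryRow`, and the five-minute `STALE_SENDING_MS` threshold are assumptions for illustration. A real system would choose the threshold from its own lease or visibility settings.

```typescript
type ReceiptState =
  | "pending"
  | "sending"
  | "sent"
  | "failed_retryable"
  | "needs_review";

type NextAction =
  | "claim_and_send"
  | "wait"
  | "no_op"
  | "retry_with_backoff"
  | "stop_for_review";

interface DeliveryRow {
  status: ReceiptState;
  updatedAtMs: number; // time of the last state change, epoch milliseconds
}

// Assumption: a `sending` row older than this is treated as an abandoned attempt.
const STALE_SENDING_MS = 5 * 60 * 1000;

function nextSafeAction(row: DeliveryRow, nowMs: number): NextAction {
  switch (row.status) {
    case "pending":
      return "claim_and_send";
    case "sending":
      // Another worker may still be sending. If the attempt looks abandoned,
      // the outcome is ambiguous, so route it to review instead of resending.
      return nowMs - row.updatedAtMs > STALE_SENDING_MS ? "stop_for_review" : "wait";
    case "sent":
      return "no_op";
    case "failed_retryable":
      return "retry_with_backoff";
    case "needs_review":
      return "stop_for_review";
  }
}
```

Keeping the decision in one function means the retry path, the manual replay path, and the reconciliation job can all ask the same question and get the same answer.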

For workflows that combine database writes with published messages, the Transactional Outbox Pattern solves a nearby boundary: making sure durable state and outgoing work do not drift apart.

Claim Work Safely

If multiple workers can process the same pool of work, claiming must be atomic.

A weak database-backed queue often does this:

SELECT id, payload
FROM jobs
WHERE status = 'queued'
ORDER BY run_at ASC
LIMIT 1;

Then the worker updates the selected row.

That can race. Two workers can select the same row before either marks it claimed.

A safer shape locks, claims, and returns the work in one statement:

WITH next_job AS (
  SELECT id
  FROM jobs
  WHERE status = 'queued'
    AND run_at <= now()
  ORDER BY priority DESC, run_at ASC, id ASC
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
UPDATE jobs
SET status = 'running',
    locked_by = $1,
    locked_at = now(),
    attempts = attempts + 1
WHERE id = (SELECT id FROM next_job)
RETURNING id, payload, attempts;

SKIP LOCKED is not the whole queue design, but it is a practical primitive for avoiding duplicate row claims under worker concurrency. The broader database-backed queue design is covered in PostgreSQL Job Queues with SKIP LOCKED.

For broker-backed queues, the mechanism may be a visibility timeout, lease, acknowledgment, or consumer group assignment. The principle is the same:

one worker should own an attempt, and the system should know when that ownership expired.
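A minimal sketch of that ownership rule, assuming an in-memory `Map` in place of the queue's storage. `JobLease`, `tryClaim`, and `stillOwns` are hypothetical names. In a real system the check-and-set must be atomic, which is what the SQL claim or a broker lease provides.

```typescript
interface JobLease {
  lockedBy: string;
  lockExpiresAtMs: number; // epoch milliseconds
}

// Returns true if workerId owns the attempt after the call.
// Calling it again as the current owner renews the lease, which works as a heartbeat.
function tryClaim(
  leases: Map<string, JobLease>,
  jobId: string,
  workerId: string,
  nowMs: number,
  leaseMs: number,
): boolean {
  const current = leases.get(jobId);
  const heldByOther =
    current !== undefined &&
    current.lockedBy !== workerId &&
    current.lockExpiresAtMs > nowMs;
  if (heldByOther) return false;

  leases.set(jobId, { lockedBy: workerId, lockExpiresAtMs: nowMs + leaseMs });
  return true;
}

// A worker checks this before recording completion, so that an attempt
// whose lease expired does not overwrite the new owner's result.
function stillOwns(
  leases: Map<string, JobLease>,
  jobId: string,
  workerId: string,
  nowMs: number,
): boolean {
  const current = leases.get(jobId);
  return (
    current !== undefined &&
    current.lockedBy === workerId &&
    current.lockExpiresAtMs > nowMs
  );
}
```

The second function matters as much as the first: a slow worker that wakes up after its lease expired needs a way to discover that it no longer owns the attempt.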

Set Visibility Timeouts From Real Handler Behavior

Visibility timeouts are one of the easiest queue settings to get wrong.

AWS describes the SQS visibility timeout as the period after a consumer receives a message when the message remains temporarily invisible to other consumers. If the consumer does not process and delete the message before the timeout expires, the message becomes visible again and can be processed by another consumer. See the AWS guide to SQS visibility timeout.

The production failure modes are direct:

Timeout choice	Failure mode
Too short	Long-running jobs are picked up by another worker before the first attempt finishes
Too long	Crashed jobs take too long to retry
No extension/heartbeat	Legitimate long jobs become duplicate work
No maximum bound	Stuck work can remain hidden too long

Set the timeout from measured handler duration, not hope.

For long-running work, prefer splitting the job into smaller steps or extending the lease deliberately while the worker is still healthy. A job that needs 40 minutes to finish may really be a workflow made of several shorter jobs.
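One way to turn measured handler duration into a setting is sketched below. The policy (p99 multiplied by a safety factor, clamped to a minimum and maximum) and the default numbers are assumptions for illustration, not a recommendation from AWS or any queue vendor.

```typescript
// Nearest-rank percentile over measured handler durations in milliseconds.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("need at least one sample");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// Assumed policy: p99 times a safety factor, clamped to [minMs, maxMs].
function suggestVisibilityTimeoutMs(
  samplesMs: number[],
  opts = { safetyFactor: 2, minMs: 30 * 1000, maxMs: 15 * 60 * 1000 },
): number {
  const p99 = percentile(samplesMs, 99);
  const scaled = Math.ceil(p99 * opts.safetyFactor);
  return Math.min(opts.maxMs, Math.max(opts.minMs, scaled));
}
```

If the suggested value keeps hitting the maximum, that is a signal to split the job or add lease extension rather than to raise the cap.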

Retry Policy Is Part Of Correctness

Retries are not just an error-handling detail.

They decide how much repeated work the system will generate under failure.

Classify failures before deciding whether to retry them:

Failure type	Example	Retry?	Policy
Transient dependency error	`503`, timeout, connection reset	Yes	exponential backoff with jitter
Rate limit	`429` or quota error	Yes, carefully	respect provider retry hint and cap concurrency
Validation error	malformed payload	No	mark failed or send to review
Missing permanent record	deleted user, unknown payment	Usually no	stop and surface data issue
Ambiguous side effect	provider timed out after possible commit	Not blindly	reconcile before repeating
Code bug	handler throws same error every attempt	Only up to a threshold	DLQ with triage

The dangerous retry is the one that repeats a non-idempotent side effect because the local process did not receive a clean response.

Retries also multiply load. As described in Adding Retries Can Make Outages Worse, when a dependency is degraded, retries can turn recovery into traffic amplification.

For background jobs, the retry policy should include:

maximum attempts
backoff and jitter
retryable error classes
permanent failure classes
concurrency caps per dependency
dead-letter threshold
manual replay rules

Do not let "try again later" be the entire design.
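The policy items above can be sketched as one decision function. The failure classes mirror the retry table. `decideRetry`, `RetryPolicy`, and the injectable `random` parameter are hypothetical names, and the backoff uses the common "full jitter" variant, which is one reasonable choice among several.

```typescript
type FailureClass =
  | "transient"
  | "rate_limited"
  | "validation"
  | "missing_record"
  | "ambiguous"
  | "code_bug";

type RetryDecision =
  | { action: "retry"; delayMs: number }
  | { action: "dead_letter" }
  | { action: "fail_permanently" }
  | { action: "reconcile_first" };

interface RetryPolicy {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

// Full jitter: a random delay in [0, min(maxDelayMs, baseDelayMs * 2^(attempt - 1))).
function backoffWithJitter(
  attempt: number,
  policy: RetryPolicy,
  random: () => number = Math.random,
): number {
  const exponential = policy.baseDelayMs * Math.pow(2, attempt - 1);
  const ceiling = Math.min(policy.maxDelayMs, exponential);
  return Math.floor(random() * ceiling);
}

function decideRetry(
  failure: FailureClass,
  attempt: number, // 1-based count of attempts made so far
  policy: RetryPolicy,
  retryAfterMs?: number, // provider hint, if one was returned
  random: () => number = Math.random,
): RetryDecision {
  // Permanent failures never retry.
  if (failure === "validation" || failure === "missing_record") {
    return { action: "fail_permanently" };
  }
  // Ambiguous side effects are reconciled before anything is repeated.
  if (failure === "ambiguous") return { action: "reconcile_first" };
  // Everything else, including a suspected code bug, stops at the threshold.
  if (attempt >= policy.maxAttempts) return { action: "dead_letter" };

  const backoff = backoffWithJitter(attempt, policy, random);
  const delayMs =
    failure === "rate_limited" && retryAfterMs !== undefined
      ? Math.max(retryAfterMs, backoff)
      : backoff;
  return { action: "retry", delayMs };
}
```

Concurrency caps per dependency are not shown here. They belong in the worker pool or rate limiter, not in the per-job decision.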

Dead-Letter Queues Need A Triage Plan

A dead-letter queue is not a trash folder.

AWS describes DLQs as a way to isolate messages that were not processed successfully so teams can inspect logs, analyze message contents, and move messages out through redrive. See the AWS docs on SQS dead-letter queues.

That only helps if the team has an actual plan.

For each DLQ, define:

Question	Good answer
Who owns this DLQ?	A named team or on-call rotation
What triggers an alert?	Count, age, or business priority threshold
What is safe to replay?	Only messages whose handler is replay-safe
What requires code change first?	Poison messages with deterministic failure
What context is preserved?	original payload, attempts, error class, timestamps, correlation ID
How is replay audited?	operator, time, reason, affected business keys

The worst DLQ outcome is quiet accumulation. The second worst is blind redrive, where the same poison messages are pushed back into the main queue and cause the same failure again.

DLQ triage should decide whether to fix data, fix code, discard, replay, or escalate.
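A guarded redrive might look like the sketch below. The `DeadLetter` shape, the error classes, and the `triage` rules are assumptions chosen to match the table above. The point is that replay is an explicit decision with an audit record, not a bulk move back to the main queue.

```typescript
interface DeadLetter {
  businessKey: string;
  errorClass: "transient" | "deterministic" | "bad_data" | "unknown";
  attempts: number;
  handlerIsReplaySafe: boolean;
}

type TriageAction = "replay" | "fix_code_first" | "fix_data" | "escalate";

interface ReplayAudit {
  operator: string;
  reason: string;
  at: string; // ISO 8601 timestamp
  businessKeys: string[];
}

function triage(message: DeadLetter): TriageAction {
  // Poison message: replaying before a code change fails the same way again.
  if (message.errorClass === "deterministic") return "fix_code_first";
  if (message.errorClass === "bad_data") return "fix_data";
  if (message.errorClass === "transient" && message.handlerIsReplaySafe) return "replay";
  return "escalate";
}

function planRedrive(
  messages: DeadLetter[],
  operator: string,
  reason: string,
  now: Date,
): { replay: DeadLetter[]; held: DeadLetter[]; audit: ReplayAudit } {
  const replay = messages.filter((m) => triage(m) === "replay");
  const held = messages.filter((m) => triage(m) !== "replay");
  const audit: ReplayAudit = {
    operator,
    reason,
    at: now.toISOString(),
    businessKeys: replay.map((m) => m.businessKey),
  };
  return { replay, held, audit };
}
```

Messages in `held` stay in the DLQ until someone fixes the data, ships a code change, or escalates.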

Reconciliation Is Not Optional For Important Workflows

Retries help with many transient failures. They do not catch every ambiguous outcome.

Important workflows need reconciliation.

Examples:

payments marked paid without receipt delivery
invoices generated but not sent
shipments marked ready without carrier labels
users charged but subscription state not activated
exports completed but no notification sent
webhook events stored but downstream work missing

Reconciliation asks:

What should exist if the workflow completed correctly, and what actually exists?

A simple reconciliation query might find paid payments without a sent receipt:

SELECT p.id
FROM payments p
LEFT JOIN email_deliveries d
  ON d.payment_id = p.id
 AND d.email_type = 'receipt'
 AND d.status = 'sent'
WHERE p.status = 'paid'
  AND d.payment_id IS NULL
  AND p.paid_at < now() - interval '10 minutes';

That query is not a queue feature. It is a business-correctness feature.

It gives the system a second chance to notice work that fell between a queue retry, a worker crash, a deployment, or an ambiguous provider response.
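The same rule can be expressed over in-memory rows, which makes it easy to unit-test before it runs against production data. This is a sketch only: `PaymentRow`, `ReceiptDelivery`, and `findPaymentsMissingReceipt` are hypothetical names that mirror the columns used in the query above.

```typescript
interface PaymentRow {
  id: string;
  status: string;
  paidAtMs: number; // epoch milliseconds
}

interface ReceiptDelivery {
  paymentId: string;
  emailType: string;
  status: string;
}

// Returns IDs of paid payments older than the grace period with no sent receipt.
function findPaymentsMissingReceipt(
  payments: PaymentRow[],
  deliveries: ReceiptDelivery[],
  nowMs: number,
  graceMs: number = 10 * 60 * 1000,
): string[] {
  const paymentsWithSentReceipt = new Set(
    deliveries
      .filter((d) => d.emailType === "receipt" && d.status === "sent")
      .map((d) => d.paymentId),
  );

  return payments
    .filter((p) => p.status === "paid")
    .filter((p) => p.paidAtMs < nowMs - graceMs)
    .filter((p) => !paymentsWithSentReceipt.has(p.id))
    .map((p) => p.id);
}
```

The grace period keeps the check from flagging payments whose receipt job is simply still in flight.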

Monitor Workflow Health, Not Just Queue Health

Queue dashboards are necessary but incomplete.

Infrastructure metrics tell you whether the worker system is moving:

Metric	What it tells you
Queue depth	How much work is waiting
Oldest job age	Whether work is falling behind
Throughput	How many jobs complete per minute
Retry rate	Whether failures are rising
DLQ count and age	Whether failures are being isolated
Worker concurrency	Whether enough workers are active
Handler duration	Whether jobs are getting slower

Business metrics tell you whether the workflow is correct:

Metric	What it tells you
Duplicate side effects	Idempotency or replay safety is failing
Reconciliation mismatches	Queue motion did not produce expected outcome
Jobs completed without downstream proof	Completion state is too optimistic
Manual replay count	Automation is not recovering cleanly
Deduplication hits	Duplicate delivery is happening and being handled
Age of business workflow	User-visible work is stale even if workers are busy

A queue can be empty because all work succeeded.

It can also be empty because jobs were dropped, dead-lettered, acknowledged too early, or marked complete without the side effect the business actually needed.

Measure both.

Test The Failure Modes

Background-job tests should not only prove the happy path.

They should prove replay safety.

Useful test cases:

handler runs twice for the same business key
worker crashes after side effect but before completion update
dependency returns timeout after committing the side effect
job is retried after a visibility timeout expires
poison message moves to DLQ after the configured threshold
delayed job runs after the underlying record changes
manual replay does not duplicate the side effect

For the receipt email example:

emailClient.send.mockResolvedValueOnce({ providerMessageId: 'msg_123' })

await handleReceiptJob({ paymentId: 'pay_123' })
await handleReceiptJob({ paymentId: 'pay_123' })

expect(emailClient.send).toHaveBeenCalledTimes(1)
expect(await emailDeliveries.find('pay_123', 'receipt')).toMatchObject({
  status: 'sent',
  providerMessageId: 'msg_123',
})

The invariant is more important than the exact implementation:

one paid payment should produce one receipt delivery record and at most one provider send for that receipt.

For the broader shared-state risk, pair this with How to Prevent Race Conditions in Backend Systems.

Rollout Checklist For A New Background Job

Before shipping a production background job, check:

What business key deduplicates the work?
Can the handler run twice without duplicate side effects?
Which state transition proves the job is complete?
What happens if the side effect succeeds but local completion fails?
Which failures are retryable, permanent, or ambiguous?
What is the maximum attempt count?
What backoff and jitter are used?
Does the visibility timeout match measured handler duration?
What sends a message to DLQ?
Who reviews the DLQ?
How are messages replayed safely?
What business metric proves the workflow is correct?
Is there a reconciliation job for important outcomes?
Is the payload versioned enough to survive deploys?
Are correlation IDs or business IDs preserved across the queue boundary?

This checklist is deliberately practical. Background jobs fail in production through small missing decisions, not because teams forgot what a queue is.
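For the payload-versioning item, one possible shape is a parser that upgrades old payloads and rejects unknown ones before the handler runs. The `version` field, the `correlationId` field, and the `legacy:` prefix are assumptions for illustration. The original payload format is not specified here.

```typescript
interface ReceiptJobV2 {
  version: 2;
  paymentId: string;
  correlationId: string;
}

type ParseResult =
  | { ok: true; job: ReceiptJobV2 }
  | { ok: false; reason: string };

function parseReceiptJob(raw: unknown): ParseResult {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, reason: "payload is not an object" };
  }
  const fields = raw as Record<string, unknown>;
  const paymentId = fields["paymentId"];
  const correlationId = fields["correlationId"];
  // Assumption: jobs enqueued before versioning carry no version field.
  const version = fields["version"] === undefined ? 1 : fields["version"];

  if (typeof paymentId !== "string" || paymentId === "") {
    return { ok: false, reason: "missing paymentId" };
  }
  if (version === 1) {
    // Upgrade in place so the handler only ever sees the current shape.
    return {
      ok: true,
      job: { version: 2, paymentId, correlationId: `legacy:${paymentId}` },
    };
  }
  if (version === 2) {
    if (typeof correlationId !== "string") {
      return { ok: false, reason: "v2 payload missing correlationId" };
    }
    return { ok: true, job: { version: 2, paymentId, correlationId } };
  }
  return { ok: false, reason: `unsupported payload version: ${String(version)}` };
}
```

A rejected payload is a validation failure, so it should be marked failed or sent to review rather than retried.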

Final Takeaway

Background jobs are useful because they decouple slow or unreliable work from the request path.

They are dangerous when that decoupling is mistaken for correctness.

Production-safe background jobs need replay-safe handlers, durable business keys, explicit state transitions, bounded retries, visibility timeouts based on real execution, dead-letter triage, reconciliation, and dashboards that measure workflow correctness.

The queue can deliver work. The application still has to make the work safe.