Background Jobs in Production

Moving work off the request path often improves latency immediately. It does not automatically make the workflow correct.

Queues and workers solve one kind of problem well: they decouple user-facing latency from slower or less reliable work.

But they also introduce new failure modes such as duplicate execution, delayed execution, partial completion, retry storms, poison messages, and business workflows that are technically "processed" but still wrong.

That is why background jobs help only when correctness survives replay, delay, and failure.


What Changes Once Work Leaves The Request Path

Background processing usually starts with a good idea: send email later, generate invoices asynchronously, resize images off the request path, index data after commit, or process webhooks and callbacks in workers.

This lowers request latency and often improves resilience.

Then the production surprises begin: a job succeeds twice, a worker crashes after the side effect but before acknowledgment, retries increase pressure on already-degraded dependencies, or queue depth looks healthy while business outcomes are still wrong.

At that point, teams usually realize an important truth:

Moving work to the background changes where failures happen. It does not remove them.


The Most Important Thing To Understand

Most job systems provide some version of at-least-once delivery.

That means the system is willing to redeliver work rather than risk losing it.

This is usually the right trade-off. But it also means your handler may run more than once.

At-least-once delivery does not guarantee exactly-once side effects, ordered execution, atomicity across database writes and external calls, or business correctness by itself.

So if the handler is not idempotent, retries protect infrastructure while damaging the workflow.


A Concrete Example

Suppose a job sends a receipt email after payment:

export async function handlePaymentReceipt(job) {
  const payment = await db.payment.findUnique({
    where: { id: job.paymentId },
  });

  if (!payment) {
    throw new Error('Payment not found');
  }

  // The external side effect happens here...
  await emailClient.send({
    to: payment.customerEmail,
    template: 'receipt',
    data: { paymentId: payment.id },
  });

  // ...but the durable record of that side effect is written separately.
  // A crash between these two steps leaves no evidence the email was sent.
  await db.payment.update({
    where: { id: payment.id },
    data: { receiptSentAt: new Date() },
  });
}

This looks fine until a worker crashes after emailClient.send(...) but before receiptSentAt is written.

Now the queue redelivers the job. The infrastructure is doing the correct thing. The customer still gets two emails.

That is the pattern to remember: delivery guarantees and business correctness are not the same thing.


A Failure Timeline Worth Remembering

One reason background-job incidents are so confusing is that the infrastructure can look healthy while the business workflow is already wrong.

Imagine this sequence:

  1. payment succeeds and the app enqueues send-receipt
  2. the worker sends the email successfully
  3. the process restarts during deploy before durable completion is recorded
  4. the queue redelivers the job because it still looks incomplete
  5. a second worker sends the same receipt again

From the queue's perspective, the system behaved exactly as designed. No work was lost. From the customer's perspective, the workflow was still wrong.

That is why queue incidents are often really state-transition incidents. The useful question is not "did the queue redeliver?" It is "what durable evidence existed that the business effect had already happened?"


What Actually Breaks In Production

The handler runs twice

Common causes include a worker crash before acknowledgment, a visibility timeout that is too short, queue redelivery after a transient network failure, or manual replay.

The result is usually duplicate emails, duplicate charges, repeated external calls, or duplicate ledger or analytics entries.

The job is only partially complete

Examples:

  • database state is updated, but the follow-up event is not published
  • external API call succeeds, but local state is never recorded
  • side effect happens, but completion is not persisted

Now retries happen against ambiguous state.

If the workflow must write database state and later publish an event reliably, the boundary is often better handled with an outbox-style approach. See Transactional Outbox Pattern in Microservices.

Retries amplify degradation

Retries are useful for transient failure. They are dangerous when the dependency is already struggling.

At that point, the system starts spending more capacity on less useful work. This is the same failure pattern discussed in Adding Retries Can Make Outages Worse, When Timeouts Didn’t Prevent Cascading Failures, and Rate Limiting and Backpressure in Microservices.

One poison message distorts the queue

A permanently failing job can keep consuming retries, worker time, and alert noise.

If DLQ handling is weak, the system spends capacity on work that cannot succeed.

Ordering assumptions collapse

Jobs created in a sensible order may not execute in that order. Retries can also change the effective sequence.

If correctness depends on strict ordering, the queue alone is not enough.


The Core Design Rule

Treat background jobs as replayable commands, not as one-time actions.

That changes the design questions. If this job runs twice, what happens? If it runs 10 minutes later, what changed? If it succeeds partially, what does the next attempt see? If ordering changes, which state transitions are still valid?

If those questions do not have clear answers, the workflow is not production-safe yet.


Practical Patterns That Actually Help

Idempotent handlers

A handler should be able to detect whether its business effect already happened.

Useful business keys include payment_id, invoice_id, webhook_event_id, user_id + campaign_id, or subscription_id + billing_cycle.

That key is often more important than the queue message ID.

This is closely related to Idempotency Keys for Duplicate API Requests.

Explicit state transitions

Safer:

  • queued -> processing -> completed
  • pending -> sent
  • created -> indexed

Less safe:

  • "if it looks not done, do the work"

Explicit states make retries, reconciliation, and manual replay easier to reason about.

Bounded retries and backoff

Do not retry forever.

Use a bounded retry count, exponential backoff, jitter, and different policies for transient versus permanent failure.

Retries should be controlled recovery, not an infinite work generator.

Dead-letter queues with an actual triage plan

A DLQ is not a trash folder. It is a signal that some class of work needs investigation, replay, or code change.

If nobody reviews DLQ items, the system is just hiding failure farther away from the user.

Visibility timeout aligned with real execution time

If the timeout is too short, in-flight jobs get redelivered. If it is too long, failed work recovers too slowly.

This value should reflect real handler behavior, not optimistic expectation.

Safe claiming semantics

If multiple workers consume the same pool of work, ownership needs to be explicit.

That is one reason row-claiming patterns like FOR UPDATE SKIP LOCKED matter in database-backed queues. See PostgreSQL Job Queues with SKIP LOCKED.


Reconciliation Is Part Of The Design

For financially or operationally important workflows, retries are not enough by themselves. You also want a way to reconcile expected outcomes with actual outcomes later.

That may mean checking for payments that succeeded without a receipt record, shipments marked ready without a carrier label, or invoices marked sent without a provider message ID.

Reconciliation changes the posture of the system. Instead of hoping retries eventually make reality correct, you gain a deliberate repair path for the cases that fall between retries, restarts, deploys, and operator intervention.

This is one reason explicit state transitions matter so much. If the system can distinguish queued, processing, completed, failed, and perhaps needs_review, operators can repair the workflow without inventing new rules during an incident.


Monitoring: Queue Health Is Not Workflow Health

Infrastructure metrics are necessary but incomplete.

Useful infrastructure signals include queue depth, oldest job age, worker concurrency, retry volume, DLQ volume, and handler latency.

Useful business-correctness signals include duplicate side-effect rate, reconciliation mismatches, jobs marked complete without the expected downstream effect, repeated manual replays for the same workflow, and conflict or deduplication hit rate.

A green queue dashboard can still hide a broken workflow.

This is one reason observability needs to connect infrastructure motion with business outcomes. See Observability vs Logging in Production and Correlation IDs in Microservices.


Common Mistakes

These show up repeatedly in production job systems:

  • assuming successful delivery means successful business execution
  • retrying handlers that are not idempotent
  • acknowledging before durable state reflects the outcome
  • storing too little context in the payload, or too much mutable context
  • treating the DLQ as normal backlog
  • relying on ordering without enforcing it
  • having no reconciliation for financially or operationally important workflows

Most of these are not queue problems. They are workflow-design problems exposed by queue guarantees.


A Practical Rollout Checklist

Before shipping a new background job, check:

  1. What is the business key for deduplication?
  2. Can the handler run twice safely?
  3. Which failures are retryable and which are permanent?
  4. What happens if execution is delayed by minutes or hours?
  5. What happens if the side effect succeeds but acknowledgment or persistence fails?
  6. How will DLQ items be reviewed and replayed?
  7. Which business-level metric shows that the workflow is actually correct?

For high-risk workflows, also ask whether you can reconcile expected versus actual outcomes later, whether downstream systems support idempotency too, and whether retries are bounded enough to avoid amplifying degradation.


Final Thoughts

Background jobs are extremely useful. They reduce latency, absorb spikes, and make many workflows operationally possible.

But they do not make the system correct by themselves.

Reliable background processing comes from the application layer: idempotent handlers, explicit state transitions, bounded retries, safe claiming semantics, and monitoring that measures correctness rather than only queue motion.

That is the difference between a queue that is running and a workflow that remains correct in production.