Background Jobs in Production

Moving work off the request path improves latency, but it also moves failures into places teams observe less directly. Queues, workers, and retries make systems more flexible only if correctness survives duplicate execution and delayed handling.

What Changes Once Work Leaves the Request Path

The API is fast because the expensive work was moved out of the request path.

Emails are sent asynchronously. Invoices are generated by workers. Webhook delivery happens through a queue. Image processing, analytics aggregation, search indexing, and cache warming all run in background jobs.

This usually feels like an architectural improvement. User-facing latency drops. Retries become easier. Failures can be isolated from the request-response cycle.

Then production behavior becomes more complicated than expected.

The queue remains healthy, but duplicate emails are sent. Some jobs technically succeed while leaving the system in the wrong state. Retries recover many failures, but they also repeat side effects. Dead-letter queues grow even though there is no obvious infrastructure outage.

At that point, teams often realize a difficult truth:

Moving work to the background changes where failures happen. It does not remove them.


Why Queues Feel Like Automatic Reliability

Background jobs are often introduced with a straightforward expectation:

If a task is retried until it succeeds, the system becomes more reliable.

That expectation is reasonable.

Queues buffer spikes. Workers can scale independently. At-least-once delivery sounds safer than best-effort execution. Transient failures can be retried without blocking users.

From that perspective, background jobs appear easier to reason about than synchronous request paths. If one attempt fails, another attempt will eventually complete the work.

But that model quietly assumes that “job completed” and “business outcome correct” are the same thing.

In production, they often are not.


What At-Least-Once Delivery Actually Means

At-least-once delivery does not mean:

  • the job runs exactly once
  • side effects happen exactly once
  • messages are processed in order
  • a successful acknowledgment means the whole business operation is correct

It means something narrower:

the system will try hard to deliver the message, even if that causes duplicate processing.

This tradeoff is usually correct. It protects against message loss.

But once duplicate delivery is accepted, correctness must come from somewhere else:

  • idempotent handlers
  • explicit deduplication
  • safe retry boundaries
  • clear ownership of side effects

Without those controls, the queue is reliable while the business workflow is not.


What Actually Breaks in Production

Background jobs fail in ways that are easy to miss in development.

The handler runs twice

A worker processes the message, performs the side effect, then crashes before acknowledgment. The queue redelivers the message.

From the queue’s perspective, this is correct behavior. From the product’s perspective, it may mean:

  • two emails
  • two charges
  • two downstream API calls
  • duplicated analytics or ledger entries

A job is “successful” but incomplete

The handler updates the database, then fails before publishing a follow-up event. Or it calls an external API, then fails before persisting the resulting state.

The worker may retry, but now it is retrying from an ambiguous partial state.

Retries amplify bad conditions

If a dependency is degraded, retries increase queue pressure and worker occupancy at the same time. The system starts doing more work while useful throughput decreases.

This is the same failure pattern described in Adding Retries Can Make Outages Worse, but queues make it easier to hide because the user is no longer waiting directly.

At that point, async systems start needing the same protections as synchronous ones: bounded admission, timeout discipline, and fast failure around unhealthy dependencies. That is closely related to When Timeouts Didn’t Prevent Cascading Failures, Circuit Breaker Pattern in Microservices, and Rate Limiting and Backpressure in Microservices.

One poison message blocks useful work

A single malformed or permanently failing job can consume repeated attempts and distort queue health metrics. If retry policy is weak, the system spends capacity on work that cannot succeed.

Ordering assumptions collapse

Two jobs that were created in a sensible order may not execute in that order. Retries can also reorder outcomes relative to first-attempt processing.

If correctness depends on strict ordering, the queue is not enough on its own.
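
One common mitigation is to carry a version or sequence number on each job and apply it conditionally, so a stale or reordered job becomes a no-op instead of overwriting newer state. A minimal sketch, using a Map as a stand-in for durable storage (the names `applyUpdate`, `entityId`, and `version` are illustrative, not from any specific library):

```javascript
// In-memory stand-in for a durable record store keyed by entity id.
function applyUpdate(store, job) {
  const current = store.get(job.entityId);
  // Reject any job whose version is not strictly newer than stored state,
  // so reordered or redelivered jobs cannot clobber a later update.
  if (current && current.version >= job.version) {
    return false; // stale: skip silently
  }
  store.set(job.entityId, { version: job.version, data: job.data });
  return true;
}
```

In a real database this check would typically be a conditional update (e.g. `WHERE version < ?`) so the comparison and the write are atomic.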


Illustrative Example

Consider a worker that sends a receipt email after payment:

// `db` (an ORM-style client) and `emailClient` are assumed to be
// initialized and imported elsewhere in the codebase.
export async function handlePaymentReceipt(job) {
  const payment = await db.payment.findUnique({
    where: { id: job.paymentId },
  });

  if (!payment) {
    throw new Error('Payment not found');
  }

  // Side effect first...
  await emailClient.send({
    to: payment.customerEmail,
    template: 'receipt',
    data: { paymentId: payment.id },
  });

  // ...then the durable record of it. A crash between these two steps
  // leaves the job eligible for redelivery.
  await db.payment.update({
    where: { id: payment.id },
    data: { receiptSentAt: new Date() },
  });
}

This looks reasonable.

But if the worker crashes after emailClient.send(...) and before receiptSentAt is updated, the job is retried. Now the customer gets two receipts.

The infrastructure behaved correctly. The business workflow did not.
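
A more replay-tolerant variant of the same handler checks durable state before acting. This is a hedged sketch, not a complete fix: `db` and `emailClient` here are simplified stand-ins for the real clients, and a crash between `send` and `markReceiptSent` can still produce one duplicate. What it does guarantee is that retries arriving after the update become no-ops:

```javascript
// Sketch: replay guard in front of the side effect.
// `db` and `emailClient` are hypothetical stand-ins for an ORM and a
// mail provider; the method names are illustrative.
async function handlePaymentReceipt(job, db, emailClient) {
  const payment = await db.findPayment(job.paymentId);
  if (!payment) throw new Error('Payment not found');

  // If a previous attempt already recorded the send, do nothing.
  // This narrows the duplicate window but does not close it entirely.
  if (payment.receiptSentAt) return 'already-sent';

  await emailClient.send({ to: payment.customerEmail, paymentId: payment.id });
  await db.markReceiptSent(payment.id, new Date());
  return 'sent';
}
```

Closing the remaining window usually requires an idempotency key on the email provider side or an outbox-style record written before the send.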


Where Delivery Guarantees Stop Helping

The problem is not background processing itself. The problem is the gap between delivery guarantees and application guarantees.

Delivery and side effects are not atomic

Most queues cannot atomically guarantee all of the following together:

  • handler execution
  • database mutation
  • external API side effect
  • acknowledgment

Some part of the workflow always sits across a failure boundary. That is where duplicates and partial completion come from.

Retries are correctness-sensitive

Retries only help when the handler can safely run again.

That usually requires:

  • idempotent writes
  • deduplication keys
  • state transitions that tolerate repetition
  • external calls that can be replayed safely

If those conditions are missing, retries recover infrastructure while corrupting business behavior.

Queue health can hide product-level failure

Teams often monitor:

  • queue depth
  • worker throughput
  • retry count
  • dead-letter count

Those metrics matter, but they do not answer questions like:

  • Were duplicate emails sent?
  • Did we charge twice?
  • Did one invoice generate multiple ledger entries?
  • Did the downstream system receive contradictory updates?

A green queue dashboard can coexist with incorrect outcomes.

Time changes system behavior

Short delays are easy. Long delays expose more drift:

  • records are deleted before the job runs
  • state has changed since enqueue time
  • retries happen after business context expired
  • dependent entities moved to a different version or schema

Background systems are not only concurrent. They are temporally decoupled, which creates another class of correctness problems.
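
One way to handle temporal decoupling is a staleness guard at the top of the handler: before doing any work, decide whether a delayed job still makes sense. The field names here (`enqueuedAt`, `maxAgeMs`, `updatedAt`) are assumptions for illustration:

```javascript
// Run at the start of a handler: should this delayed job still execute?
function shouldStillRun(job, record, now = Date.now()) {
  if (!record) return false;                             // record deleted since enqueue
  if (record.updatedAt > job.enqueuedAt) return false;   // state moved on; job is stale
  if (now - job.enqueuedAt > job.maxAgeMs) return false; // business context expired
  return true;
}
```

Declining stale work explicitly is usually better than letting an old job overwrite state that changed after it was enqueued.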


The Core Design Rule

Treat background jobs as replayable commands, not as one-time actions.

That changes the design questions:

  • If this job runs twice, what happens?
  • If it runs after 10 minutes, what changed?
  • If it succeeds partially, can another attempt finish safely?
  • If ordering changes, which state transitions remain valid?

If those questions do not have clear answers, the handler is not production-safe yet.


Practical Patterns That Actually Help

Idempotent job handlers

Each job should have a stable business key or deduplication key.

Examples:

  • payment_id
  • invoice_id
  • email_template + user_id + billing_cycle
  • webhook_event_id

That key should allow the handler to detect whether the effect has already been applied.

This is closely related to the API-side patterns in Idempotency Keys for Duplicate API Requests.
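
A sketch of a handler wrapper built on such a key. The `Set` here stands in for a table with a unique index on the deduplication key; in real code the claim would be an insert that fails on conflict, committed in the same transaction as the effect:

```javascript
// Wrap any handler so a job with a previously seen dedup key is skipped.
function makeIdempotent(handler, processed = new Set()) {
  return async (job) => {
    if (processed.has(job.dedupKey)) return 'skipped';
    const result = await handler(job);
    // In production, record the key atomically with the effect itself,
    // not as a separate step after it.
    processed.add(job.dedupKey);
    return result;
  };
}
```

A composite key like `email_template + user_id + billing_cycle` works the same way; what matters is that the key is stable across retries of the same business operation.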

Explicit state transitions

Avoid handlers that infer too much from “current state.”

Safer:

  • pending -> processing -> completed
  • queued -> sent
  • created -> indexed

Less safe:

  • run if “it does not look done yet”

Explicit states make retries and reconciliation easier to reason about.
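
The transition table above can be enforced with a small guard, so a duplicate or out-of-order attempt is rejected rather than applied. This is an illustrative in-memory sketch; in a database it would be a conditional update on the status column:

```javascript
// Allowed transitions; anything not listed is rejected.
const TRANSITIONS = {
  pending: ['processing'],
  processing: ['completed', 'pending'], // back to pending = release for retry
};

function transition(record, to) {
  const allowed = TRANSITIONS[record.status] || [];
  if (!allowed.includes(to)) return false; // duplicate or out-of-order attempt
  record.status = to;
  return true;
}
```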

Retry budgets and backoff

Do not retry forever.

Use:

  • bounded retry counts
  • exponential backoff with jitter
  • different policies for transient vs permanent failures

Retries should be a controlled recovery mechanism, not an infinite work generator.
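
A minimal sketch of a bounded backoff policy with "full jitter" (delay drawn uniformly from zero up to the exponential ceiling); the parameter names and defaults are illustrative:

```javascript
// Returns the next retry delay in ms, or null when the budget is exhausted
// (at which point the job should go to the dead-letter queue).
function nextDelayMs(attempt, { baseMs = 200, capMs = 30_000, maxAttempts = 6 } = {}) {
  if (attempt >= maxAttempts) return null;
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling); // full jitter spreads retry bursts
}
```

Permanent failures (e.g. validation errors) should skip this entirely and dead-letter immediately, since no amount of waiting will make them succeed.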

Dead-letter queues with triage intent

A dead-letter queue is not a trash bin. It is an operational signal that some class of work needs investigation or replay strategy.

If DLQ messages are never reviewed, the queue is only hiding failures farther away from users.

Visibility timeout aligned with real execution time

If the visibility timeout is too short, jobs get redelivered while still in progress. If it is too long, failed work takes too long to recover.

This timeout must reflect real handler behavior, not optimistic estimates.

Outbox-style coordination for event publishing

If a request both updates the database and needs to emit an event, writing the event intention to durable storage as part of the same transaction is often safer than trying to coordinate DB commit and message publish directly.

This does not remove complexity. It moves complexity toward a more explicit and recoverable model.
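
The shape of the pattern, sketched with in-memory structures standing in for a database transaction and a message publisher (names like `completeOrder` and `relayOutbox` are illustrative):

```javascript
// The state change and the event intention are written together,
// standing in for a single DB transaction.
function completeOrder(db, orderId) {
  db.orders.set(orderId, { status: 'completed' });
  db.outbox.push({ type: 'order.completed', orderId, publishedAt: null });
}

// A separate relay publishes unpublished outbox rows. If it crashes after
// publish but before marking the row, the event repeats: at-least-once
// again, so consumers still need deduplication.
function relayOutbox(db, publish) {
  for (const evt of db.outbox) {
    if (evt.publishedAt) continue;
    publish(evt);
    evt.publishedAt = Date.now();
  }
}
```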


What to Monitor Beyond Queue Depth

If you only track infrastructure health, you will miss business-level failure.

Track both:

Infrastructure signals

  • queue depth
  • job age
  • retry volume
  • dead-letter volume
  • worker concurrency
  • handler latency

Business correctness signals

  • duplicate side-effect rate
  • reconciliation mismatches
  • state transition failures
  • jobs completed without expected downstream effect
  • jobs replayed successfully after manual intervention

This is the difference between “the queue is running” and “the workflow is correct.”


Common Mistakes

These patterns repeatedly cause trouble in production job systems:

  • assuming successful delivery means successful business execution
  • retrying non-idempotent handlers
  • storing too little context in the job payload
  • storing too much mutable context in the job payload
  • treating dead-letter queues as normal backlog
  • relying on execution order without enforcing it explicitly
  • acknowledging work before durable state reflects the outcome
  • omitting reconciliation for financially or operationally important workflows

Most of these are not infrastructure problems. They are workflow design problems exposed by infrastructure guarantees.


Practical Rollout Checklist

Before shipping a new background job:

  1. Define the business key for deduplication.
  2. Decide which failures are retryable and which are permanent.
  3. Verify the handler can run twice without corrupting state.
  4. Check what happens if execution pauses for minutes or hours.
  5. Add monitoring for both queue health and business outcome.
  6. Define how DLQ items are triaged and replayed.

For high-risk workflows, also ask:

  1. Can we reconcile expected vs actual outcomes later?
  2. Do external side effects support idempotency or deduplication?
  3. What happens if the worker crashes after the side effect but before acknowledgment?

These questions are where most production failures become visible before launch.


Closing Reflection

Background jobs are often introduced as a way to make systems more reliable. In one sense, they do exactly that: they absorb spikes, decouple latency, and improve delivery resilience.

But at-least-once delivery solves a narrower problem than many teams assume. It makes loss less likely. It does not make duplication, partial completion, or business inconsistency disappear.

Reliable background processing comes from the application layer: idempotent handlers, explicit state transitions, bounded retries, and observability that measures correctness rather than queue motion.

That is the difference between a queue that works and a workflow that remains correct in production.