PostgreSQL Job Queues with SKIP LOCKED

Using PostgreSQL as a job queue can work surprisingly well for many systems. It keeps operational complexity low, keeps transactional state close to the application, and avoids introducing another distributed component too early.

Then concurrency arrives.

One worker picks a job while another picks the same row. Workers block each other on locks instead of doing useful work. Retries create duplicate work. A queue that looked simple in development becomes a coordination problem under load.

This is where FOR UPDATE SKIP LOCKED becomes useful.

Why a Database Job Queue Breaks Naive Assumptions

A common first version of a database-backed queue looks like this:

SELECT id, payload
FROM jobs
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1;

Then the worker updates the row:

UPDATE jobs
SET status = 'processing'
WHERE id = $1;

This looks reasonable until multiple workers run it at the same time.

Several workers can read the same pending row before any of them updates it. Now duplicate processing is possible.

Wrapping the read and update in a transaction does not fix this on its own. A plain SELECT takes no row locks, so duplicates are still possible, and adding FOR UPDATE introduces a new problem: workers block each other while they all wait on the same locked rows. Throughput drops even though more workers were added.

That is the core mistake:

the queue is treated like a list of available rows instead of a shared coordination surface under concurrency.

This is the same broader class of issue behind How to Prevent Race Conditions in Backend Systems. The queue is not failing because PostgreSQL is weak. It is failing because work claiming must be concurrency-safe by design.


What FOR UPDATE SKIP LOCKED Actually Does

FOR UPDATE locks the selected rows so another transaction cannot claim the same rows at the same time.

SKIP LOCKED changes the waiting behavior. Instead of blocking on rows another worker already locked, PostgreSQL skips them and moves on to other eligible rows.

That turns "many workers fighting over the same first row" into "many workers claiming different rows concurrently."

A practical claiming query looks like this:

SELECT id, payload
FROM jobs
WHERE status = 'pending'
  AND run_at <= now()
ORDER BY priority DESC, run_at ASC, id ASC
FOR UPDATE SKIP LOCKED
LIMIT 10;

Inside one transaction, the worker selects claimable jobs, locks them, and then marks them as in progress.

Without SKIP LOCKED, workers often serialize themselves accidentally. With it, they can keep moving as long as other eligible rows exist.


Why This Works Better Than a Plain Status Update

The value is not just "avoiding duplicates." It is also avoiding unnecessary waiting.

Under multiple workers, a queue has two separate problems:

  • more than one worker may try to process the same row
  • workers may waste capacity waiting on locks instead of claiming other work

FOR UPDATE SKIP LOCKED addresses both.

It is effectively a pessimistic-locking strategy specialized for work distribution. If you want the wider locking model behind this trade-off, see Optimistic vs Pessimistic Locking in SQL.


A Practical Jobs Table Design

A minimal queue table often needs more than just status and payload.

CREATE TABLE jobs (
  id BIGSERIAL PRIMARY KEY,
  queue_name TEXT NOT NULL DEFAULT 'default',
  payload JSONB NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending',
  priority INTEGER NOT NULL DEFAULT 0,
  attempts INTEGER NOT NULL DEFAULT 0,
  max_attempts INTEGER NOT NULL DEFAULT 25,
  run_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  locked_at TIMESTAMPTZ,
  locked_by TEXT,
  completed_at TIMESTAMPTZ,
  failed_at TIMESTAMPTZ,
  last_error TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_jobs_claim
  ON jobs (queue_name, status, priority DESC, run_at ASC, id ASC);

CREATE INDEX idx_jobs_stuck
  ON jobs (status, locked_at);

Useful fields:

  • run_at supports delayed jobs and retry backoff
  • attempts and max_attempts bound retry behavior
  • locked_at and locked_by help with observability and stuck-job recovery
  • last_error helps explain why retries are happening
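
Several of these fields compose directly. For example, run_at makes delayed jobs a plain insert against the table above (the payload here is illustrative):

```sql
INSERT INTO jobs (queue_name, payload, run_at)
VALUES ('emails', '{"to": "user@example.com"}', now() + interval '10 minutes');
```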

If background processing is part of a larger reliability model in your system, this should connect naturally with Background Jobs in Production.


How to Claim Jobs Safely in PostgreSQL

The main rule is simple:

claim and mark the job as in progress inside the same transaction.

One common pattern uses a CTE:

WITH picked_jobs AS (
  SELECT id
  FROM jobs
  WHERE queue_name = $1
    AND status = 'pending'
    AND run_at <= now()
  ORDER BY priority DESC, run_at ASC, id ASC
  FOR UPDATE SKIP LOCKED
  LIMIT $2
)
UPDATE jobs
SET status = 'processing',
    locked_at = now(),
    locked_by = $3,
    attempts = attempts + 1
WHERE id IN (SELECT id FROM picked_jobs)
RETURNING id, payload, attempts, max_attempts;

This pattern matters because the claim and the status update happen in one atomic statement, so there is no window between "this row was chosen" and "this row is now owned by a worker."

At that point the application can safely process only the rows it actually claimed.


PostgreSQL Job Queue Example

At a high level, a worker loop often looks like this:

while (true) {
  // Claim a batch; only rows this worker actually locked are returned.
  const jobs = await repo.claimJobs({
    queueName: 'emails',
    workerId,
    batchSize: 10,
  })

  if (jobs.length === 0) {
    // Nothing claimable right now; idle briefly before polling again.
    await sleep(500)
    continue
  }

  for (const job of jobs) {
    try {
      await handleJob(job)
      await repo.markCompleted(job.id)
    } catch (error) {
      // Failure reschedules the job (or marks it failed) per the retry policy.
      await repo.markForRetry(job.id, error)
    }
  }
}

That worker loop is simple on purpose. The important detail is not the loop itself, but the contract around state transitions:

  • only claimed jobs may be processed
  • success moves the job to completed
  • failure either reschedules the job or marks it permanently failed
  • retry timing should be explicit, not immediate by default
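
That contract can be written down as a small pure function. This is a sketch, not part of the loop above; nextStatus and the camel-cased field names are illustrative, and it assumes attempts was already incremented by the claiming query:

```javascript
// Decide the next status for a claimed job after its handler runs.
// Assumes `attempts` was already incremented when the job was claimed,
// as in the claiming query above.
function nextStatus(job, succeeded) {
  if (succeeded) return 'completed'
  if (job.attempts >= job.maxAttempts) return 'failed' // out of retries
  return 'pending' // requeue; the retry update sets a delayed run_at
}
```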

Retries are especially important here. Bad retry policy can turn a degraded dependency into queue amplification, which is the same failure pattern described in Adding Retries Can Make Outages Worse.


SKIP LOCKED Solves Claim Contention, Not All Queue Problems

This is the part teams often miss.

FOR UPDATE SKIP LOCKED is a strong primitive for concurrent claiming. It does not guarantee:

  • exactly-once processing
  • perfect fairness across all jobs
  • protection from poison messages
  • safe external side effects
  • automatic recovery from worker crashes

If a worker sends an email and crashes before marking the job complete, the job may be retried and the email may be sent twice. That is not a PostgreSQL problem. It is an application correctness problem.

This is where idempotent handlers, deduplication keys, and carefully defined side effects still matter. For API-facing workflows, the same idea appears in Idempotency Keys for Duplicate API Requests.


Handle Retries with Delayed Requeue

When a job fails transiently, moving it straight back to pending is often too aggressive. It creates hot-loop retry behavior:

  • dependency is still unhealthy
  • workers keep reclaiming the same jobs
  • useful throughput falls while queue pressure rises

A safer retry update looks like this:

UPDATE jobs
SET status = CASE
      WHEN attempts >= max_attempts THEN 'failed'
      ELSE 'pending'
    END,
    run_at = CASE
      WHEN attempts >= max_attempts THEN run_at
      ELSE now() + interval '30 seconds'
    END,
    locked_at = NULL,
    locked_by = NULL,
    failed_at = CASE
      WHEN attempts >= max_attempts THEN now()
      ELSE failed_at
    END,
    last_error = $2
WHERE id = $1;

In real systems, backoff usually increases with attempt count. That keeps degraded dependencies from being hammered by the queue itself.
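
One common shape for that growth is exponential backoff with a cap and a little jitter. This sketch is one illustrative policy (30-second base, one-hour cap), not the only reasonable one:

```javascript
// Retry delay in seconds that doubles with each attempt, capped at one
// hour, plus up to 10% random jitter so many failing jobs do not
// retry in lockstep.
function retryDelaySeconds(attempts) {
  const base = 30
  const capped = Math.min(base * 2 ** (attempts - 1), 3600)
  return Math.round(capped * (1 + Math.random() * 0.1))
}
```

The retry update above would then use this value in place of the fixed 30-second interval.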


Recover Stuck Jobs After Worker Crashes

If a worker dies after claiming rows, those jobs may remain stuck in processing.

That means every database-backed queue also needs a recovery rule.

For example:

UPDATE jobs
SET status = 'pending',
    locked_at = NULL,
    locked_by = NULL,
    run_at = now()
WHERE status = 'processing'
  AND locked_at < now() - interval '15 minutes';

This is sometimes called a lease timeout model. The lock in PostgreSQL only protects the row while the transaction is open. Once the transaction commits and the worker starts processing, long-running ownership must be represented in application state.

That distinction matters:

  • database row locks protect the claim phase
  • queue state protects the execution phase

If job execution is very long, heartbeats are often safer than one fixed timeout.
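
A heartbeat model adds a timestamp the worker refreshes while it runs, and recovery checks that instead of locked_at alone. This sketch assumes a hypothetical heartbeat_at column that is not in the schema above; timestamps here are epoch milliseconds:

```javascript
// Decide whether a processing job's lease has expired, trusting the
// most recent heartbeat over the original claim time. `heartbeatAt`
// maps to a hypothetical heartbeat_at column that workers refresh
// every few seconds while a long job runs.
function leaseExpired({ lockedAt, heartbeatAt }, now, leaseMs) {
  const lastSeen = heartbeatAt ?? lockedAt
  return now - lastSeen > leaseMs
}
```

Recovery then resets only jobs whose last heartbeat is stale, so a slow-but-alive worker is not preempted by a fixed 15-minute rule.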


Ordering and Fairness Still Need Thought

SKIP LOCKED helps workers avoid waiting, but it can also make queue behavior less fair than teams expect.

For example:

  • rows near the front may be skipped repeatedly if they stay locked a long time
  • one hot queue can starve lower-priority work if workers only scan one index path
  • very large batches can let one worker monopolize the easiest claimable jobs

This usually means queue design needs a few extra decisions:

  • batch size small enough to keep work distributed
  • ordering explicit and stable
  • queue partitioning when different workloads interfere
  • priority used carefully rather than everywhere
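
Queue partitioning can start very small. As a sketch, a worker can rotate its claim attempts across queue names so a hot queue cannot monopolize every claim; makeQueuePicker is an illustrative helper, not an API from above:

```javascript
// Round-robin over queue names so one hot queue cannot absorb all of a
// worker's claim attempts. Each call returns the next queue to claim from.
function makeQueuePicker(queueNames) {
  let next = 0
  return () => {
    const queue = queueNames[next]
    next = (next + 1) % queueNames.length
    return queue
  }
}
```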

If throughput still does not improve after adding more workers, the queue may no longer be the real bottleneck. That broader scaling failure pattern is similar to Why Horizontal Scaling Didn’t Improve Throughput.


When PostgreSQL Is a Good Queue

PostgreSQL-backed queues are often a good fit when:

  • the app already depends heavily on PostgreSQL
  • job volume is moderate
  • transactional coupling with application data is important
  • operational simplicity matters more than broker-level features
  • the team wants one durable system before adding Kafka, RabbitMQ, or SQS

This approach is especially attractive when enqueuing work must happen atomically with a database change. That is one reason it pairs naturally with Transactional Outbox Pattern in Microservices.


When PostgreSQL Is the Wrong Queue

It becomes a weaker fit when:

  • throughput is high enough that queue scanning becomes a major database workload
  • many consumers need independent fan-out semantics
  • strict ordering across partitions matters a lot
  • retention, replay, and stream processing matter more than transactional coupling
  • the database is already the most constrained part of the system

In those cases, introducing a dedicated broker may reduce pressure on the primary database and offer stronger queue-specific controls.


Practical Safeguards for Production

If you build a PostgreSQL queue with SKIP LOCKED, the most useful safeguards are usually:

  • keep claiming transactions short
  • claim in batches, but not oversized batches
  • use delayed retries with backoff
  • track locked_at, locked_by, attempts, and last_error
  • add stuck-job recovery
  • make handlers idempotent where side effects matter
  • monitor queue depth, claim rate, retry rate, failure rate, and job age
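
Most of those signals reduce to a few aggregates over the jobs table. As a sketch over sampled rows (a status field plus a created_at epoch-millisecond timestamp in this illustration):

```javascript
// Summarize queue health from sampled job rows.
// `created_at` and `now` are epoch milliseconds in this sketch.
function queueHealth(rows, now) {
  const pending = rows.filter((r) => r.status === 'pending')
  const failed = rows.filter((r) => r.status === 'failed').length
  const oldest = pending.reduce((min, r) => Math.min(min, r.created_at), Infinity)
  return {
    depth: pending.length,
    failed,
    // age of the oldest pending job in seconds; 0 when nothing is pending
    maxAgeSeconds: pending.length === 0 ? 0 : Math.round((now - oldest) / 1000),
  }
}
```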

The main point is that SKIP LOCKED should be one part of the design, not the whole design.


Final Thought

FOR UPDATE SKIP LOCKED is one of the most practical ways to make a PostgreSQL job queue work under concurrency. It lets workers claim different rows without blocking each other and reduces a large class of duplicate-claim race conditions.

But it is still only the claiming primitive.

Production-safe queues also need retry discipline, idempotent handlers, stuck-job recovery, and clear ownership of side effects. Once those pieces are in place, PostgreSQL can be a very effective queue for a large range of backend systems.