PostgreSQL Job Queues with SKIP LOCKED

PostgreSQL Job Queues with SKIP LOCKED

Using PostgreSQL as a job queue can work well when the queue is close to application data, job volume is moderate, and the team wants durable background work without operating a separate broker yet.

The mistake is treating a PostgreSQL queue as a simple list of rows.

Under real worker concurrency, a queue table is a shared claim surface. Workers can select the same row, wait behind locked rows, retry too aggressively, leave jobs stuck after crashes, or put so much scanning and updating pressure on the primary database that the queue becomes the bottleneck.

FOR UPDATE SKIP LOCKED is useful because it lets many workers claim different rows without blocking on rows another worker has already locked. But it is only the claiming primitive. A production queue also needs the right schema, indexes, retry policy, stuck-job recovery, cleanup, monitoring, and replay-safe handlers.

For related database correctness topics around locking, isolation, migrations, and query performance, see the SQL And Data Correctness hub.


The Naive Queue Claim That Breaks

A first PostgreSQL-backed queue often starts like this:

SELECT id, payload
FROM jobs
WHERE status = 'pending'
ORDER BY created_at ASC
LIMIT 1;

Then the worker updates the row:

UPDATE jobs
SET status = 'processing'
WHERE id = $1;

This has a race.

Worker A                  Database                  Worker B
--------                  --------                  --------
SELECT pending job 42
                                                    SELECT pending job 42
UPDATE job 42 processing
                                                    UPDATE job 42 processing
process job 42
                                                    process job 42

The query read a claimable job, but the claim was not atomic.

Adding a transaction and FOR UPDATE helps duplicate claiming, but it introduces another issue. If every worker waits on the same first locked row, workers serialize behind one hot row even though other pending jobs exist.

That is exactly the shape SKIP LOCKED is meant to avoid.


What SKIP LOCKED Actually Does

PostgreSQL's SELECT documentation describes FOR UPDATE, FOR NO KEY UPDATE, FOR SHARE, and FOR KEY SHARE as locking clauses. With SKIP LOCKED, rows that cannot be locked immediately are skipped. The docs also call out the trade-off: skipping locked rows gives an inconsistent view of the data, which is not suitable for general-purpose queries, but can be used to avoid lock contention with multiple consumers of a queue-like table. See the PostgreSQL SELECT locking clause.

That is the key mental model:

SKIP LOCKED is not for reading a perfectly ordered truth. It is for letting workers find other available work.

Instead of this:

worker A locks job 1
worker B waits for job 1
worker C waits for job 1

you want this:

worker A locks job 1
worker B skips job 1 and locks job 2
worker C skips jobs 1 and 2 and locks job 3

The result is better worker concurrency. It is not exactly-once processing. It is not fair scheduling. It is not a substitute for idempotent handlers.

It solves claim contention.


A Production-Ready Jobs Table

A queue table needs enough state to answer operational questions later.

CREATE TABLE jobs (
  id bigserial PRIMARY KEY,
  queue_name text NOT NULL DEFAULT 'default',
  job_type text NOT NULL,
  payload jsonb NOT NULL,
  status text NOT NULL DEFAULT 'queued',
  priority integer NOT NULL DEFAULT 0,
  attempts integer NOT NULL DEFAULT 0,
  max_attempts integer NOT NULL DEFAULT 10,
  run_at timestamptz NOT NULL DEFAULT now(),
  locked_at timestamptz,
  locked_by text,
  completed_at timestamptz,
  failed_at timestamptz,
  last_error text,
  created_at timestamptz NOT NULL DEFAULT now(),
  updated_at timestamptz NOT NULL DEFAULT now(),
  CONSTRAINT jobs_status_check CHECK (
    status IN ('queued', 'running', 'completed', 'failed', 'discarded')
  )
);

The fields are not decoration:

FieldWhy it matters
queue_nameisolates different worker pools
job_typelets handlers route work without guessing from payload shape
statusmakes lifecycle visible
prioritylets urgent jobs run first within a queue
attemptspowers bounded retries and alerts
max_attemptskeeps poison jobs from looping forever
run_atsupports delayed execution and backoff
locked_atsupports stuck-job recovery
locked_bytells operators which worker owned the attempt
last_errormakes failed jobs debuggable

The exact columns can vary. The shape should not: a production queue needs ownership, retry, timing, and failure state.

For the higher-level handler design around replay safety and business correctness, see Background Jobs in Production.


Index For Runnable Jobs, Not All Jobs

The claim query usually wants a small subset of the table:

  • one queue
  • queued jobs
  • jobs whose run_at is due
  • highest priority first
  • oldest due jobs first

A useful starting index is:

CREATE INDEX jobs_runnable_idx
  ON jobs (queue_name, priority DESC, run_at ASC, id ASC)
  WHERE status = 'queued';

This is a partial index. PostgreSQL's documentation describes partial indexes as indexes over a subset of table rows selected by a predicate, useful when queries repeatedly target only part of the table. See PostgreSQL partial indexes.

For a queue, the interesting subset is usually runnable work. Completed jobs should not make the hot claim index larger forever.

You will often also want:

CREATE INDEX jobs_stuck_idx
  ON jobs (locked_at)
  WHERE status = 'running';

CREATE INDEX jobs_failed_idx
  ON jobs (queue_name, failed_at DESC)
  WHERE status = 'failed';

The goal is not to index every column. The goal is to support the actual operational queries:

  • claim due jobs
  • find stuck jobs
  • inspect failures
  • measure queue age

If the claim query becomes slow, the queue can quietly turn into a database performance incident.


Claim And Mark Jobs In One Statement

The safest claim shape is usually a single transaction that selects, locks, updates ownership, and returns only rows the worker actually owns.

WITH picked AS (
  SELECT id
  FROM jobs
  WHERE queue_name = $1
    AND status = 'queued'
    AND run_at <= now()
  ORDER BY priority DESC, run_at ASC, id ASC
  FOR UPDATE SKIP LOCKED
  LIMIT $2
)
UPDATE jobs
SET status = 'running',
    locked_at = now(),
    locked_by = $3,
    attempts = attempts + 1,
    updated_at = now()
WHERE id IN (SELECT id FROM picked)
RETURNING id, job_type, payload, attempts, max_attempts;

This avoids the gap between "I saw a job" and "I own this job."

The worker should process only the rows returned from this statement.

Keep the claim transaction short. Do not perform the actual job work while holding the row lock. The lock should protect ownership assignment, not the whole external side effect.

PostgreSQL row-level locks are held by transactions. The long-running ownership after the transaction commits is represented by status, locked_at, and locked_by, not by the row lock itself. That distinction matters when designing stuck-job recovery.

For the broader lock behavior, see PostgreSQL's explicit locking documentation.


A Worker Loop That Respects The State Machine

The application loop can stay simple if the state transitions are solid:

while (running) {
  const jobs = await repo.claimJobs({
    queueName: 'emails',
    workerId,
    batchSize: 10,
  })

  if (jobs.length === 0) {
    await sleep(500)
    continue
  }

  for (const job of jobs) {
    try {
      await handlers[job.jobType](job.payload)
      await repo.markCompleted(job.id)
    } catch (error) {
      await repo.markFailedOrRetry(job.id, error)
    }
  }
}

Completion should be explicit:

UPDATE jobs
SET status = 'completed',
    completed_at = now(),
    locked_at = NULL,
    locked_by = NULL,
    updated_at = now()
WHERE id = $1
  AND status = 'running'
  AND locked_by = $2;

The locked_by condition prevents one worker from completing a job that another worker reclaimed after a timeout.

That may feel defensive. It is exactly the sort of defensive boundary production workers need.


Retry With Backoff, Not Hot Loops

When a job fails, do not immediately put it back at the front of the same queue.

That can create hot loops where workers repeatedly hammer the same broken dependency or poison payload.

A safer retry update sets a future run_at:

UPDATE jobs
SET status = CASE
      WHEN attempts >= max_attempts THEN 'failed'
      ELSE 'queued'
    END,
    run_at = CASE
      WHEN attempts >= max_attempts THEN run_at
      ELSE now() + make_interval(secs => LEAST(900, POWER(2, attempts)::int))
    END,
    failed_at = CASE
      WHEN attempts >= max_attempts THEN now()
      ELSE failed_at
    END,
    locked_at = NULL,
    locked_by = NULL,
    last_error = left($2, 2000),
    updated_at = now()
WHERE id = $1
  AND status = 'running'
  AND locked_by = $3;

In a real system, add jitter so many jobs do not wake at the same second.

The important behavior is:

  • transient failures retry later
  • permanent failures stop
  • repeated failures become visible
  • retries do not monopolize the queue

If retries are already causing pressure elsewhere, the failure mode is the same one described in Adding Retries Can Make Outages Worse.


Recover Stuck Jobs With A Lease Rule

If a worker crashes after claiming jobs, those jobs can remain running.

The row lock is gone because the claim transaction committed or the session ended. But the application state still says the job is owned.

You need a lease rule:

UPDATE jobs
SET status = 'queued',
    locked_at = NULL,
    locked_by = NULL,
    run_at = now(),
    updated_at = now()
WHERE status = 'running'
  AND locked_at < now() - interval '15 minutes'
RETURNING id, queue_name, locked_by, locked_at;

This should not be a silent cleanup hidden in a random worker.

Treat stuck-job recovery as an operational path:

  • count recovered jobs
  • log locked_by and locked_at
  • alert if recovery volume rises
  • choose timeout from measured handler duration
  • avoid reclaiming genuinely long-running jobs too early

For long jobs, use heartbeats:

UPDATE jobs
SET locked_at = now(),
    updated_at = now()
WHERE id = $1
  AND status = 'running'
  AND locked_by = $2;

A heartbeat makes ownership visible while the worker is still alive.


Keep Handlers Idempotent

SKIP LOCKED prevents many duplicate claims. It does not make side effects exactly once.

This timeline can still happen:

worker A claims job 42
worker A sends receipt email
worker A crashes before markCompleted
stuck-job recovery requeues job 42
worker B claims job 42
worker B sends receipt email again

The queue did its job. The handler was not replay-safe.

For important side effects, store a business-level idempotency key outside the queue message:

CREATE TABLE email_deliveries (
  payment_id uuid NOT NULL,
  email_type text NOT NULL,
  provider_message_id text,
  sent_at timestamptz,
  PRIMARY KEY (payment_id, email_type)
);

That lets a retried job ask whether the receipt was already sent for payment_id + email_type.

The job row is not the business invariant. The job row is the work request. The business table is what prevents duplicate outcomes.

This is the same reason production workers need replay-safe handlers, not only queue-level retries.


Clean Up Completed Jobs Deliberately

A PostgreSQL queue is a write-heavy table.

Every claim, retry, completion, and failure updates rows and indexes. If completed jobs stay forever in the same hot table, the queue gets slower and larger even when workers are healthy.

Common cleanup strategies:

StrategyWhen it fitsTrade-off
Delete completed jobs after retention windowSimple queues with separate audit logsLess local history
Move completed jobs to archive tableNeed short-term queryable historyMore maintenance
Partition by time or statusHigh volume and predictable retentionMore schema complexity
Keep only failures and recent completionsOperational history matters more than full historyRequires clear retention policy

Do not let cleanup be an afterthought. It affects index size, vacuum work, backup size, and claim-query latency.

At minimum, decide:

  • how long completed jobs stay
  • how long failed jobs stay
  • where audit history lives
  • which tables are in the hot claim path
  • what metric tells you cleanup is falling behind

Monitor Queue Health And Database Pressure

A PostgreSQL queue needs both queue metrics and database metrics.

Queue metrics:

MetricWhy it matters
runnable job countbacklog that workers can claim now
oldest runnable job ageuser-visible delay and starvation
claim ratewhether workers are finding work
completion ratewhether work is finishing
retry ratedependency or handler instability
failed-job countpoison payload or code issues
stuck-job recoveriescrashed workers or bad timeouts

Database metrics:

MetricWhy it matters
claim query latencyqueue overhead on workers
rows scanned per claimindex or predicate mismatch
table and index sizecleanup and bloat pressure
autovacuum activitywrite-heavy table health
lock waitstransaction or query-shape problems
database CPU and I/Oqueue competing with product workload

The dangerous situation is a queue that appears busy while the database is paying for excessive scans, updates, and cleanup.

If the queue shares the primary database with user-facing traffic, treat database pressure as part of the queue's cost.


When PostgreSQL Is A Good Queue

PostgreSQL is often a good queue when:

  • the application already depends on PostgreSQL
  • job volume is moderate
  • operational simplicity matters
  • jobs are closely tied to database state
  • enqueueing work must be atomic with a database transaction
  • workers need durable state more than broker-specific features
  • the team is not ready to operate a dedicated broker well

This is especially useful for internal workflows, notification jobs, small billing tasks, search indexing, webhooks that need durable processing, and outbox relays.

It pairs naturally with the Transactional Outbox Pattern, where publishable rows live in the database and workers claim them safely.


When PostgreSQL Is The Wrong Queue

PostgreSQL becomes a weaker fit when:

  • queue volume is high enough to compete with core database workload
  • fan-out/pub-sub semantics are central
  • strict ordering across partitions matters
  • long retention and replay are core requirements
  • many independent consumer groups need the same events
  • the database is already the system bottleneck
  • queue scans, updates, and cleanup dominate database maintenance

In those cases, Kafka, RabbitMQ, SQS, or another broker may be a better operational boundary.

The point is not "PostgreSQL queues are bad." The point is that the database is already doing important work. A queue table adds more writes, locks, indexes, vacuum pressure, and operational coupling.

Use PostgreSQL when those trade-offs are worth the simplicity.

Move away when the queue deserves its own system.


Production Checklist

Before trusting a PostgreSQL queue in production, check:

  • claim and ownership update happen in one transaction
  • claim query uses a predictable order
  • runnable jobs have a focused index
  • completed jobs do not bloat the hot claim path forever
  • retries use backoff and max attempts
  • stuck jobs have a lease or heartbeat rule
  • handlers are idempotent where side effects matter
  • failed jobs preserve enough error context
  • workers have bounded batch sizes
  • queue metrics and database pressure are monitored
  • cleanup and archival are defined
  • the team knows when PostgreSQL should stop being the queue

FOR UPDATE SKIP LOCKED is the beginning of the design, not the end.


Final Takeaway

FOR UPDATE SKIP LOCKED makes PostgreSQL job queues practical by letting concurrent workers skip locked rows and claim different work.

That solves the row-claiming bottleneck. It does not solve retry policy, stuck-job recovery, handler idempotency, cleanup, ordering, fairness, observability, or database pressure.

A production PostgreSQL queue works when the whole system is designed around those realities: focused schema, correct indexes, atomic claiming, bounded retries, explicit leases, replay-safe handlers, and a clear decision about when a dedicated broker is the better tool.