PostgreSQL Job Queues with SKIP LOCKED

Using PostgreSQL as a job queue can work well when the queue is close to application data, job volume is moderate, and the team wants durable background work without operating a separate broker yet.

The mistake is treating a PostgreSQL queue as a simple list of rows.

Under real worker concurrency, a queue table is a shared claim surface. Workers can select the same row, wait behind locked rows, retry too aggressively, leave jobs stuck after crashes, or put so much scanning and updating pressure on the primary database that the queue becomes the bottleneck.

FOR UPDATE SKIP LOCKED is useful because it lets many workers claim different rows without blocking on rows another worker has already locked. But it is only the claiming primitive. A production queue also needs the right schema, indexes, retry policy, stuck-job recovery, cleanup, monitoring, and replay-safe handlers.

For related database correctness topics around locking, isolation, migrations, and query performance, see the SQL And Data Correctness hub.

The Naive Queue Claim That Breaks

A first PostgreSQL-backed queue often starts like this:

SELECT id, payload
FROM jobs
WHERE status = 'pending'
ORDER BY created_at ASC
LIMIT 1;

Then the worker updates the row:

UPDATE jobs
SET status = 'processing'
WHERE id = $1;

This has a race.

Worker A                  Database                  Worker B
--------                  --------                  --------
SELECT pending job 42
                                                    SELECT pending job 42
UPDATE job 42 processing
                                                    UPDATE job 42 processing
process job 42
                                                    process job 42

The query read a claimable job, but the claim was not atomic.

Adding a transaction and FOR UPDATE helps duplicate claiming, but it introduces another issue. If every worker waits on the same first locked row, workers serialize behind one hot row even though other pending jobs exist.

That is exactly the shape SKIP LOCKED is meant to avoid.

What SKIP LOCKED Actually Does

PostgreSQL's SELECT documentation describes FOR UPDATE, FOR NO KEY UPDATE, FOR SHARE, and FOR KEY SHARE as locking clauses. With SKIP LOCKED, rows that cannot be locked immediately are skipped. The docs also call out the trade-off: skipping locked rows gives an inconsistent view of the data, which is not suitable for general-purpose queries, but can be used to avoid lock contention with multiple consumers of a queue-like table. See the PostgreSQL SELECT locking clause.

That is the key mental model:

SKIP LOCKED is not for reading a perfectly ordered truth. It is for letting workers find other available work.

Instead of this:

worker A locks job 1
worker B waits for job 1
worker C waits for job 1

you want this:

worker A locks job 1
worker B skips job 1 and locks job 2
worker C skips jobs 1 and 2 and locks job 3

The result is better worker concurrency. It is not exactly-once processing. It is not fair scheduling. It is not a substitute for idempotent handlers.

It solves claim contention.

For a mechanism-level comparison of this claim against plain FOR UPDATE and advisory locks, including a runnable pgbench harness, see SKIP LOCKED vs FOR UPDATE vs Advisory Locks in Postgres Queues.

A Production-Ready Jobs Table

A queue table needs enough state to answer operational questions later.

CREATE TABLE jobs (
  id bigserial PRIMARY KEY,
  queue_name text NOT NULL DEFAULT 'default',
  job_type text NOT NULL,
  payload jsonb NOT NULL,
  status text NOT NULL DEFAULT 'queued',
  priority integer NOT NULL DEFAULT 0,
  attempts integer NOT NULL DEFAULT 0,
  max_attempts integer NOT NULL DEFAULT 10,
  run_at timestamptz NOT NULL DEFAULT now(),
  locked_at timestamptz,
  locked_by text,
  completed_at timestamptz,
  failed_at timestamptz,
  last_error text,
  created_at timestamptz NOT NULL DEFAULT now(),
  updated_at timestamptz NOT NULL DEFAULT now(),
  CONSTRAINT jobs_status_check CHECK (
    status IN ('queued', 'running', 'completed', 'failed', 'discarded')
  )
);

The fields are not decoration:

Field	Why it matters
`queue_name`	isolates different worker pools
`job_type`	lets handlers route work without guessing from payload shape
`status`	makes lifecycle visible
`priority`	lets urgent jobs run first within a queue
`attempts`	powers bounded retries and alerts
`max_attempts`	keeps poison jobs from looping forever
`run_at`	supports delayed execution and backoff
`locked_at`	supports stuck-job recovery
`locked_by`	tells operators which worker owned the attempt
`last_error`	makes failed jobs debuggable

The exact columns can vary. The shape should not: a production queue needs ownership, retry, timing, and failure state.

For the higher-level handler design around replay safety and business correctness, see Background Jobs in Production.

Index For Runnable Jobs, Not All Jobs

The claim query usually wants a small subset of the table:

one queue
queued jobs
jobs whose run_at is due
highest priority first
oldest due jobs first

A useful starting index is:

CREATE INDEX jobs_runnable_idx
  ON jobs (queue_name, priority DESC, run_at ASC, id ASC)
  WHERE status = 'queued';

This is a partial index. PostgreSQL's documentation describes partial indexes as indexes over a subset of table rows selected by a predicate, useful when queries repeatedly target only part of the table. See PostgreSQL partial indexes.

For a queue, the interesting subset is usually runnable work. Completed jobs should not make the hot claim index larger forever.

You will often also want:

CREATE INDEX jobs_stuck_idx
  ON jobs (locked_at)
  WHERE status = 'running';

CREATE INDEX jobs_failed_idx
  ON jobs (queue_name, failed_at DESC)
  WHERE status = 'failed';

The goal is not to index every column. The goal is to support the actual operational queries:

claim due jobs
find stuck jobs
inspect failures
measure queue age

If the claim query becomes slow, the queue can quietly turn into a database performance incident.

Claim And Mark Jobs In One Statement

The safest claim shape is usually a single transaction that selects, locks, updates ownership, and returns only rows the worker actually owns.

WITH picked AS (
  SELECT id
  FROM jobs
  WHERE queue_name = $1
    AND status = 'queued'
    AND run_at <= now()
  ORDER BY priority DESC, run_at ASC, id ASC
  FOR UPDATE SKIP LOCKED
  LIMIT $2
)
UPDATE jobs
SET status = 'running',
    locked_at = now(),
    locked_by = $3,
    attempts = attempts + 1,
    updated_at = now()
WHERE id IN (SELECT id FROM picked)
RETURNING id, job_type, payload, attempts, max_attempts;

This avoids the gap between "I saw a job" and "I own this job."

The worker should process only the rows returned from this statement.

Keep the claim transaction short. Do not perform the actual job work while holding the row lock. The lock should protect ownership assignment, not the whole external side effect.

PostgreSQL row-level locks are held by transactions. The long-running ownership after the transaction commits is represented by status, locked_at, and locked_by, not by the row lock itself. That distinction matters when designing stuck-job recovery.

For the broader lock behavior, see PostgreSQL's explicit locking documentation.

A Worker Loop That Respects The State Machine

The application loop can stay simple if the state transitions are solid:

while (running) {
  const jobs = await repo.claimJobs({
    queueName: 'emails',
    workerId,
    batchSize: 10,
  })

  if (jobs.length === 0) {
    await sleep(500)
    continue
  }

  for (const job of jobs) {
    try {
      await handlers[job.jobType](job.payload)
      await repo.markCompleted(job.id)
    } catch (error) {
      await repo.markFailedOrRetry(job.id, error)
    }
  }
}

Completion should be explicit:

UPDATE jobs
SET status = 'completed',
    completed_at = now(),
    locked_at = NULL,
    locked_by = NULL,
    updated_at = now()
WHERE id = $1
  AND status = 'running'
  AND locked_by = $2;

The locked_by condition prevents one worker from completing a job that another worker reclaimed after a timeout.

That may feel defensive. It is exactly the sort of defensive boundary production workers need.

Retry With Backoff, Not Hot Loops

When a job fails, do not immediately put it back at the front of the same queue.

That can create hot loops where workers repeatedly hammer the same broken dependency or poison payload.

A safer retry update sets a future run_at:

UPDATE jobs
SET status = CASE
      WHEN attempts >= max_attempts THEN 'failed'
      ELSE 'queued'
    END,
    run_at = CASE
      WHEN attempts >= max_attempts THEN run_at
      ELSE now() + make_interval(secs => LEAST(900, POWER(2, attempts)::int))
    END,
    failed_at = CASE
      WHEN attempts >= max_attempts THEN now()
      ELSE failed_at
    END,
    locked_at = NULL,
    locked_by = NULL,
    last_error = left($2, 2000),
    updated_at = now()
WHERE id = $1
  AND status = 'running'
  AND locked_by = $3;

In a real system, add jitter so many jobs do not wake at the same second.

The important behavior is:

transient failures retry later
permanent failures stop
repeated failures become visible
retries do not monopolize the queue

If retries are already causing pressure elsewhere, the failure mode is the same one described in Adding Retries Can Make Outages Worse.

Recover Stuck Jobs With A Lease Rule

If a worker crashes after claiming jobs, those jobs can remain running.

The row lock is gone because the claim transaction committed or the session ended. But the application state still says the job is owned.

You need a lease rule:

UPDATE jobs
SET status = 'queued',
    locked_at = NULL,
    locked_by = NULL,
    run_at = now(),
    updated_at = now()
WHERE status = 'running'
  AND locked_at < now() - interval '15 minutes'
RETURNING id, queue_name, locked_by, locked_at;

This should not be a silent cleanup hidden in a random worker.

Treat stuck-job recovery as an operational path:

count recovered jobs
log locked_by and locked_at
alert if recovery volume rises
choose timeout from measured handler duration
avoid reclaiming genuinely long-running jobs too early

For long jobs, use heartbeats:

UPDATE jobs
SET locked_at = now(),
    updated_at = now()
WHERE id = $1
  AND status = 'running'
  AND locked_by = $2;

A heartbeat makes ownership visible while the worker is still alive.

Keep Handlers Idempotent

SKIP LOCKED prevents many duplicate claims. It does not make side effects exactly once.

This timeline can still happen:

worker A claims job 42
worker A sends receipt email
worker A crashes before markCompleted
stuck-job recovery requeues job 42
worker B claims job 42
worker B sends receipt email again

The queue did its job. The handler was not replay-safe.

For important side effects, store a business-level idempotency key outside the queue message:

CREATE TABLE email_deliveries (
  payment_id uuid NOT NULL,
  email_type text NOT NULL,
  provider_message_id text,
  sent_at timestamptz,
  PRIMARY KEY (payment_id, email_type)
);

That lets a retried job ask whether the receipt was already sent for payment_id + email_type.

The job row is not the business invariant. The job row is the work request. The business table is what prevents duplicate outcomes.

This is the same reason production workers need replay-safe handlers, not only queue-level retries.

Clean Up Completed Jobs Deliberately

A PostgreSQL queue is a write-heavy table.

Every claim, retry, completion, and failure updates rows and indexes. If completed jobs stay forever in the same hot table, the queue gets slower and larger even when workers are healthy.

Common cleanup strategies:

Strategy	When it fits	Trade-off
Delete completed jobs after retention window	Simple queues with separate audit logs	Less local history
Move completed jobs to archive table	Need short-term queryable history	More maintenance
Partition by time or status	High volume and predictable retention	More schema complexity
Keep only failures and recent completions	Operational history matters more than full history	Requires clear retention policy

Do not let cleanup be an afterthought. It affects index size, vacuum work, backup size, and claim-query latency.

At minimum, decide:

how long completed jobs stay
how long failed jobs stay
where audit history lives
which tables are in the hot claim path
what metric tells you cleanup is falling behind

Monitor Queue Health And Database Pressure

A PostgreSQL queue needs both queue metrics and database metrics.

Queue metrics:

Metric	Why it matters
runnable job count	backlog that workers can claim now
oldest runnable job age	user-visible delay and starvation
claim rate	whether workers are finding work
completion rate	whether work is finishing
retry rate	dependency or handler instability
failed-job count	poison payload or code issues
stuck-job recoveries	crashed workers or bad timeouts

Database metrics:

Metric	Why it matters
claim query latency	queue overhead on workers
rows scanned per claim	index or predicate mismatch
table and index size	cleanup and bloat pressure
autovacuum activity	write-heavy table health
lock waits	transaction or query-shape problems
database CPU and I/O	queue competing with product workload

The dangerous situation is a queue that appears busy while the database is paying for excessive scans, updates, and cleanup.

If the queue shares the primary database with user-facing traffic, treat database pressure as part of the queue's cost.

When PostgreSQL Is A Good Queue

PostgreSQL is often a good queue when:

the application already depends on PostgreSQL
job volume is moderate
operational simplicity matters
jobs are closely tied to database state
enqueueing work must be atomic with a database transaction
workers need durable state more than broker-specific features
the team is not ready to operate a dedicated broker well

This is especially useful for internal workflows, notification jobs, small billing tasks, search indexing, webhooks that need durable processing, and outbox relays.

It pairs naturally with the Transactional Outbox Pattern, where publishable rows live in the database and workers claim them safely.

When PostgreSQL Is The Wrong Queue

PostgreSQL becomes a weaker fit when:

queue volume is high enough to compete with core database workload
fan-out/pub-sub semantics are central
strict ordering across partitions matters
long retention and replay are core requirements
many independent consumer groups need the same events
the database is already the system bottleneck
queue scans, updates, and cleanup dominate database maintenance

In those cases, Kafka, RabbitMQ, SQS, or another broker may be a better operational boundary.

The point is not "PostgreSQL queues are bad." The point is that the database is already doing important work. A queue table adds more writes, locks, indexes, vacuum pressure, and operational coupling.

Use PostgreSQL when those trade-offs are worth the simplicity.

Move away when the queue deserves its own system.

Production Checklist

Before trusting a PostgreSQL queue in production, check:

claim and ownership update happen in one transaction
claim query uses a predictable order
runnable jobs have a focused index
completed jobs do not bloat the hot claim path forever
retries use backoff and max attempts
stuck jobs have a lease or heartbeat rule
handlers are idempotent where side effects matter
failed jobs preserve enough error context
workers have bounded batch sizes
queue metrics and database pressure are monitored
cleanup and archival are defined
the team knows when PostgreSQL should stop being the queue

FOR UPDATE SKIP LOCKED is the beginning of the design, not the end.

Final Takeaway

FOR UPDATE SKIP LOCKED makes PostgreSQL job queues practical by letting concurrent workers skip locked rows and claim different work.

That solves the row-claiming bottleneck. It does not solve retry policy, stuck-job recovery, handler idempotency, cleanup, ordering, fairness, observability, or database pressure.

A production PostgreSQL queue works when the whole system is designed around those realities: focused schema, correct indexes, atomic claiming, bounded retries, explicit leases, replay-safe handlers, and a clear decision about when a dedicated broker is the better tool.