How to Find and Fix Slow SQL Queries in Production

How to Find and Fix Slow SQL Queries in Production

Slow SQL queries are hard to fix when the investigation starts with the query a developer suspects instead of the query production is actually running.

When an endpoint slows down, the useful question is not "which index should we add?" It is "which SQL statement is doing the work, how often does it run, what plan does the database choose, and what would make that plan cheaper under real data?"

That distinction matters because many SQL fixes are plausible but wrong. An index can make writes slower without changing the selected plan. A cache can hide a bad access pattern while making correctness harder. A read replica can move some traffic while the real bottleneck stays on the primary.

The reliable workflow is narrower: find the real query shape, rank it by production impact, inspect the plan, change one thing, and verify the whole endpoint or job afterward.

For related articles on indexes, replicas, isolation, migrations, and locking, see the SQL And Data Correctness hub.


What A Slow SQL Query Means In Production

A slow SQL query is not just a statement with a large duration in one trace.

In production, a query becomes a real problem when it consumes enough database work to affect a user workflow, background job, or shared resource. That can happen in several ways:

SymptomWhat It Usually Means
One endpoint has high p95 latencyA specific query shape may be expensive for that route
Database CPU stays highMany calls, expensive joins, sorting, or aggregation may dominate work
Connection pool saturation risesQueries may be holding connections too long under concurrency
Slowdowns appear only for large tenantsData distribution or pagination depth may be the real trigger
Queries return few rows but scan manyFiltering, indexing, or selectivity is probably mismatched
Latency spikes during writesLock waits, vacuum pressure, replication, or IO may be involved

This is why a query that takes 35 ms but runs 30,000 times per hour can matter more than a query that takes 900 ms once. Duration is only one dimension. Frequency, rows touched, wait behavior, and workflow impact are the rest of the story.

If the slowdown includes lock waits or transaction behavior, query tuning may not be enough by itself. That boundary is covered in SQL Isolation Levels Explained.


The Production Workflow

Use this sequence before choosing a fix:

  1. Identify the query shape from production evidence.
  2. Connect it to the endpoint, job, or workflow users experience.
  3. Rank the query by total impact, not only average duration.
  4. Capture the plan safely.
  5. Read the plan for row counts, filters, sorts, joins, loops, and buffers.
  6. Apply the smallest fix that matches the plan.
  7. Compare before and after behavior at both query and workflow level.

The order is important. If you skip straight to step 6, you are optimizing a guess.


Step 1: Find The Query Shape Before Opening The ORM

Start from observed production behavior.

Useful sources include slow query logs, APM database spans, trace samples, endpoint-level latency broken down by database time, and database statistics. For PostgreSQL, pg_stat_statements is often the best first shortlist because it tracks planning and execution statistics for normalized SQL statements when the extension is enabled.

A practical starting query looks like this:

SELECT
  queryid,
  calls,
  round(total_exec_time::numeric, 1) AS total_ms,
  round(mean_exec_time::numeric, 1) AS mean_ms,
  rows,
  shared_blks_hit,
  shared_blks_read,
  temp_blks_written,
  query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

The goal is not to immediately fix the top row. The goal is to build a shortlist:

  • statements that consume the most total database time
  • statements that are individually slow enough to affect p95 latency
  • statements that run very frequently
  • statements that read many blocks or write temporary blocks
  • statements whose row count looks suspicious for the user workflow

Do not rank by mean_exec_time alone. A rare administrative query can look dramatic while a modest query running thousands of times per minute is the actual production pressure.

Also avoid reading ORM code first. ORM code can be misleading because one line of application code may produce one query, many queries, different SQL for different parameters, or a query shape that changes when eager loading is added.


Step 2: Connect The Query To A User Workflow

A query is only actionable when you know what causes it.

For each candidate query, identify:

QuestionWhy It Matters
Which endpoint, resolver, job, or batch process runs it?You need a workflow-level success metric after the fix
How often does it run per request?One slow query and 80 small queries require different fixes
Which tenants, accounts, or data shapes make it slower?Production skew often explains why local tests looked fine
Does it slow down under concurrency?The problem may be locks, queues, or pool saturation
Does it return many rows or only inspect many rows?The fix may be pagination, indexing, or query shape

This connection step prevents a common mistake: improving a query benchmark while the user-facing path stays slow.

For example, if an orders endpoint issues one slow order query and then performs an N+1 customer lookup for every row, the database plan is only half the problem. That access-pattern failure is covered directly in N+1 Query Problem in ORMs.


Step 3: Capture The Plan Safely

Once the query shape is clear, inspect the execution plan.

For PostgreSQL, EXPLAIN shows the execution plan the planner chooses. The documentation is explicit that EXPLAIN (ANALYZE) actually executes the statement and adds runtime statistics, so use it carefully. For writes, run it only in a safe environment or inside a transaction you roll back:

BEGIN;

EXPLAIN (ANALYZE, BUFFERS)
UPDATE invoices
SET status = 'expired'
WHERE due_at < now()
  AND status = 'open';

ROLLBACK;

For read queries, a useful default is:

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, customer_id, total_cents, created_at
FROM orders
WHERE account_id = $1
  AND status = 'paid'
  AND archived_at IS NULL
ORDER BY created_at DESC
LIMIT 50;

ANALYZE lets you compare estimates with actual behavior. BUFFERS shows how many shared blocks were hit, read, dirtied, or written, which helps distinguish CPU-heavy work from IO-heavy work. PostgreSQL's Using EXPLAIN guide is worth reading because plans are trees, not flat lists: scan nodes appear below, and sort, join, aggregate, or limit nodes sit above them.

For intermittent production-only slowdowns, PostgreSQL's auto_explain module can log plans for slow statements automatically. Use it deliberately because plan logging and per-node timing can add overhead. It is a diagnostic tool, not something to enable noisily and forget.


How To Read A Query Plan

Do not try to understand every line at once. Read from the bottom of the plan upward, because lower nodes produce rows for higher nodes.

Start with these signals:

Plan SignalWhat To Ask
Seq Scan, Index Scan, Bitmap Heap ScanHow is the table being accessed?
cost=... rows=...What did the planner expect?
actual time=... rows=... loops=...What actually happened at runtime?
Rows Removed by FilterHow much data was read and discarded?
Index Cond vs FilterWhich predicates were handled by the index and which happened afterward?
Sort Method, temp read, temp writtenDid sorting spill or dominate the query?
Nested Loop, Hash Join, Merge JoinDid a join strategy multiply work unexpectedly?
Buffers: shared hit/readWas the query mostly cached, or did it read many pages?

The most useful plan-reading habit is comparing expected rows with actual rows.

If PostgreSQL expected 200 rows and got 180, the planner had a decent picture. If it expected 200 rows and got 180,000, the plan may be wrong because the database had a bad estimate of selectivity, correlation, or data distribution.

That mismatch is one reason adding an index sometimes appears to do nothing. The index exists, but the planner either does not believe it is useful for this query or the index does not match the actual predicates and sort order. That failure mode is explored in Why Database Indexes Didn't Fix Your Slow Query.


A Concrete Example: The Orders Endpoint

Suppose /accounts/:accountId/orders becomes slow for large accounts. The endpoint returns the most recent paid orders.

The table has grown for years:

CREATE TABLE orders (
  id bigserial PRIMARY KEY,
  account_id uuid NOT NULL,
  customer_id uuid NOT NULL,
  status text NOT NULL,
  total_cents integer NOT NULL,
  created_at timestamptz NOT NULL,
  archived_at timestamptz
);

CREATE INDEX orders_account_id_idx
  ON orders (account_id);

The query looks harmless:

SELECT id, customer_id, total_cents, created_at
FROM orders
WHERE account_id = $1
  AND status = 'paid'
  AND archived_at IS NULL
ORDER BY created_at DESC
LIMIT 50;

There is already an index on account_id, so the first instinct may be to look elsewhere. But the plan tells a more specific story:

Limit  (cost=31312.44..31312.56 rows=50 width=40)
       (actual time=346.221..346.235 rows=50 loops=1)
  Buffers: shared hit=17410 read=7289
  ->  Sort  (cost=31312.44..31382.66 rows=28088 width=40)
            (actual time=346.219..346.226 rows=50 loops=1)
        Sort Key: created_at DESC
        Sort Method: top-N heapsort  Memory: 33kB
        Buffers: shared hit=17410 read=7289
        ->  Bitmap Heap Scan on orders
              (cost=1964.37..30379.18 rows=28088 width=40)
              (actual time=18.904..337.812 rows=18422 loops=1)
              Recheck Cond: (account_id = '...'::uuid)
              Filter: ((archived_at IS NULL) AND (status = 'paid'::text))
              Rows Removed by Filter: 154328
              Heap Blocks: exact=24183
              Buffers: shared hit=17407 read=7289
              ->  Bitmap Index Scan using orders_account_id_idx
                    (cost=0.00..1957.35 rows=172750 width=0)
                    (actual time=9.781..9.782 rows=172750 loops=1)
                    Index Cond: (account_id = '...'::uuid)
Planning Time: 0.312 ms
Execution Time: 346.284 ms

This plan says several useful things:

  • the database did use the existing account_id index
  • the index still produced 172,750 candidate rows
  • the status and archived_at predicates were applied after fetching heap rows
  • 154,328 rows were read and discarded
  • the remaining 18,422 rows still had to be sorted by created_at
  • the query returned only 50 rows but touched a lot of data to find them

So the problem is not simply "missing index." The problem is that the existing index matches only one part of the query. It does not match the filter and ordering pattern the endpoint actually needs.


Apply One Fix That Matches The Plan

For this endpoint, a better index should match the stable predicates and the sort order.

One option is a partial index for active paid orders:

CREATE INDEX CONCURRENTLY orders_paid_recent_by_account_idx
  ON orders (account_id, created_at DESC)
  INCLUDE (customer_id, total_cents)
  WHERE status = 'paid'
    AND archived_at IS NULL;

This is not a universal index recipe. It is appropriate only if status = 'paid' and archived_at IS NULL are stable, common predicates for this read path. If the endpoint supports many statuses, a broader composite index such as (account_id, status, created_at DESC) may be more appropriate.

After deploying the index safely, rerun the plan:

Limit  (cost=0.43..92.81 rows=50 width=40)
       (actual time=0.841..1.617 rows=50 loops=1)
  Buffers: shared hit=51 read=4
  ->  Index Scan using orders_paid_recent_by_account_idx on orders
        (cost=0.43..51598.24 rows=27923 width=40)
        (actual time=0.839..1.609 rows=50 loops=1)
        Index Cond: (account_id = '...'::uuid)
        Buffers: shared hit=51 read=4
Planning Time: 0.449 ms
Execution Time: 1.653 ms

The important change is not just the lower execution time. The plan changed shape.

The database can now walk the matching account's paid, unarchived orders in created_at DESC order and stop after 50 rows. There is no large bitmap heap scan, no large post-index filter, and no separate sort over thousands of candidate rows.

That is what a good fix looks like: the plan becomes cheaper for a reason you can explain.

If the index is being added to a large or hot production table, rollout safety matters. Use the same discipline you would use for any schema change: create it without blocking writes when your database supports that, watch replication and write latency, and have a rollback path. The broader rollout workflow is covered in Safe Database Migrations in Production.


Choose The Fix From The Evidence

Not every slow SQL query is an indexing problem.

Use the plan and workflow evidence to choose the fix:

EvidenceLikely Fix
Many rows removed by filterAdd or change an index, narrow predicates, or rewrite the filter
Sort dominates before LIMITAdd an index that matches filter plus order, or change pagination
OFFSET grows with page depthUse keyset pagination or a different product flow
One request runs many similar queriesBatch, eager-load, or change the application access pattern
Actual rows are far higher than estimated rowsRefresh statistics, inspect skew, or reconsider predicates
Temporary blocks are writtenReduce candidate rows, reduce selected columns, or tune sort/hash memory carefully
A join multiplies intermediate rowsFilter earlier, index join keys, or split the query deliberately
Query is fast alone but slow under loadLook for lock waits, connection pool pressure, or IO saturation
Query returns too much data by designChange the endpoint contract, add pagination, or precompute a read model

This table is deliberately practical. The best fix is the one that matches the observed mechanism, not the one that sounds like a common performance tip.


Verify More Than The Single Query

After the change, compare four layers:

LayerWhat To Compare
Query planDid the scan, sort, join, row count, or buffer pattern change?
Query timingDid p50, p95, and p99 improve for the query shape?
Workflow latencyDid the endpoint, job, or user path improve?
System pressureDid CPU, IO, pool usage, lock waits, or replica lag improve?

The endpoint matters because database work is often only one part of the request.

For the orders example, the index may reduce the main query from 346 ms to 2 ms. But if the route still performs 50 separate customer lookups afterward, users may see a smaller improvement than the database benchmark suggests.

The system pressure matters because an optimization can move cost elsewhere. An index can improve reads while increasing write amplification. A query rewrite can reduce CPU while increasing memory. A cache can reduce database calls while introducing invalidation and consistency risks.

Treat a slow-query fix as complete only when the user workflow and the shared database health both improved.


Common Mistakes When Fixing Slow SQL

Ranking by average duration only

Average duration hides production impact. Look at total time, calls, p95 or p99 behavior, rows, blocks, and the workflow that triggers the query.

Trusting local data

Development databases are usually too small, too uniform, and too quiet. A query that is instant locally can be expensive for a tenant with millions of rows, different status distribution, or many concurrent requests.

Adding a generic index

An index on one filtered column may not help a query that also filters by status, excludes archived rows, joins another table, and orders by time. Index design has to match the access pattern.

Ignoring row estimates

When estimated rows and actual rows are very different, the planner may choose a plan that looks reasonable from its perspective but performs badly in production. That can point to stale statistics, skewed data, correlated columns, or predicates that are hard to estimate.

Measuring only the fixed query

The article title may say slow SQL query, but the real goal is not a pretty query benchmark. The goal is a faster and more reliable endpoint, job, or product workflow.

Hiding the problem with cache or replicas

Cache and replicas are useful tools, but they should not be the first move when the query shape is still unknown. They can reduce load in some cases, but they can also preserve a bad access pattern and add new correctness constraints.


A Reusable Slow SQL Checklist

Before calling a slow-query fix done, confirm:

  • the query was found from production evidence, not intuition
  • the query was connected to a specific endpoint, job, or workflow
  • frequency and total database time were considered
  • EXPLAIN or EXPLAIN (ANALYZE, BUFFERS) was reviewed safely
  • estimated rows and actual rows were compared
  • filters, sorts, joins, loops, and buffers were inspected
  • the fix matches the plan evidence
  • only one meaningful change was evaluated at a time
  • before and after plans were saved
  • workflow latency improved, not just single-query timing
  • concurrency, write cost, and rollout risk were considered

This checklist keeps SQL performance work honest. It also gives the team a reviewable artifact instead of a vague statement that "we optimized the query."


Takeaway

Slow SQL fixes work best when they are evidence-shaped.

Find the statement production is actually running. Rank it by impact. Read the plan from the bottom up. Look for rows scanned, rows filtered, sorts, joins, loops, and buffers. Then choose the smallest fix that changes the plan for a reason you can explain.

That process is slower than guessing at an index, but it is much faster than shipping three plausible fixes and still not knowing why the endpoint is slow.