
How to Find and Fix Slow SQL Queries in Production
Slow SQL issues usually look obvious only after the real bottleneck has already been measured. Before that, teams often spend time optimizing the wrong query, the wrong index, or the wrong part of the stack.
How Slow Queries First Show Up in Production
An endpoint starts getting slower as traffic grows.
At first, the problem looks small. Median latency still seems acceptable, and most requests complete without obvious failures. Then p95 rises, connection pools stay busy longer, and application timeouts begin appearing in parts of the system that are not obviously related to the database.
At that point, teams often know a query is slow but still do not know why it is slow.
That gap matters. SQL performance work goes wrong when fixes are chosen from intuition alone:
- add an index because slow queries "usually need one"
- add a cache before understanding the access pattern
- move reads to replicas without checking the real bottleneck
- rewrite the ORM query without measuring what changed
The right approach is less dramatic. Treat slow query diagnosis as a production debugging problem.
What "Slow Query" Actually Means
A slow query is not simply a query with a large execution time in isolation.
In practice, a query becomes a problem when it does one or more of the following:
- reads far more rows than it returns
- holds connections long enough to reduce overall throughput
- creates contention under concurrency
- becomes unstable as data volume grows
- forces expensive sorts, joins, or repeated lookups
This distinction matters because a query that takes 150 ms once may be harmless, while a query that takes 40 ms but runs thousands of times per minute can become the main source of database pressure.
Always evaluate query cost in the context of:
- frequency
- concurrency
- rows scanned
- lock behavior
- overall system latency
Step 1: Find the Actual Query
Do not start with the code you suspect. Start with observed behavior.
Useful signals include:
- database slow query logs
- APM traces with database spans
- endpoint-level latency broken down by DB time
- ORM query logs in non-production environments
- query statistics views such as pg_stat_statements
The goal is to identify:
- The exact SQL text or normalized query shape
- How often it runs
- Which endpoint, job, or workflow triggers it
- Whether the latency is stable or spikes under load
If you skip this step, you risk optimizing the wrong query or the right query in the wrong place.
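If slow query logging is not yet enabled, PostgreSQL can log any statement that exceeds a duration threshold. A minimal sketch, assuming superuser access and a 200 ms threshold (pick a value appropriate for your workload):

```sql
-- Log every statement that runs longer than 200 ms.
-- The threshold here is an illustrative assumption.
ALTER SYSTEM SET log_min_duration_statement = '200ms';

-- Apply the setting without restarting the server.
SELECT pg_reload_conf();
```

Other databases have equivalents, such as the MySQL slow query log.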
Step 2: Measure More Than Duration
Execution time alone is not enough.
For each candidate query, capture:
- p50, p95, and p99 latency
- rows returned
- rows scanned or examined
- execution count
- average and peak concurrency
- buffer or cache hit behavior when available
A query that returns 20 rows but scans 500,000 is telling you something important long before users notice a total outage.
If you are using PostgreSQL, a practical workflow is:
```sql
SELECT
  query,
  calls,
  mean_exec_time,
  rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
```
This is not the final diagnosis, but it quickly shows where time is accumulating.
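Ordering by mean time favors rare but slow queries. Sorting the same view by cumulative time instead surfaces cheap queries that run constantly, which are often the real source of pressure. A sketch, assuming pg_stat_statements on PostgreSQL 13 or later (where the column is named total_exec_time):

```sql
-- Queries consuming the most cumulative database time,
-- regardless of per-call latency.
SELECT
  query,
  calls,
  total_exec_time,
  rows / NULLIF(calls, 0) AS avg_rows_per_call
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```

Comparing the two orderings is a quick way to separate "slow" from "expensive in aggregate."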
Step 3: Read the Query Plan, Not Just the SQL
Once you identify the slow query, inspect the execution plan.
Use EXPLAIN to see the expected plan, and EXPLAIN ANALYZE in a safe environment to compare estimates with actual execution behavior.
Example:
```sql
EXPLAIN ANALYZE
SELECT id, user_id, total, created_at
FROM orders
WHERE status = 'completed'
  AND created_at >= NOW() - INTERVAL '30 days'
ORDER BY created_at DESC
LIMIT 50;
```
When reviewing the plan, focus on questions like:
- Is the database doing a sequential scan or index scan?
- How many rows did it expect to read versus actually read?
- Is there an expensive sort step?
- Is a join exploding row count?
- Is one operation dominating total time?
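On PostgreSQL, adding the BUFFERS option to the same statement shows how much data each plan node read from shared buffers versus disk, which helps distinguish CPU-bound sorts from I/O-bound scans. A sketch; actual plan output varies by version and data:

```sql
-- BUFFERS reports shared-buffer hits and reads per plan node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, user_id, total, created_at
FROM orders
WHERE status = 'completed'
  AND created_at >= NOW() - INTERVAL '30 days'
ORDER BY created_at DESC
LIMIT 50;
```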
The query plan is where slow-query debugging becomes concrete.
It is also where many indexing assumptions fall apart. If you have already seen cases where adding an index changed nothing, that is usually because the planner's cost model and the data distribution did not match developer intuition. I covered that failure mode in Why Database Indexes Didn’t Fix Your Slow Query.
Step 4: Check the Most Common Root Causes
Missing or weak indexes
Some queries are slow because the required index truly does not exist.
Others are slow because the index exists but does not narrow the search space enough, does not match the filter and sort pattern, or has become less useful as data distribution changed over time.
Look for:
- filters on unindexed columns
- ORDER BY that does not align with index order
- composite indexes with the wrong leading columns
- low-selectivity predicates that still require scanning large portions of data
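For the EXPLAIN ANALYZE example above, a composite index that matches both the filter and the sort lets the planner skip the separate sort step entirely. The index name is illustrative, and this assumes the status predicate is selective enough to be worth leading with:

```sql
-- Leading column serves the equality filter; the second column serves
-- the ORDER BY, so matching rows come back already sorted.
CREATE INDEX idx_orders_status_created_at
  ON orders (status, created_at DESC);
```

Always re-run the plan afterward to confirm the planner actually uses it.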
N+1 query behavior
Not all slow database problems come from one obviously expensive SQL statement.
Sometimes the issue is query multiplication: one request triggers dozens or hundreds of individually small queries. That pattern often hides behind ORMs and only becomes visible at scale. If that sounds familiar, see N+1 Query Problem in ORMs.
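At the SQL level, the fix is usually to collapse the per-row lookups into one batched query. Table and column names here are illustrative:

```sql
-- Instead of one query per order:
--   SELECT id, name, email FROM customers WHERE id = $1;  -- repeated N times
-- load all referenced customers in a single round trip:
SELECT id, name, email
FROM customers
WHERE id = ANY($1);  -- $1: array of customer ids collected in the application
```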
Expensive sorts and pagination
Queries that sort large result sets before applying LIMIT can degrade quickly as tables grow.
Offset pagination is another common issue:
```sql
SELECT id, created_at
FROM events
ORDER BY created_at DESC
LIMIT 50 OFFSET 10000;
```
The database still has to walk past the skipped rows. Under large offsets, work rises even though the response size stays small.
Keyset pagination often performs better because it turns "skip many rows" into "continue from a known boundary."
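A keyset version of the query above uses the last row of the previous page as the boundary. This sketch assumes id is used as a tie-breaker so the ordering is total:

```sql
-- $1 and $2 are the created_at and id of the last row on the previous page.
-- The row comparison lets an index on (created_at DESC, id DESC) seek
-- directly to the boundary instead of walking past skipped rows.
SELECT id, created_at
FROM events
WHERE (created_at, id) < ($1, $2)
ORDER BY created_at DESC, id DESC
LIMIT 50;
```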
Join explosion
A query can appear reasonable at the SQL level while producing a much larger intermediate result than expected.
Watch for joins on columns with poor selectivity, missing join indexes, or predicates applied too late in the plan.
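One common repair is to aggregate the many-side before joining, so rows on the one-side are never multiplied. Table names are illustrative:

```sql
-- Aggregating order_items first keeps one output row per order,
-- instead of one row per (order, item) pair.
SELECT o.id, o.total, i.item_count
FROM orders o
JOIN (
  SELECT order_id, COUNT(*) AS item_count
  FROM order_items
  GROUP BY order_id
) i ON i.order_id = o.id;
```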
Stale assumptions about production data
Queries often behave well in development because:
- tables are smaller
- data distributions are unrealistic
- cardinality is low
- concurrency is limited
That is why production-only issues are common. The SQL did not change. The context did.
Step 5: Choose the Right Fix for the Actual Bottleneck
Once the cause is clear, the fix is usually one of a small number of patterns.
Fix query shape
Reduce the amount of work the query asks the database to perform.
Examples:
- select only required columns instead of SELECT *
- move filters earlier
- remove unnecessary joins
- replace repeated lookups with a batched query
- switch from offset pagination to keyset pagination
Fix indexing intentionally
Indexes should reflect the real access path, not just the columns that look important in code review.
A better index is often based on:
- exact filter columns
- sort order
- join conditions
- data selectivity in production
Adding indexes without verifying plan changes only creates maintenance cost and false confidence.
When the fix requires schema changes on large or hot tables, rollout safety matters as much as query design. A correct index can still cause production pain if it is added or enforced without considering locks, backfills, and mixed application versions. I covered that rollout problem in Safe Database Migrations in Production.
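On PostgreSQL, for example, building the index concurrently avoids holding a write-blocking lock for the duration of the build. The index name is illustrative; note that CONCURRENTLY cannot run inside a transaction block and can leave an INVALID index behind if the build fails:

```sql
-- Builds without blocking writes; must run outside a transaction block.
CREATE INDEX CONCURRENTLY idx_orders_status_created_at
  ON orders (status, created_at DESC);

-- A failed concurrent build leaves an INVALID index;
-- check before relying on the new index.
SELECT indexrelid::regclass
FROM pg_index
WHERE NOT indisvalid;
```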
Fix application access patterns
If one page load issues 80 queries, the problem may not be a single bad statement. It may be an ORM usage pattern, resolver structure, or service call graph that multiplies database work.
Fix throughput, not just single-query latency
A query can become "fixed" in a benchmark while still hurting the system under concurrency because it holds locks, consumes memory for sorts, or keeps connections busy.
Always re-evaluate the endpoint or workflow after the query-level change.
If adding more application instances still does not improve throughput after query work, the next bottleneck may be coordination rather than SQL shape. I covered that pattern in Why Horizontal Scaling Didn’t Improve Throughput.
Step 6: Avoid Common Non-Fixes
Some responses feel productive but do not remove the root cause.
Adding caches too early
Caching can reduce read pressure, but it also adds invalidation complexity and can hide an unresolved query-shape problem. In some systems, that makes behavior harder to reason about rather than easier.
Moving reads to replicas without diagnosis
Read replicas can help, but they do not automatically remove the true bottleneck. If the workload is constrained by consistency, write coordination, or fallback-to-primary behavior, load may not move the way architecture diagrams suggest. I covered that in Why Read Replicas Didn’t Reduce Database Load.
Measuring only in development
Development datasets are too small to validate most SQL performance fixes.
If possible, benchmark against production-like cardinality and realistic query mixes.
For API endpoints that depend heavily on database behavior, it is also worth adding integration tests that verify query count, persistence behavior, and contract shape under realistic fixtures. That helps catch regressions before they turn back into production-only debugging. See How to Write API Integration Tests.
Shipping without before-and-after comparison
A performance fix without measurement is just a new assumption.
A Practical Production Workflow
When a slow query appears, use a sequence like this:
- Identify the highest-impact query by latency and frequency.
- Capture the current plan and system-level impact.
- Measure rows scanned, not just rows returned.
- Form one hypothesis about the bottleneck.
- Apply one change.
- Re-run the plan and compare endpoint latency.
- Check for side effects under concurrency.
- Add a regression test, dashboard, or query-budget guard where possible.
This workflow is intentionally narrow. It prevents "performance work" from turning into broad, risky rewrites.
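One lightweight query-budget guard is a statement timeout, so a regressed query fails fast instead of piling up connections. A sketch for PostgreSQL; the role name and the 2-second value are illustrative assumptions:

```sql
-- Any statement from this application role that exceeds 2 seconds
-- is cancelled rather than left holding a connection.
ALTER ROLE app_user SET statement_timeout = '2s';
```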
Example: From Symptom to Fix
Suppose an orders endpoint slows down over time.
The query:
```sql
SELECT id, customer_id, total, created_at
FROM orders
WHERE account_id = $1
ORDER BY created_at DESC
LIMIT 100;
```
What you find:
- latency rises only for large accounts
- plan shows an expensive sort
- index exists on account_id, but not on (account_id, created_at DESC)
- endpoint also loads each customer separately afterward
What actually fixes it:
- add an index that matches the filter and sort
- batch customer loading instead of querying per row
- compare before/after query count and endpoint latency
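For this example, the two changes might look like the following sketch (index and column names are illustrative):

```sql
-- 1. Index matching both the filter and the sort,
--    so large accounts no longer trigger an expensive sort.
CREATE INDEX CONCURRENTLY idx_orders_account_created_at
  ON orders (account_id, created_at DESC);

-- 2. One batched customer lookup instead of one query per order row.
SELECT id, name
FROM customers
WHERE id = ANY($1);  -- $1: distinct customer_ids from the returned orders
```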
This is a good example of why slow-query work is rarely about one magic change. The root cause is often a combination of plan shape and application behavior.
Closing Reflection
Most slow SQL queries are not mysterious. They are just difficult to reason about when teams look only at code and not at execution behavior.
The durable habit is simple:
- find the real query
- inspect the real plan
- measure the real cost
- change one thing at a time
That approach scales better than any single optimization technique.
If your database work keeps turning into guesswork, the problem is usually not lack of tools. It is lack of a consistent debugging process.