
Trace Sampling in Production: What You Lose When You Sample Wrong
Trace sampling in production decides which distributed traces you keep and which traces disappear forever. The wrong sampling policy can make dashboards cheaper while hiding the exact slow request, failed checkout, missing background job, or cross-service timing gap you needed during an incident.
Sampling is not only a storage setting. It is a debugging trade-off.
This article is part of the Observability And Debugging hub. It builds on OpenTelemetry for Backend Engineers, Correlation IDs in Microservices, and Observability vs Logging in Production.
Why Sampling Exists
A busy service can create enormous trace volume.
If every request records every span across every service, the telemetry bill and storage volume can grow faster than the value of the data.
Sampling reduces that volume.
For example:
| Traffic | Spans per request | Full trace volume |
|---|---|---|
| 2,000 requests/sec | 20 spans | 40,000 spans/sec |
| 20,000 requests/sec | 20 spans | 400,000 spans/sec |
| 100,000 requests/sec | 40 spans | 4,000,000 spans/sec |
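The numbers in the table are simple multiplications, and the same arithmetic shows what a head-sampling ratio does to export volume. A minimal sketch (function names are illustrative):

```typescript
// Raw span volume is requests/sec times spans per request.
function spanVolumePerSec(requestsPerSec: number, spansPerRequest: number): number {
  return requestsPerSec * spansPerRequest;
}

// A head-sampling ratio scales the exported volume linearly.
function sampledVolumePerSec(
  requestsPerSec: number,
  spansPerRequest: number,
  ratio: number,
): number {
  return spanVolumePerSec(requestsPerSec, spansPerRequest) * ratio;
}

console.log(spanVolumePerSec(100_000, 40)); // 4,000,000 spans/sec, matching the table
console.log(sampledVolumePerSec(100_000, 40, 0.01)); // a 1% sample still exports ~40,000 spans/sec
```

Even at 1%, the loudest row still produces tens of thousands of spans per second, which is why sampling policy matters more than the raw ratio.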
Sampling is reasonable.
The danger is treating all traces as equally disposable.
A random 1% sample might be enough to see the normal request path. It might not be enough to catch a rare payment timeout, a slow webhook replay, a tenant-specific authorization failure, or an async worker trace that only appears under retry pressure.
Sampling answers:
Which evidence are we willing to lose before we know whether it matters?
Head Sampling Drops Early
Head sampling makes the sampling decision near the start of the trace.
For example, sample 10% of new traces:
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node'
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base'

const provider = new NodeTracerProvider({
  sampler: new TraceIdRatioBasedSampler(0.1),
})
The OpenTelemetry JavaScript sampling docs describe TraceIdRatioBasedSampler as a deterministic percentage-based sampler, configurable through code or environment variables. See OpenTelemetry JavaScript sampling.
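Per the OpenTelemetry SDK environment variable specification, the same ratio can also be set without a code change. A sketch:

```shell
# Configure the ratio-based sampler through standard OpenTelemetry
# environment variables instead of code.
export OTEL_TRACES_SAMPLER=traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1
```

The spec also defines `parentbased_traceidratio`, which combines this ratio with parent-based sampling at the root.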
Head sampling is simple and cheap because unsampled traces usually do not record or export most spans.
But head sampling has one unavoidable limitation:
It decides before it knows whether the request will become interesting.
At the start of a request, you usually do not know whether it will:
- return 500
- exceed the latency SLO
- hit a rare dependency path
- retry a provider call three times
- enqueue a job that later fails
- cross a tenant boundary with bad permissions
If the root decision says "drop," that trace may be gone before the interesting part happens.
Parent-Based Sampling Preserves Trace Shape
Parent-based sampling means child spans follow the sampling decision of the parent trace.
That matters because partial traces can be misleading.
Bad:
checkout-api sampled
payment-api missing
receipt-worker sampled separately
Better:
checkout-api sampled
-> payment-api sampled
-> outbox relay sampled
-> receipt-worker sampled
OpenTelemetry's trace SDK specification lists ParentBased(root=AlwaysOn) as the default sampler shape, and language SDKs commonly recommend combining parent-based sampling with ratio-based root sampling in production.
The key point is practical:
Once a trace is sampled, keep the trace coherent across service boundaries.
Otherwise the trace may show only the easy part of the request and hide the downstream span that made it slow.
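Conceptually, the parent-based decision is a small piece of logic. A minimal sketch of ParentBased(root=ratio) behavior (illustrative only; the real OpenTelemetry SDK samplers implement this, and the names here are hypothetical):

```typescript
type SamplingDecision = 'record_and_sample' | 'drop';

// Sketch of ParentBased(root = TraceIdRatioBased) decision logic.
function parentBasedDecision(
  parentSampled: boolean | null, // null means this span starts a new trace
  rootRatioDecision: (traceId: string) => boolean,
  traceId: string,
): SamplingDecision {
  if (parentSampled !== null) {
    // Child spans inherit the parent's decision, keeping the trace coherent
    // across service boundaries.
    return parentSampled ? 'record_and_sample' : 'drop';
  }
  // Only root spans consult the ratio-based sampler.
  return rootRatioDecision(traceId) ? 'record_and_sample' : 'drop';
}
```

The important property: once the root says "sample," every downstream service agrees, so you never get the misleading partial traces shown above.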
For propagation, the W3C Trace Context standard defines the traceparent header and its sampled flag. That is the metadata downstream services use to understand trace identity and sampling state.
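As a sketch of that metadata, a `traceparent` header has the shape `version-traceid-spanid-flags`, and the lowest bit of the flags byte is the sampled flag. A simplified parser (not a replacement for the SDK's propagator; the function name is illustrative):

```typescript
// Parse a W3C traceparent header: version-traceid-spanid-flags.
function parseTraceparent(header: string) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  return {
    version,
    traceId,
    parentSpanId,
    // Bit 0 of trace-flags is the "sampled" flag.
    sampled: (parseInt(flags, 16) & 0x01) === 0x01,
  };
}

const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
// ctx.sampled is true, so downstream services should keep recording this trace.
```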
Tail Sampling Decides Later
Tail sampling waits until more of the trace is known before deciding whether to keep it.
That enables policies like:
| Keep trace when... | Why |
|---|---|
| request status is 500 | Errors are high debugging value |
| latency exceeds 2 seconds | Slow traces explain user pain |
| span has payment.provider.timeout | Rare dependency failures matter |
| tenant is in a watchlist | Debugging specific customer impact |
| route is low-volume but business-critical | Random sampling may miss it |
Tail sampling is powerful because it can keep unusual traces even when the normal path is heavily sampled.
The cost is operational complexity.
Tail sampling needs a collector or backend that can buffer spans long enough to evaluate the trace. It uses more memory, adds delay before export, and can still lose traces if the collector is overloaded.
Head sampling is cheaper. Tail sampling is more selective.
The right answer is often a combination:
- head sample normal high-volume traces
- always keep obvious errors
- tail sample slow or rare traces
- keep correlation IDs in logs for unsampled traces
What You Lose With A Flat 1% Sample
A flat sample rate treats all requests the same.
That is simple, but production traffic is not uniform.
Imagine this traffic:
| Route | Requests/min | Problem rate | 1% sample keeps |
|---|---|---|---|
| GET /feed | 100,000 | common latency noise | ~1,000 traces |
| POST /checkout | 2,000 | rare payment timeout | ~20 traces |
| POST /webhooks/payment | 200 | duplicate delivery bug | ~2 traces |
| POST /admin/refund | 20 | high business impact | maybe 0 traces |
The flat sample gives you many traces for the loud route and few or none for the routes where one failure matters.
That is not wrong mathematically. It is wrong operationally if your incidents usually come from rare but important paths.
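The "keeps" column in the table is just an expectation: requests per minute times the sample rate. A quick sketch:

```typescript
// Expected traces kept per minute under a flat sample rate.
function expectedKeptPerMinute(requestsPerMinute: number, sampleRate: number): number {
  return requestsPerMinute * sampleRate;
}

expectedKeptPerMinute(100_000, 0.01); // ~1,000/min: plenty for the loud feed route
expectedKeptPerMinute(20, 0.01); // ~0.2/min: a refund trace survives roughly once every five minutes
```

At 0.2 traces per minute, a refund incident that lasts ten minutes may leave you with one or two traces, or none.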
Use route-aware sampling when needed:
function shouldSampleRootSpan(route: string) {
  if (route.startsWith('/admin/refund')) return 1.0
  if (route.startsWith('/webhooks/payment')) return 0.25
  if (route.startsWith('/checkout')) return 0.1
  return 0.01
}
This does not mean every low-volume route gets 100% forever. It means sampling policy should reflect debugging value, not only request volume.
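One way to apply a per-route rate consistently is to derive the decision deterministically from the trace ID, so any service that repeats the computation reaches the same answer. A hedged sketch (helper names are illustrative, not an SDK API):

```typescript
// Hypothetical per-route sampling rates, mirroring the policy above.
function routeSampleRate(route: string): number {
  if (route.startsWith('/admin/refund')) return 1.0;
  if (route.startsWith('/webhooks/payment')) return 0.25;
  if (route.startsWith('/checkout')) return 0.1;
  return 0.01;
}

// Deterministic root decision: map the trace ID into [0, 1) and compare
// against the route's rate. Same trace ID, same decision, everywhere.
function shouldSampleRoot(traceId: string, route: string): boolean {
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return bucket < routeSampleRate(route);
}

shouldSampleRoot('4bf92f3577b34da6a3ce929d0e0e4736', '/admin/refund'); // true: rate 1.0 keeps everything
```

This is the same idea `TraceIdRatioBasedSampler` uses, extended with a route lookup at the root span.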
Keep Errors And Slow Requests
If you can tail sample, keep traces that are already known to be valuable:
- failed requests
- slow requests
- dependency timeouts
- retries over a threshold
- queue processing that exceeds the lease
- traces with unusual status transitions
- traces with explicit debug headers from trusted internal tools
Example policy:
tail_sampling:
  policies:
    - name: keep-errors
      type: status_code
      status_code:
        status_codes: [ERROR]
    - name: keep-slow-checkout
      type: latency
      latency:
        threshold_ms: 2000
    - name: sample-normal-traffic
      type: probabilistic
      probabilistic:
        sampling_percentage: 5
This shape preserves a baseline sample while protecting high-value evidence.
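The evaluation can be pictured as plain logic over a buffered trace: errors first, then latency, then a probabilistic baseline. A sketch (the span shape and helper names are hypothetical simplifications; a real collector evaluates its configured policies):

```typescript
// Hypothetical summary of one buffered span; field names are illustrative.
interface SpanSummary {
  status: 'OK' | 'ERROR';
  durationMs: number;
}

// Decide keep/drop once the whole trace is buffered.
function tailDecision(
  spans: SpanSummary[],
  baselineRate: number,
  random: () => number = Math.random,
): string {
  if (spans.some((s) => s.status === 'ERROR')) return 'keep: error';
  // Simplification: use the longest span as a proxy for trace latency.
  const slowestMs = Math.max(...spans.map((s) => s.durationMs));
  if (slowestMs > 2000) return 'keep: slow';
  return random() < baselineRate ? 'keep: baseline' : 'drop';
}
```

Note what this requires operationally: every span of the trace must be held in memory until the decision runs, which is exactly the buffering cost described above.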
Do not rely on logs alone for slow distributed paths. Logs can tell you something happened. A trace can show where time moved across services.
That distinction is covered in Observability vs Logging in Production.
Async Work Needs Sampling Continuity
Sampling mistakes often appear at async boundaries:
HTTP request
-> database transaction
-> outbox row
-> relay publishes event
-> worker sends receipt email
If the outbox relay or worker starts a new unrelated trace, the incident timeline breaks.
For async work, propagate context in durable metadata:
{
  "eventType": "receipt.email_requested",
  "aggregateId": "ord_123",
  "payload": {
    "orderId": "ord_123"
  },
  "headers": {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "correlationId": "req_8a71"
  }
}
The worker extracts that context before creating its span.
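A minimal sketch of that extraction step, assuming the message shape shown above (in practice the OpenTelemetry propagation API would do this; the message type is hypothetical):

```typescript
interface QueueMessage {
  headers: { traceparent?: string; correlationId?: string };
}

// Pull trace identity out of durable message metadata so the worker's span
// joins the producer's trace instead of starting an unrelated one.
function extractTraceContext(msg: QueueMessage) {
  const header = msg.headers.traceparent;
  if (!header) return null; // fall back to correlationId-only logging
  const [version, traceId, parentSpanId, flags] = header.split('-');
  return {
    version,
    traceId,
    parentSpanId,
    sampled: (parseInt(flags, 16) & 0x01) === 0x01,
  };
}
```

The worker should run this before creating any span; creating the span first guarantees a broken timeline.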
Even if the trace is not sampled, keep a correlation ID in logs and events. That gives you a fallback investigation path when sampling drops the trace.
For the practical propagation pattern, see Correlation IDs in Microservices.
Sampling Needs Metrics Around It
You cannot judge a sampling policy by cost alone.
Track whether it preserves useful evidence.
Useful metrics:
| Metric | Why it matters |
|---|---|
| Trace sample rate by route | Shows whether important routes are under-sampled |
| Error traces kept vs total errors | Shows whether failures remain debuggable |
| Slow traces kept vs total slow requests | Shows whether latency incidents have evidence |
| Spans dropped by collector | Shows collector overload or bad buffering |
| Unsampled logs with correlation IDs | Shows fallback investigation volume |
| Trace completeness by service | Shows broken propagation |
| Sampling cost by route/service | Shows where volume comes from |
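The "kept vs total" rows reduce to coverage ratios you can alert on. A minimal sketch (the function name is illustrative):

```typescript
// Coverage ratio: of all error (or slow) requests, how many still have a trace?
function evidenceCoverage(tracesKept: number, totalEvents: number): number {
  return totalEvents === 0 ? 1 : tracesKept / totalEvents;
}

evidenceCoverage(45, 50); // 0.9: most errors remain debuggable
evidenceCoverage(1, 200); // 0.005: sampling is hiding failures
```

A dashboard that only shows spans exported per second cannot distinguish these two situations; the ratio can.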
Also run incident reviews with one question:
Did sampling hide evidence we needed?
If the answer is yes, change the policy.
Sampling should evolve with traffic, incidents, and business criticality.
Practical Sampling Policy
A practical starting policy:
| Traffic type | Sampling approach |
|---|---|
| Normal high-volume successful requests | Low ratio sample |
| Errors | Keep all or nearly all |
| Slow requests | Keep above threshold |
| Business-critical low-volume routes | Higher fixed sample |
| Debug sessions | Temporary trusted forced sampling |
| Async workers | Parent/context-based continuity |
| Background maintenance jobs | Low sample plus metrics |
Do not start with one global percentage and call it done.
Start with the questions you need traces to answer:
- Why did checkout slow down?
- Which dependency is causing webhook retries?
- Which tenant sees authorization failures?
- Where did this background job lose context?
- Did the outbox relay publish after the API committed?
Then shape sampling around those questions.
Checklist
Before changing production trace sampling, check:
- Do sampled traces stay coherent across service boundaries?
- Are errors kept at a higher rate than normal successes?
- Are slow requests kept?
- Are low-volume critical routes protected from disappearing?
- Does async work propagate trace context or correlation IDs?
- Can logs still be joined when a trace is unsampled?
- Are collector drops visible?
- Is the sample rate observable by route and service?
- Did recent incidents have enough trace evidence?
- Is cost reduction balanced against debugging value?
If sampling saves money but hides the next incident, it did not reduce cost. It moved the cost to debugging time.
Final Takeaway
Trace sampling is not just a telemetry-volume knob.
It decides what future-you will be able to see during an incident.
Use low sampling for ordinary high-volume paths, but protect errors, slow requests, rare business-critical flows, and async boundaries. Keep propagation intact, keep correlation IDs as fallback evidence, and measure whether your sampling policy is preserving the traces engineers actually need.