Trace Sampling in Production: What You Lose When You Sample Wrong

Trace sampling in production decides which distributed traces you keep and which traces disappear forever. The wrong sampling policy can make dashboards cheaper while hiding the exact slow request, failed checkout, missing background job, or cross-service timing gap you needed during an incident.

Sampling is not only a storage setting. It is a debugging trade-off.

This article is part of the Observability And Debugging hub. It builds on OpenTelemetry for Backend Engineers, Correlation IDs in Microservices, and Observability vs Logging in Production.


Why Sampling Exists

A busy service can create enormous trace volume.

If every request records every span across every service, the telemetry bill and storage volume can grow faster than the value of the data.

Sampling reduces that volume.

For example:

Traffic              | Spans per request | Full trace volume
2,000 requests/sec   | 20 spans          | 40,000 spans/sec
20,000 requests/sec  | 20 spans          | 400,000 spans/sec
100,000 requests/sec | 40 spans          | 4,000,000 spans/sec

Sampling is reasonable.

The danger is treating all traces as equally disposable.

A random 1% sample might be enough to see the normal request path. It might not be enough to catch a rare payment timeout, a slow webhook replay, a tenant-specific authorization failure, or an async worker trace that only appears under retry pressure.

Sampling answers:

Which evidence are we willing to lose before we know whether it matters?


Head Sampling Drops Early

Head sampling makes the sampling decision near the start of the trace.

For example, sample 10% of new traces:

import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node'
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base'

const provider = new NodeTracerProvider({
  // keep ~10% of new traces, chosen deterministically from the trace ID
  sampler: new TraceIdRatioBasedSampler(0.1),
})

The OpenTelemetry JavaScript sampling docs describe TraceIdRatioBasedSampler as a deterministic percentage-based sampler, configurable through code or environment variables. See OpenTelemetry JavaScript sampling.
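If you prefer configuration over code, the spec-defined sampling environment variables express the same policy (a sketch; how they are picked up depends on how your SDK is initialized):

OTEL_TRACES_SAMPLER=traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1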

Head sampling is simple and cheap because spans in unsampled traces are usually neither recorded nor exported.

But head sampling has one unavoidable limitation:

It decides before it knows whether the request will become interesting.

At the start of a request, you usually do not know whether it will:

  • return 500
  • exceed the latency SLO
  • hit a rare dependency path
  • retry a provider call three times
  • enqueue a job that later fails
  • cross a tenant boundary with bad permissions

If the root decision says "drop," that trace may be gone before the interesting part happens.


Parent-Based Sampling Preserves Trace Shape

Parent-based sampling means child spans follow the sampling decision of the parent trace.

That matters because partial traces can be misleading.

Bad:

checkout-api sampled
payment-api missing
receipt-worker sampled separately

Better:

checkout-api sampled
  -> payment-api sampled
  -> outbox relay sampled
  -> receipt-worker sampled

OpenTelemetry's trace SDK specification lists ParentBased(root=AlwaysOn) as the default sampler, and language SDKs commonly recommend combining parent-based sampling with a ratio-based root sampler in production.

The key point is practical:

Once a trace is sampled, keep the trace coherent across service boundaries.

Otherwise the trace may show only the easy part of the request and hide the downstream span that made it slow.
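
In the OpenTelemetry Node SDK, that shape looks roughly like the sketch below (assuming the sdk-trace-node and sdk-trace-base packages; the 0.1 ratio is illustrative):

import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node'
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base'

// Root spans are ratio-sampled; child spans follow the parent's decision,
// so a trace is kept whole or dropped whole as long as every service
// respects the propagated sampling flag.
const provider = new NodeTracerProvider({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
})
provider.register()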

For propagation, the W3C Trace Context standard defines the traceparent header and its sampled flag. That is the metadata downstream services use to understand trace identity and sampling state.


Tail Sampling Decides Later

Tail sampling waits until more of the trace is known before deciding whether to keep it.

That enables policies like:

Keep trace when...                        | Why
request status is 500                     | Errors are high debugging value
latency exceeds 2 seconds                 | Slow traces explain user pain
span has payment.provider.timeout         | Rare dependency failures matter
tenant is in a watchlist                  | Debugging specific customer impact
route is low-volume but business-critical | Random sampling may miss it

Tail sampling is powerful because it can keep unusual traces even when the normal path is heavily sampled.

The cost is operational complexity.

Tail sampling needs a collector or backend that can buffer spans long enough to evaluate the trace. It uses more memory, adds delay before export, and can still lose traces if the collector is overloaded.

Head sampling is cheaper. Tail sampling is more selective.

The right answer is often a combination:

  • head sample normal high-volume traces
  • always keep obvious errors
  • tail sample slow or rare traces
  • keep correlation IDs in logs for unsampled traces


What You Lose With A Flat 1% Sample

A flat sample rate treats all requests the same.

That is simple, but production traffic is not uniform.

Imagine this traffic:

Route                  | Requests/min | Problem rate           | 1% sample keeps
GET /feed              | 100,000      | common latency noise   | ~1,000 traces
POST /checkout         | 2,000        | rare payment timeout   | ~20 traces
POST /webhooks/payment | 200          | duplicate delivery bug | ~2 traces
POST /admin/refund     | 20           | high business impact   | maybe 0 traces

The flat sample gives you many traces for the loud route and few or none for the routes where one failure matters.

That is not wrong mathematically. It is wrong operationally if your incidents usually come from rare but important paths.

Use route-aware sampling when needed:

function shouldSampleRootSpan(route: string) {
  if (route.startsWith('/admin/refund')) return 1.0
  if (route.startsWith('/webhooks/payment')) return 0.25
  if (route.startsWith('/checkout')) return 0.1
  return 0.01
}

This does not mean every low-volume route gets 100% forever. It means sampling policy should reflect debugging value, not only request volume.
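
To make those ratios drive actual head-sampling decisions, one option is a custom root sampler that delegates to a ratio sampler per request. A minimal sketch, assuming the route is already available as an http.route attribute when the root span starts (that depends on your instrumentation) and reusing the shouldSampleRootSpan helper above:

import {
  Sampler,
  SamplingResult,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base'
import { Attributes, Context, Link, SpanKind } from '@opentelemetry/api'

class RouteAwareSampler implements Sampler {
  shouldSample(
    context: Context,
    traceId: string,
    spanName: string,
    spanKind: SpanKind,
    attributes: Attributes,
    links: Link[],
  ): SamplingResult {
    // Pick a ratio from the route, then let the deterministic
    // trace-ID ratio sampler make the actual keep/drop decision.
    const route = String(attributes['http.route'] ?? '')
    const ratio = shouldSampleRootSpan(route)
    return new TraceIdRatioBasedSampler(ratio).shouldSample(
      context, traceId, spanName, spanKind, attributes, links,
    )
  }

  toString(): string {
    return 'RouteAwareSampler'
  }
}

Use it as the root of a ParentBasedSampler so downstream spans still follow the parent decision.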


Keep Errors And Slow Requests

If you can tail sample, keep traces that are already known to be valuable:

  • failed requests
  • slow requests
  • dependency timeouts
  • retries over a threshold
  • queue processing that exceeds the lease
  • traces with unusual status transitions
  • traces with explicit debug headers from trusted internal tools

Example policy:

processors:
  tail_sampling:
    # how long to buffer spans before evaluating policies
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow-checkout
        type: latency
        latency:
          threshold_ms: 2000
      - name: sample-normal-traffic
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

This shape preserves a baseline sample while protecting high-value evidence.
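
The processor only takes effect when it sits in the collector's traces pipeline; a minimal sketch, assuming an OTLP receiver and exporter:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]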

Do not rely on logs alone for slow distributed paths. Logs can tell you something happened. A trace can show where time moved across services.

That distinction is covered in Observability vs Logging in Production.


Async Work Needs Sampling Continuity

Sampling mistakes often appear at async boundaries:

HTTP request
  -> database transaction
  -> outbox row
  -> relay publishes event
  -> worker sends receipt email

If the outbox relay or worker starts a new unrelated trace, the incident timeline breaks.

For async work, propagate context in durable metadata:

{
  "eventType": "receipt.email_requested",
  "aggregateId": "ord_123",
  "payload": {
    "orderId": "ord_123"
  },
  "headers": {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "correlationId": "req_8a71"
  }
}

The worker extracts that context before creating its span.
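
On the consuming side, that extraction is a few lines with the OpenTelemetry propagation API. A minimal sketch in which handleReceiptEvent and the event shape are illustrative:

import { context, propagation, trace } from '@opentelemetry/api'

// Restore the producer's trace context from the event's durable headers
// before starting the worker span, so the async hop stays in the same trace.
function handleReceiptEvent(event: { headers: Record<string, string> }) {
  const parentContext = propagation.extract(context.active(), event.headers)
  const tracer = trace.getTracer('receipt-worker')
  const span = tracer.startSpan('receipt.send_email', undefined, parentContext)
  try {
    // ... send the email ...
  } finally {
    span.end()
  }
}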

Even if the trace is not sampled, keep a correlation ID in logs and events. That gives you a fallback investigation path when sampling drops the trace.

For the practical propagation pattern, see Correlation IDs in Microservices.


Sampling Needs Metrics Around It

You cannot judge a sampling policy by cost alone.

Track whether it preserves useful evidence.

Useful metrics:

Metric                                  | Why it matters
Trace sample rate by route              | Shows whether important routes are under-sampled
Error traces kept vs total errors       | Shows whether failures remain debuggable
Slow traces kept vs total slow requests | Shows whether latency incidents have evidence
Spans dropped by collector              | Shows collector overload or bad buffering
Unsampled logs with correlation IDs     | Shows fallback investigation volume
Trace completeness by service           | Shows broken propagation
Sampling cost by route/service          | Shows where volume comes from
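
One way to make the "error traces kept vs total errors" ratio observable is to count both at the moment a request fails. A minimal sketch with the OpenTelemetry metrics API; the counter names and the recordRequestError hook are illustrative:

import { metrics, trace } from '@opentelemetry/api'

const meter = metrics.getMeter('sampling-visibility')
const errorsTotal = meter.createCounter('requests.errors.total')
const errorTracesKept = meter.createCounter('requests.errors.traced.total')

// Call this wherever a request is marked as failed.
function recordRequestError(route: string) {
  errorsTotal.add(1, { route })
  // isRecording() is false when the SDK's head sampler dropped the trace;
  // collector-side tail decisions have to be measured at the collector instead.
  if (trace.getActiveSpan()?.isRecording()) {
    errorTracesKept.add(1, { route })
  }
}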

Also run incident reviews with one question:

Did sampling hide evidence we needed?

If the answer is yes, change the policy.

Sampling should evolve with traffic, incidents, and business criticality.


Practical Sampling Policy

A practical starting policy:

Traffic type                           | Sampling approach
Normal high-volume successful requests | Low ratio sample
Errors                                 | Keep all or nearly all
Slow requests                          | Keep above threshold
Business-critical low-volume routes    | Higher fixed sample
Debug sessions                         | Temporary trusted forced sampling
Async workers                          | Parent/context-based continuity
Background maintenance jobs            | Low sample plus metrics

Do not start with one global percentage and call it done.

Start with the questions you need traces to answer:

  • Why did checkout slow down?
  • Which dependency is causing webhook retries?
  • Which tenant sees authorization failures?
  • Where did this background job lose context?
  • Did the outbox relay publish after the API committed?

Then shape sampling around those questions.


Checklist

Before changing production trace sampling, check:

  • Do sampled traces stay coherent across service boundaries?
  • Are errors kept at a higher rate than normal successes?
  • Are slow requests kept?
  • Are low-volume critical routes protected from disappearing?
  • Does async work propagate trace context or correlation IDs?
  • Can logs still be joined when a trace is unsampled?
  • Are collector drops visible?
  • Is the sample rate observable by route and service?
  • Did recent incidents have enough trace evidence?
  • Is cost reduction balanced against debugging value?

If sampling saves money but hides the next incident, it did not reduce cost. It moved the cost to debugging time.


Final Takeaway

Trace sampling is not just a telemetry-volume knob.

It decides what future-you will be able to see during an incident.

Use low sampling for ordinary high-volume paths, but protect errors, slow requests, rare business-critical flows, and async boundaries. Keep propagation intact, keep correlation IDs as fallback evidence, and measure whether your sampling policy is preserving the traces engineers actually need.