
Correlation IDs in Microservices
Correlation IDs in microservices help when production evidence is scattered across services, queues, workers, retries, and logs that do not naturally belong together.
The useful promise is simple: one logical operation should carry one stable handle through the places engineers will need to inspect later.
The mistake is expecting that handle to explain everything.
A correlation ID can group related evidence. It cannot, by itself, explain timing, causality, saturation, retry amplification, missing state transitions, or where latency accumulated. For those questions, correlation IDs work best alongside metrics, traces, and useful local logs.
For the broader telemetry model behind that split, see Observability vs Logging in Production. This article focuses on the practical part: how to propagate correlation context through real backend boundaries without creating a false sense of observability.
The Production Problem
A user clicks "Place order."
The request enters checkout-api, which calls payment-api, reserves inventory, writes an outbox event, publishes a receipt job, and later receives a provider webhook. The user reports that the order was charged but the receipt never arrived.
The evidence exists, but it is scattered:
checkout-api: order created
payment-api: provider authorization succeeded
inventory-api: reservation completed
outbox-worker: receipt event published
receipt-worker: email delivery failed
webhook-api: provider callback accepted
Each service tells a local truth. The incident question is cross-boundary:
Which logs, jobs, provider calls, and state changes belonged to this one checkout attempt?
Without shared context, engineers search by timestamp, user ID, order ID, provider ID, queue message ID, or whatever field happens to be available. That works until retries, replays, delayed jobs, or duplicate webhook deliveries make several workflows overlap.
A correlation ID gives the investigation a stable starting point:
correlation_id=checkout-01J3M0K0G6V7C8E9R2S1T0Q5PZ
Now the team can ask for all evidence attached to that operation.
That does not solve the incident. It makes the evidence set smaller and less speculative.
What A Correlation ID Is
A correlation ID is an identifier carried across boundaries so related telemetry and records can be grouped later.
The boundary might be:
- an inbound HTTP request
- an outbound HTTP or gRPC call
- a queue message
- a background job
- an outbox event
- a scheduled retry
- a webhook handler
- a dead-letter replay
In older systems, teams often created their own X-Request-ID or X-Correlation-ID header. In systems with distributed tracing, the trace ID often becomes the best correlation key.
W3C Trace Context defines standard HTTP headers for distributed trace propagation. The traceparent header carries fields such as the trace-id, parent-id, and trace-flags, while tracestate carries vendor-specific context. See the W3C Trace Context recommendation.
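To make the header layout concrete, here is a minimal sketch of reading a `traceparent` value. The helper name `parseTraceParent` and the `TraceParent` type are hypothetical. The sketch accepts only the version `00` layout, which is a simplification; in practice a tracing library usually does this parsing for you.

```typescript
// Fields carried by a W3C `traceparent` header (version 00 layout).
interface TraceParent {
  version: string
  traceId: string
  parentId: string
  traceFlags: string
}

// Layout: version-traceid-parentid-flags, all lowercase hex.
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/

// Hypothetical helper: returns null for anything that does not look valid.
function parseTraceParent(header: string): TraceParent | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim())
  if (!match) return null

  const [, version, traceId, parentId, traceFlags] = match

  // Simplification: only version 00 is accepted in this sketch.
  if (version !== '00') return null

  // All-zero trace IDs and parent IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null

  return { version, traceId, parentId, traceFlags }
}
```

The `traceId` field is the value most teams end up searching for during an incident, which is why it can double as the correlation key.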
OpenTelemetry builds on that model. Its context propagation documentation explains that context lets signals such as traces, metrics, and logs be correlated across process and network boundaries, and that the default propagator uses W3C Trace Context headers. See OpenTelemetry's context propagation documentation.
The practical rule is:
Use one cross-boundary context that engineers can search and follow during an incident.
If you already have trace context, prefer using it rather than inventing a second unrelated identifier. If you do not have tracing yet, a correlation ID still helps as a transitional step.
Correlation ID vs Request ID vs Trace ID
These terms often get mixed together.
They can point to the same value, but they do not mean exactly the same thing.
| Identifier | Scope | Useful for |
|---|---|---|
| request ID | one inbound request to one service | local service logs and support lookup |
| correlation ID | one logical workflow across boundaries | grouping logs, jobs, messages, and related records |
| trace ID | one distributed trace in a tracing system | connecting spans and trace-aware logs |
| span ID | one operation inside a trace | parent/child relationships and timing inside a trace |
| domain ID | a business object such as orderId | product and support workflows, not telemetry correlation alone |
A common failure is using a local request ID as if it were a workflow ID.
For example:
checkout-api request_id=req-111 correlation_id=checkout-abc
payment-api request_id=req-222 correlation_id=checkout-abc
receipt-worker job_id=job-333 correlation_id=checkout-abc
Each component can keep a local request or job ID. The correlation ID stays stable across the workflow.
In a tracing setup, the trace ID can fill that stable role:
trace_id=4bf92f3577b34da6a3ce929d0e0e4736
span_id=00f067aa0ba902b7
The important part is not the name. It is the propagation rule.
A Minimal HTTP Implementation
For an HTTP service, correlation usually starts at the edge.
The service should:
- accept a trusted incoming context or create a new one
- attach it to request-local state
- include it in logs
- propagate it to downstream calls
- return it in the response when useful for support
Here is a simplified TypeScript shape:
import { randomUUID } from 'crypto'

const CORRELATION_HEADER = 'x-correlation-id'

// Reuse a well-formed incoming ID; otherwise start a new one.
export function getOrCreateCorrelationId(headers: Headers) {
  const incoming = headers.get(CORRELATION_HEADER)

  if (incoming && isSafeCorrelationId(incoming)) {
    return incoming
  }

  return randomUUID()
}

// Restrict characters and length so the value is safe to log and forward.
function isSafeCorrelationId(value: string) {
  return /^[a-zA-Z0-9._:-]{8,128}$/.test(value)
}
Use it at the entry point:
export async function createCheckout(req: Request) {
  const correlationId = getOrCreateCorrelationId(req.headers)

  logger.info('Checkout started', {
    correlationId,
    operation: 'checkout.create',
  })

  // Pass the same ID to every downstream call and message.
  const payment = await paymentClient.authorize({
    amount: 4200,
    currency: 'USD',
    correlationId,
  })

  await receiptQueue.publish({
    type: 'receipt.send',
    orderId: payment.orderId,
    correlationId,
  })

  logger.info('Checkout completed', {
    correlationId,
    operation: 'checkout.create',
    orderId: payment.orderId,
  })

  // Echo the ID in the response so support can look the request up later.
  return Response.json(
    { orderId: payment.orderId },
    {
      headers: {
        [CORRELATION_HEADER]: correlationId,
      },
    }
  )
}
Then propagate it on outbound calls:
export async function authorizePayment(input: {
  amount: number
  currency: string
  correlationId: string
}) {
  return fetch('https://payment-api.internal/authorize', {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      'x-correlation-id': input.correlationId,
    },
    body: JSON.stringify({
      amount: input.amount,
      currency: input.currency,
    }),
  })
}
This is not a full tracing implementation. It is the minimum useful operational contract: do not drop the context at service boundaries.
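The handlers above pass `correlationId` explicitly. For the "attach it to request-local state" step, one option is `AsyncLocalStorage` from `node:async_hooks`, assuming the service runs on Node.js. The names `runWithCorrelationId`, `currentCorrelationId`, and `logInfo` below are hypothetical, and the logger is a stand-in that writes JSON to the console.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks'

interface RequestContext {
  correlationId: string
}

const requestContext = new AsyncLocalStorage<RequestContext>()

// Run a handler with the correlation ID bound to the current async call chain.
function runWithCorrelationId<T>(correlationId: string, handler: () => T): T {
  return requestContext.run({ correlationId }, handler)
}

// Read the current ID anywhere below the entry point; undefined outside a request.
function currentCorrelationId(): string | undefined {
  return requestContext.getStore()?.correlationId
}

// Hypothetical logger wrapper: adds the ID without the caller passing it around.
function logInfo(
  message: string,
  fields: Record<string, unknown> = {}
): Record<string, unknown> {
  const record = {
    level: 'info',
    message,
    correlationId: currentCorrelationId(),
    ...fields,
  }
  console.log(JSON.stringify(record))
  return record
}
```

The trade-off is that implicit context is easier to forget at process boundaries. Outbound clients and queue publishers still have to read the value and write it into a header or message attribute.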
Propagating Through Queues And Jobs
Correlation usually breaks when work leaves the synchronous request path.
HTTP propagation is visible because headers are part of the request code. Queue propagation is easier to forget because the consumer runs later, often in a different process, deployment, or team-owned service.
For queue systems, put correlation context in message metadata when the platform supports it.
Amazon SQS, for example, supports message attributes for custom metadata, and each message can have up to 10 attributes. See the AWS documentation on SQS message metadata.
A simplified publisher:
export async function publishReceiptJob(input: {
  orderId: string
  correlationId: string
}) {
  await sqs.sendMessage({
    QueueUrl: receiptQueueUrl,
    MessageBody: JSON.stringify({
      type: 'receipt.send',
      orderId: input.orderId,
    }),
    MessageAttributes: {
      correlationId: {
        DataType: 'String',
        StringValue: input.correlationId,
      },
    },
  })
}
And the worker should restore the context before logging:
export async function handleReceiptMessage(message: SqsMessage) {
  // createWorkerCorrelationId is assumed to return a recognizable fallback ID.
  const correlationId =
    message.MessageAttributes?.correlationId?.StringValue ?? createWorkerCorrelationId()

  logger.info('Receipt job started', {
    correlationId,
    jobId: message.MessageId,
    operation: 'receipt.send',
  })

  try {
    await sendReceipt(message)

    logger.info('Receipt job completed', {
      correlationId,
      jobId: message.MessageId,
      operation: 'receipt.send',
    })
  } catch (error) {
    logger.warn('Receipt job failed', {
      correlationId,
      jobId: message.MessageId,
      operation: 'receipt.send',
      // `error` is `unknown` under strict TypeScript, so narrow before reading `name`.
      errorName: error instanceof Error ? error.name : 'UnknownError',
    })

    throw error
  }
}
The fallback ID is still useful because it keeps the worker logs searchable. But a fallback should be visible. If many jobs create fallback IDs, propagation is broken upstream.
That should become a metric:
receipt_worker_missing_correlation_context_total
Correlation is not only a logging convention. It is a production contract you can test and monitor.
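As a rough sketch of how the fallback and the metric could fit together: the worker resolves its context in one place, and that place counts every miss. The `counters` map is an in-memory stand-in for a real metrics client, and `resolveWorkerCorrelationId` and the `worker-fallback-` prefix are hypothetical.

```typescript
import { randomUUID } from 'node:crypto'

// In-memory stand-in for a real metrics client (an assumption for illustration).
const counters = new Map<string, number>()

function incrementCounter(name: string): void {
  counters.set(name, (counters.get(name) ?? 0) + 1)
}

interface ResolvedContext {
  correlationId: string
  generated: boolean
}

// Restore the propagated ID, or create a visible fallback and count the miss.
function resolveWorkerCorrelationId(propagated: string | undefined): ResolvedContext {
  if (propagated) {
    return { correlationId: propagated, generated: false }
  }

  incrementCounter('receipt_worker_missing_correlation_context_total')

  // The prefix makes fallback IDs easy to spot in log searches.
  return { correlationId: `worker-fallback-${randomUUID()}`, generated: true }
}
```

An alert on the rate of that counter tends to catch a broken publisher earlier than someone noticing unrelated IDs during an incident.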
What To Put In Logs
A correlation ID is only a join key. The log still needs useful local meaning.
Prefer logs shaped around operations and outcomes:
{
  "level": "info",
  "message": "Payment authorization completed",
  "correlationId": "checkout-01J3M0K0G6V7C8E9R2S1T0Q5PZ",
  "operation": "payment.authorize",
  "orderId": "ord_123",
  "provider": "stripe",
  "attempts": 2,
  "durationMs": 1180,
  "outcome": "authorized"
}
Avoid logs that provide only activity:
{
  "message": "Calling payment service",
  "correlationId": "checkout-01J3M0K0G6V7C8E9R2S1T0Q5PZ"
}
The second log can help sometimes, but it does not say what happened.
Good correlated logs usually include:
- correlationId or traceId
- operation
- service
- local request, job, or message ID
- safe domain identifier such as orderId
- outcome or state transition
- retry attempt or final attempt count
- error class when relevant
Do not index every field as a log label. High-cardinality values such as request IDs, user IDs, order IDs, and trace IDs need careful storage and query design. The logging-overload side of that trade-off is covered in Too Much Logging in Production Breaks Debugging.
Where Correlation IDs Usually Break
Most correlation failures are not mysterious.
They happen at boundaries.
| Boundary | Common break | Fix |
|---|---|---|
| API gateway to service | gateway creates ID but service logs another field | standardize field names and forwarding |
| service to service | outbound client forgets header injection | wrap clients or use instrumentation |
| service to queue | message body has ID but metadata does not | put context in message attributes or envelope |
| queue to worker | worker logs job ID only | restore correlation context before logging |
| retry | retry creates new ID | preserve original workflow ID and add attempt metadata |
| dead-letter queue | DLQ message loses attributes | copy correlation metadata into DLQ and replay path |
| webhook | provider callback has provider ID but no internal context | map provider event to internal operation record |
| batch job | one job handles many entities | use batch correlation plus per-item IDs |
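For the webhook row, the mapping can be as simple as a lookup keyed by the provider's own identifier. The sketch below uses an in-memory `Map` in place of a database table. `OperationRecord`, `recordProviderCall`, and `contextForWebhook` are hypothetical names.

```typescript
// Hypothetical operation record, stored when the outbound provider call is made.
interface OperationRecord {
  correlationId: string
  orderId: string
}

// In-memory stand-in for a table keyed by the provider's payment ID.
const operationsByProviderId = new Map<string, OperationRecord>()

// Called right after the provider returns its ID for the payment.
function recordProviderCall(providerPaymentId: string, record: OperationRecord): void {
  operationsByProviderId.set(providerPaymentId, record)
}

// Called when the webhook arrives: recover internal context from the provider ID.
function contextForWebhook(providerPaymentId: string): OperationRecord | undefined {
  return operationsByProviderId.get(providerPaymentId)
}
```

A webhook that finds no record is worth logging with the provider ID alone, since it may point to a lost write or a callback for an operation the system never started.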
The hardest cases are not the happy paths.
They are retries, replays, dead-letter handling, scheduled jobs, and fan-out.
If an order event fans out to three workers, those workers should usually share the same root correlation context while also logging their own job IDs:
correlationId=checkout-abc jobId=receipt-1 operation=receipt.send
correlationId=checkout-abc jobId=crm-1 operation=crm.sync
correlationId=checkout-abc jobId=analytics-1 operation=analytics.track
That lets engineers see both the shared workflow and each branch.
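One way to keep that shape consistent is a small message envelope that separates workflow context from per-job fields. This is a sketch under assumptions: `JobEnvelope`, `fanOut`, and `nextAttempt` are hypothetical names, and a real queue would carry these fields in message attributes or an event envelope.

```typescript
import { randomUUID } from 'node:crypto'

// Hypothetical envelope: stable workflow context plus per-job fields.
interface JobEnvelope<T> {
  correlationId: string // stable across the whole workflow
  jobId: string // unique per job
  operation: string
  attempt: number // starts at 1, increases on retry
  payload: T
}

// Fan-out: every branch shares the root correlation ID but gets its own job ID.
function fanOut<T>(correlationId: string, operations: string[], payload: T): JobEnvelope<T>[] {
  return operations.map((operation) => ({
    correlationId,
    jobId: randomUUID(),
    operation,
    attempt: 1,
    payload,
  }))
}

// Retry: keep the original IDs and only bump the attempt counter.
function nextAttempt<T>(job: JobEnvelope<T>): JobEnvelope<T> {
  return { ...job, attempt: job.attempt + 1 }
}
```

Because a retry changes only `attempt`, a search on the correlation ID returns every attempt of every branch, and the attempt number tells them apart.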
Correlation IDs Do Not Replace Tracing
Correlation IDs answer:
What evidence belongs together?
Tracing answers more:
How did the operation move through the system, and where did time accumulate?
With only a correlation ID, you may find all logs for a checkout. You still may not know whether latency came from payment, inventory, a queue delay, a lock, or a retry.
That is why correlation IDs are a step toward observability, not the destination.
OpenTelemetry's context propagation docs explain that traces can build causal information across distributed services when trace ID and span ID context is propagated. The logs section also notes that SDKs can correlate logs with traces by injecting trace and span IDs into log records.
So when tracing is available, prefer this shape:
trace_id: stable across the distributed trace
span_id: current operation in the trace
correlation_id: optional workflow or business operation ID
order_id: domain object ID
job_id: local worker job ID
Sometimes trace_id and correlation_id can be the same practical handle. Sometimes they should stay separate because one business workflow may involve several traces over time, especially with delayed jobs, external webhooks, or replayed messages.
The rule is not "one ID forever."
The rule is "make the relationship explicit."
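A small sketch of what "explicit" can look like in log fields: the workflow handle and the trace handle are separate inputs, so a checkout request and a later webhook can log different trace IDs under the same correlation ID. `buildLogContext` and its types are hypothetical, and the ID values are illustrative only.

```typescript
// Log fields that keep workflow identity and trace identity separate.
interface LogContext {
  correlation_id: string
  trace_id: string
  span_id: string
  order_id?: string
  job_id?: string
}

interface Workflow {
  correlationId: string
  orderId?: string
}

interface ActiveSpan {
  traceId: string
  spanId: string
}

// Hypothetical helper: combine the stable workflow handle with the current span.
function buildLogContext(workflow: Workflow, span: ActiveSpan, jobId?: string): LogContext {
  return {
    correlation_id: workflow.correlationId,
    trace_id: span.traceId,
    span_id: span.spanId,
    ...(workflow.orderId ? { order_id: workflow.orderId } : {}),
    ...(jobId ? { job_id: jobId } : {}),
  }
}

// Same workflow, two traces: the original request and a later webhook.
const workflow: Workflow = { correlationId: 'checkout-abc', orderId: 'ord_123' }
const requestLog = buildLogContext(workflow, {
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
})
const webhookLog = buildLogContext(workflow, {
  traceId: '0af7651916cd43dd8448eb211c80319c',
  spanId: 'b7ad6b7169203331',
})
```

Searching by `correlation_id` then returns both traces, while each `trace_id` still opens a single coherent trace in the tracing tool.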
Security And Trust Boundaries
Correlation context crosses boundaries, so treat it like input.
OpenTelemetry warns that context propagation has security implications: incoming context from untrusted sources can be forged, and outgoing context may expose internal trace IDs, span IDs, or baggage to services you do not own. Its baggage documentation also warns that baggage can be shared with unintended resources and does not include built-in integrity checks. See OpenTelemetry's baggage documentation.
Practical rules:
- validate incoming custom correlation IDs
- generate a new trusted ID at public boundaries when needed
- do not put email addresses, tokens, user secrets, or raw PII in correlation context
- avoid sending internal context to third-party APIs unless you intend to
- treat externally supplied IDs as labels for grouping, not proof of identity
- keep domain identifiers separate from authentication and authorization decisions
A correlation ID should help debugging.
It should not become a security primitive.
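The first two rules can be combined into one decision at the edge. In this sketch, a caller is either trusted (for example, an internal service) or not. How that trust is established is outside the example, and `contextAtBoundary` and `EdgeContext` are hypothetical names.

```typescript
import { randomUUID } from 'node:crypto'

const SAFE_ID = /^[a-zA-Z0-9._:-]{8,128}$/

interface EdgeContext {
  // Trusted ID used for propagation and log grouping.
  correlationId: string
  // Valid but untrusted client value, kept only as a searchable label.
  clientSuppliedId: string | null
}

// Hypothetical helper: `trustedCaller` is assumed to be decided elsewhere,
// for example by network position or service authentication.
function contextAtBoundary(incoming: string | null, trustedCaller: boolean): EdgeContext {
  const safe = incoming !== null && SAFE_ID.test(incoming) ? incoming : null

  if (trustedCaller && safe !== null) {
    return { correlationId: safe, clientSuppliedId: null }
  }

  // Public boundary or malformed value: mint a fresh ID.
  // A valid client value from an untrusted caller is kept as a label only.
  return { correlationId: randomUUID(), clientSuppliedId: trustedCaller ? null : safe }
}
```

Keeping the client value as a separate field preserves support lookups without letting an outside caller choose the ID that internal systems group by.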
How To Test Propagation
Correlation context should be tested like any other production contract.
A useful integration test does not need a real logging backend. It can verify propagation through the code path.
it('propagates correlation context from checkout to payment and receipt job', async () => {
  const correlationId = 'test-correlation-123'

  await api.post('/checkout', {
    headers: {
      'x-correlation-id': correlationId,
    },
    body: {
      cartId: 'cart_123',
    },
  })

  expect(paymentApi.lastRequest.headers['x-correlation-id']).toBe(correlationId)
  expect(receiptQueue.lastMessage.attributes.correlationId).toBe(correlationId)
})
Add separate tests for failure paths:
| Path | Test |
|---|---|
| no incoming ID | service generates one and includes it in logs/response |
| unsafe incoming ID | service rejects or replaces it |
| outbound HTTP call | header is forwarded |
| queue publish | message metadata contains the ID |
| worker failure | error log includes the same ID |
| retry | original ID is preserved and attempt count changes |
| DLQ replay | replay keeps original correlation context plus replay metadata |
If you cannot test the logging backend directly, test the structured logger call or context object that feeds it.
The goal is to prevent silent context loss.
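Testing the logger call can be done with a capturing fake. The sketch below is framework-free. `CapturingLogger`, `handleReceiptJob`, and `entriesMissingContext` are hypothetical names, and the handler is a stand-in for real worker code.

```typescript
interface LogEntry {
  message: string
  fields: Record<string, unknown>
}

// Minimal capturing logger; a stand-in for whatever structured logger is in use.
class CapturingLogger {
  readonly entries: LogEntry[] = []

  info(message: string, fields: Record<string, unknown>): void {
    this.entries.push({ message, fields })
  }
}

// Hypothetical handler under test: it must log the same ID on every entry.
function handleReceiptJob(
  logger: CapturingLogger,
  job: { correlationId: string; jobId: string }
): void {
  const context = {
    correlationId: job.correlationId,
    jobId: job.jobId,
    operation: 'receipt.send',
  }
  logger.info('Receipt job started', context)
  logger.info('Receipt job completed', { ...context, outcome: 'sent' })
}

// Returns the messages of entries that lack the expected correlation ID.
function entriesMissingContext(logger: CapturingLogger, correlationId: string): string[] {
  return logger.entries
    .filter((entry) => entry.fields.correlationId !== correlationId)
    .map((entry) => entry.message)
}
```

In a test, run the handler with a known ID and assert that `entriesMissingContext` returns an empty array. A non-empty result names the log statements that dropped the context.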
Rollout Checklist
Start with one critical workflow.
For example: checkout, signup, payment authorization, webhook processing, or order fulfillment.
- Pick the canonical field name: correlationId, traceId, or another standard already used by the platform.
- Decide where context is trusted, replaced, or sanitized.
- Add context at the public entry point.
- Forward it on internal HTTP/gRPC calls.
- Store it in queue message metadata or an event envelope.
- Restore it before worker logging.
- Add it to state-transition logs.
- Add tests for HTTP, queue, retry, and DLQ paths.
- Add a metric for missing context at boundaries.
- Use it during one real incident or staging failure before expanding.
Do not begin by adding correlation fields to every log in every service.
Begin with one workflow where debugging currently requires timestamp guessing. Prove that the investigation gets shorter. Then expand.
Final Takeaway
Correlation IDs in microservices are valuable because they make scattered evidence belong to one operation again.
They are not magic. They do not reconstruct causality, explain latency, or replace traces and metrics. They only give engineers a stable handle for grouping evidence across boundaries.
That handle becomes powerful when it is propagated consistently through HTTP calls, queue messages, workers, retries, dead-letter paths, and logs with real local meaning.
The practical goal is simple: when production behavior is unclear, an engineer should not have to guess which records belong together before they can start debugging.