Correlation IDs in Microservices

Correlation IDs in microservices help when production evidence is scattered across services, queues, workers, retries, and logs that do not naturally belong together.

The useful promise is simple: one logical operation should carry one stable handle through the places engineers will need to inspect later.

The mistake is expecting that handle to explain everything.

A correlation ID can group related evidence. It cannot, by itself, explain timing, causality, saturation, retry amplification, missing state transitions, or where latency accumulated. For those questions, correlation IDs work best alongside metrics, traces, and useful local logs.

For the broader telemetry model behind that split, see Observability vs Logging in Production. This article focuses on the practical part: how to propagate correlation context through real backend boundaries without creating a false sense of observability.


The Production Problem

A user clicks "Place order."

The request enters checkout-api, which calls payment-api, reserves inventory, writes an outbox event, publishes a receipt job, and later receives a provider webhook. The user reports that the order was charged but the receipt never arrived.

The evidence exists, but it is scattered:

checkout-api: order created
payment-api: provider authorization succeeded
inventory-api: reservation completed
outbox-worker: receipt event published
receipt-worker: email delivery failed
webhook-api: provider callback accepted

Each service tells a local truth. The incident question is cross-boundary:

Which logs, jobs, provider calls, and state changes belonged to this one checkout attempt?

Without shared context, engineers search by timestamp, user ID, order ID, provider ID, queue message ID, or whatever field happens to be available. That works until retries, replays, delayed jobs, or duplicate webhook deliveries make several workflows overlap.

A correlation ID gives the investigation a stable starting point:

correlation_id=checkout-01J3M0K0G6V7C8E9R2S1T0Q5PZ

Now the team can ask for all evidence attached to that operation.

That does not solve the incident. It makes the evidence set smaller and less speculative.


What A Correlation ID Is

A correlation ID is an identifier carried across boundaries so related telemetry and records can be grouped later.

The boundary might be:

  • an inbound HTTP request
  • an outbound HTTP or gRPC call
  • a queue message
  • a background job
  • an outbox event
  • a scheduled retry
  • a webhook handler
  • a dead-letter replay

In older systems, teams often created their own X-Request-ID or X-Correlation-ID header. In systems with distributed tracing, the trace ID often becomes the best correlation key.

W3C Trace Context defines standard HTTP headers for distributed trace propagation. The traceparent header carries fields such as the trace-id, parent-id, and trace-flags, while tracestate carries vendor-specific context. See the W3C Trace Context recommendation.

OpenTelemetry builds on that model. Its context propagation documentation explains that context lets signals such as traces, metrics, and logs be correlated across process and network boundaries, and that the default propagator uses W3C Trace Context headers. See OpenTelemetry's context propagation documentation.

The practical rule is:

Use one cross-boundary context that engineers can search and follow during an incident.

If you already have trace context, prefer using it rather than inventing a second unrelated identifier. If you do not have tracing yet, a correlation ID still helps as a transitional step.
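For teams that take the trace-context route, the traceparent layout is small enough to parse by hand when an instrumentation library is not in place yet. The following is a minimal sketch, assuming the four-field version 00 layout from the W3C recommendation; parseTraceparent and TraceParent are illustrative names, not part of any library, and a production service would normally rely on an OpenTelemetry propagator instead.

```typescript
// Sketch: parse a W3C `traceparent` header (four-field layout, e.g. version 00).
// `TraceParent` and `parseTraceparent` are hypothetical names for illustration.
export interface TraceParent {
  version: string
  traceId: string
  parentId: string
  traceFlags: string
}

// version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/

export function parseTraceparent(header: string | null | undefined): TraceParent | null {
  if (!header) return null

  const match = TRACEPARENT_PATTERN.exec(header.trim())
  if (!match) return null

  const [, version, traceId, parentId, traceFlags] = match

  // The recommendation treats version "ff" and all-zero IDs as invalid.
  if (version === 'ff') return null
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null

  return { version, traceId, parentId, traceFlags }
}
```

The returned traceId is the value to log as the correlation key. A null result means the service should start a new context rather than trust the header.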


Correlation ID vs Request ID vs Trace ID

These terms often get mixed together.

They can point to the same value, but they do not mean exactly the same thing.

Identifier | Scope | Useful for
request ID | one inbound request to one service | local service logs and support lookup
correlation ID | one logical workflow across boundaries | grouping logs, jobs, messages, and related records
trace ID | one distributed trace in a tracing system | connecting spans and trace-aware logs
span ID | one operation inside a trace | parent/child relationships and timing inside a trace
domain ID | a business object such as orderId | product and support workflows, not telemetry correlation alone

A common failure is using a local request ID as if it were a workflow ID.

For example:

checkout-api request_id=req-111 correlation_id=checkout-abc
payment-api  request_id=req-222 correlation_id=checkout-abc
receipt-worker job_id=job-333 correlation_id=checkout-abc

Each component can keep a local request or job ID. The correlation ID stays stable across the workflow.

In a tracing setup, the trace ID can fill that stable role:

trace_id=4bf92f3577b34da6a3ce929d0e0e4736
span_id=00f067aa0ba902b7

The important part is not the name. It is the propagation rule.


A Minimal HTTP Implementation

For an HTTP service, correlation usually starts at the edge.

The service should:

  1. accept a trusted incoming context or create a new one
  2. attach it to request-local state
  3. include it in logs
  4. propagate it to downstream calls
  5. return it in the response when useful for support

Here is a simplified TypeScript shape:

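correlation.ts
import { randomUUID } from 'crypto'

// Exported so handlers and outbound clients share one header name.
export const CORRELATION_HEADER = 'x-correlation-id'

// Reuse a well-formed incoming ID; otherwise mint a new one.
export function getOrCreateCorrelationId(headers: Headers): string {
  const incoming = headers.get(CORRELATION_HEADER)

  if (incoming && isSafeCorrelationId(incoming)) {
    return incoming
  }

  return randomUUID()
}

// Bound the length and character set so arbitrary input never reaches logs or headers.
function isSafeCorrelationId(value: string): boolean {
  return /^[a-zA-Z0-9._:-]{8,128}$/.test(value)
}

Use it at the entry point:

checkout-handler.ts
import { CORRELATION_HEADER, getOrCreateCorrelationId } from './correlation'
import { authorizePayment } from './payment-client'
import { publishReceiptJob } from './receipt-queue'

// `logger` is the service's structured logger, assumed to be in scope.
export async function createCheckout(req: Request) {
  const correlationId = getOrCreateCorrelationId(req.headers)

  logger.info('Checkout started', {
    correlationId,
    operation: 'checkout.create',
  })

  const payment = await authorizePayment({
    amount: 4200,
    currency: 'USD',
    correlationId,
  })

  await publishReceiptJob({
    orderId: payment.orderId,
    correlationId,
  })

  logger.info('Checkout completed', {
    correlationId,
    operation: 'checkout.create',
    orderId: payment.orderId,
  })

  // Return the ID so support can ask the user for it later.
  return Response.json(
    { orderId: payment.orderId },
    {
      headers: {
        [CORRELATION_HEADER]: correlationId,
      },
    }
  )
}

Then propagate it on outbound calls:

payment-client.ts
import { CORRELATION_HEADER } from './correlation'

export async function authorizePayment(input: {
  amount: number
  currency: string
  correlationId: string
}): Promise<{ orderId: string }> {
  const response = await fetch('https://payment-api.internal/authorize', {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      [CORRELATION_HEADER]: input.correlationId,
    },
    body: JSON.stringify({
      amount: input.amount,
      currency: input.currency,
    }),
  })

  // fetch resolves on HTTP error statuses, so check before parsing.
  if (!response.ok) {
    throw new Error(`Payment authorization failed with status ${response.status}`)
  }

  // Assumes the payment API responds with JSON that contains an orderId.
  return (await response.json()) as { orderId: string }
}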

This is not a full tracing implementation. It is the minimum useful operational contract: do not drop the context at service boundaries.
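Passing correlationId through every function signature works, but it is easy to miss one call site. Step 2 of the list above ("attach it to request-local state") can be sketched with Node.js AsyncLocalStorage, assuming the service runs on Node.js. The helper names runWithCorrelation, currentCorrelationId, and outboundHeaders are hypothetical.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks'

// Sketch: request-local correlation context for a Node.js service.
// All helper names below are illustrative, not an existing library API.
interface RequestContext {
  correlationId: string
}

const requestContext = new AsyncLocalStorage<RequestContext>()

// Run `fn` with the correlation ID bound to the current async call chain.
export function runWithCorrelation<T>(correlationId: string, fn: () => T): T {
  return requestContext.run({ correlationId }, fn)
}

// Returns undefined when called outside `runWithCorrelation`.
export function currentCorrelationId(): string | undefined {
  return requestContext.getStore()?.correlationId
}

// Build headers for an outbound call, adding the correlation header when context exists.
export function outboundHeaders(base: Record<string, string> = {}): Record<string, string> {
  const correlationId = currentCorrelationId()
  return correlationId ? { ...base, 'x-correlation-id': correlationId } : { ...base }
}
```

A handler would wrap its body in runWithCorrelation(correlationId, ...), and HTTP clients would call outboundHeaders() instead of receiving the ID as a parameter. The trade-off is implicit state: code that escapes the async chain, such as a job picked up later by another process, still needs the ID carried explicitly in the message.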


Propagating Through Queues And Jobs

Correlation usually breaks when work leaves the synchronous request path.

HTTP propagation is visible because headers are part of the request code. Queue propagation is easier to forget because the consumer runs later, often in a different process, deployment, or team-owned service.

For queue systems, put correlation context in message metadata when the platform supports it.

Amazon SQS, for example, supports message attributes for custom metadata, and each message can have up to 10 attributes. See the AWS documentation on SQS message metadata.

A simplified publisher:

receipt-queue.ts
export async function publishReceiptJob(input: {
  orderId: string
  correlationId: string
}) {
  await sqs.sendMessage({
    QueueUrl: receiptQueueUrl,
    MessageBody: JSON.stringify({
      type: 'receipt.send',
      orderId: input.orderId,
    }),
    MessageAttributes: {
      correlationId: {
        DataType: 'String',
        StringValue: input.correlationId,
      },
    },
  })
}

And the worker should restore the context before logging:

receipt-worker.ts
export async function handleReceiptMessage(message: SqsMessage) {
  const correlationId =
    message.MessageAttributes?.correlationId?.StringValue ?? createWorkerCorrelationId()

  logger.info('Receipt job started', {
    correlationId,
    jobId: message.MessageId,
    operation: 'receipt.send',
  })

  try {
    await sendReceipt(message)

    logger.info('Receipt job completed', {
      correlationId,
      jobId: message.MessageId,
      operation: 'receipt.send',
    })
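  } catch (error) {
    // `error` is `unknown` under strict TypeScript, so narrow it before reading `name`.
    logger.warn('Receipt job failed', {
      correlationId,
      jobId: message.MessageId,
      operation: 'receipt.send',
      errorName: error instanceof Error ? error.name : 'UnknownError',
    })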

    throw error
  }
}

The fallback ID is still useful because it keeps the worker logs searchable. But a fallback should be visible. If many jobs create fallback IDs, propagation is broken upstream.

That should become a metric:

receipt_worker_missing_correlation_context_total

Correlation is not only a logging convention. It is a production contract you can test and monitor.
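One way to make fallbacks visible is to resolve the ID and count the miss in the same place. This is a minimal sketch: the in-memory counter stands in for whatever metrics client the service already uses, and resolveWorkerCorrelation and the fallback- prefix are assumptions made for illustration.

```typescript
import { randomUUID } from 'node:crypto'

// Sketch: resolve worker correlation context and count missing context.
// The counter is a stand-in for a real metrics client (an assumption).
let missingCorrelationContextTotal = 0

export function getMissingCorrelationContextTotal(): number {
  return missingCorrelationContextTotal
}

// Only the part of the SQS message attributes this sketch reads.
interface MessageAttributes {
  correlationId?: { StringValue?: string }
}

export function resolveWorkerCorrelation(attributes: MessageAttributes | undefined): {
  correlationId: string
  isFallback: boolean
} {
  const propagated = attributes?.correlationId?.StringValue

  if (propagated) {
    return { correlationId: propagated, isFallback: false }
  }

  // Upstream dropped the context: count it and mark the ID so it stands out in logs.
  missingCorrelationContextTotal += 1
  return { correlationId: `fallback-${randomUUID()}`, isFallback: true }
}
```

The prefix makes fallback IDs easy to spot in a log search, and the counter gives an alert something to watch when an upstream publisher stops sending the attribute.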


What To Put In Logs

A correlation ID is only a join key. The log still needs useful local meaning.

Prefer logs shaped around operations and outcomes:

{
  "level": "info",
  "message": "Payment authorization completed",
  "correlationId": "checkout-01J3M0K0G6V7C8E9R2S1T0Q5PZ",
  "operation": "payment.authorize",
  "orderId": "ord_123",
  "provider": "stripe",
  "attempts": 2,
  "durationMs": 1180,
  "outcome": "authorized"
}

Avoid logs that provide only activity:

{
  "message": "Calling payment service",
  "correlationId": "checkout-01J3M0K0G6V7C8E9R2S1T0Q5PZ"
}

The second log can help sometimes, but it does not say what happened.

Good correlated logs usually include:

  • correlationId or traceId
  • operation
  • service
  • local request, job, or message ID
  • safe domain identifier such as orderId
  • outcome or state transition
  • retry attempt or final attempt count
  • error class when relevant

Do not index every field as a log label. High-cardinality values such as request IDs, user IDs, order IDs, and trace IDs need careful storage and query design. The logging-overload side of that trade-off is covered in Too Much Logging in Production Breaks Debugging.
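The field list above can be enforced with a type instead of a convention. The sketch below assumes JSON-lines logging; CorrelatedLogFields and buildLogLine are hypothetical names, and the required/optional split is one possible choice, not a standard.

```typescript
// Sketch: a typed log record so required correlation fields cannot be forgotten.
// Field names mirror the list above; the interface itself is illustrative.
interface CorrelatedLogFields {
  correlationId: string
  service: string
  operation: string
  outcome: string
  localId?: string // request, job, or message ID
  orderId?: string
  attempt?: number
  errorName?: string
}

export function buildLogLine(
  level: 'info' | 'warn' | 'error',
  message: string,
  fields: CorrelatedLogFields,
): string {
  // JSON.stringify omits properties whose value is undefined,
  // so unset optional fields do not appear in the output.
  return JSON.stringify({ level, message, ...fields })
}
```

With this shape, a log call that omits operation or outcome fails to compile, which catches the activity-only log before it ships.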


Where Correlation IDs Usually Break

Most correlation failures are not mysterious.

They happen at boundaries.

Boundary | Common break | Fix
API gateway to service | gateway creates ID but service logs another field | standardize field names and forwarding
service to service | outbound client forgets header injection | wrap clients or use instrumentation
service to queue | message body has ID but metadata does not | put context in message attributes or envelope
queue to worker | worker logs job ID only | restore correlation context before logging
retry | retry creates new ID | preserve original workflow ID and add attempt metadata
dead-letter queue | DLQ message loses attributes | copy correlation metadata into DLQ and replay path
webhook | provider callback has provider ID but no internal context | map provider event to internal operation record
batch job | one job handles many entities | use batch correlation plus per-item IDs

The hardest cases are not the happy paths.

They are retries, replays, dead-letter handling, scheduled jobs, and fan-out.

If an order event fans out to three workers, those workers should usually share the same root correlation context while also logging their own job IDs:

correlationId=checkout-abc jobId=receipt-1 operation=receipt.send
correlationId=checkout-abc jobId=crm-1 operation=crm.sync
correlationId=checkout-abc jobId=analytics-1 operation=analytics.track

That lets engineers see both the shared workflow and each branch.
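The retry and fan-out rules can be captured in a small job envelope. This is a sketch under assumptions: JobEnvelope, nextAttempt, and fanOut are hypothetical names, and a real queue would carry these fields in message metadata or an event envelope.

```typescript
// Sketch: a job envelope that keeps the workflow ID stable across retries and fan-out.
// All names are illustrative.
interface JobEnvelope {
  correlationId: string // stable for the whole workflow
  jobId: string // local to this job
  operation: string
  attempt: number // 1 for the first delivery
}

// A retry keeps correlationId and jobId; only the attempt counter changes.
export function nextAttempt(job: JobEnvelope): JobEnvelope {
  return { ...job, attempt: job.attempt + 1 }
}

// Fan-out creates one child job per operation, all sharing the root correlationId.
export function fanOut(
  root: { correlationId: string },
  operations: string[],
  newJobId: (operation: string) => string,
): JobEnvelope[] {
  return operations.map((operation) => ({
    correlationId: root.correlationId,
    jobId: newJobId(operation),
    operation,
    attempt: 1,
  }))
}
```

Keeping attempt as separate metadata is what lets a search on correlationId return every retry of the same job instead of a set of unrelated-looking workflows.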


Correlation IDs Do Not Replace Tracing

Correlation IDs answer:

What evidence belongs together?

Tracing answers more:

How did the operation move through the system, and where did time accumulate?

With only a correlation ID, you may find all logs for a checkout. You still may not know whether latency came from payment, inventory, a queue delay, a lock, or a retry.

That is why correlation IDs are a step toward observability, not the destination.

OpenTelemetry's context propagation docs explain that traces can build causal information across distributed services when trace ID and span ID context is propagated. The logs section of those docs also notes that SDKs can correlate logs with traces by injecting trace and span IDs into log records.

So when tracing is available, prefer this shape:

trace_id       stable across the distributed trace
span_id        current operation in the trace
correlation_id optional workflow or business operation ID
order_id       domain object ID
job_id         local worker job ID

Sometimes trace_id and correlation_id can be the same practical handle. Sometimes they should stay separate because one business workflow may involve several traces over time, especially with delayed jobs, external webhooks, or replayed messages.

The rule is not "one ID forever."

The rule is "make the relationship explicit."


Security And Trust Boundaries

Correlation context crosses boundaries, so treat it like input.

OpenTelemetry warns that context propagation has security implications: incoming context from untrusted sources can be forged, and outgoing context may expose internal trace IDs, span IDs, or baggage to services you do not own. Its baggage documentation also warns that baggage can be shared with unintended resources and does not include built-in integrity checks. See OpenTelemetry's baggage documentation.

Practical rules:

  • validate incoming custom correlation IDs
  • generate a new trusted ID at public boundaries when needed
  • do not put email addresses, tokens, user secrets, or raw PII in correlation context
  • avoid sending internal context to third-party APIs unless you intend to
  • treat externally supplied IDs as labels for grouping, not proof of identity
  • keep domain identifiers separate from authentication and authorization decisions

A correlation ID should help debugging.

It should not become a security primitive.
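The first two rules can be combined into one decision at the edge. The sketch below assumes the service can tell internal callers from public ones (for example, by which listener or gateway the request arrived on); resolveAtBoundary, BoundaryContext, and externalCorrelationId are hypothetical names.

```typescript
import { randomUUID } from 'node:crypto'

// Sketch: decide what to do with an incoming correlation ID based on its source.
// Same character and length rule as the validation shown earlier in this article.
const SAFE_ID = /^[a-zA-Z0-9._:-]{8,128}$/

export interface BoundaryContext {
  correlationId: string // the ID this service logs and propagates
  externalCorrelationId?: string // client-supplied value, kept only as a search label
}

export function resolveAtBoundary(
  incoming: string | null,
  source: 'internal' | 'public',
): BoundaryContext {
  const safe = incoming !== null && SAFE_ID.test(incoming) ? incoming : undefined

  // Trusted internal callers: continue their workflow when the ID is well-formed.
  if (source === 'internal' && safe) {
    return { correlationId: safe }
  }

  // Public callers or malformed input: mint a new trusted ID.
  const context: BoundaryContext = { correlationId: randomUUID() }

  // Keep a well-formed public value as a label so support can still search by it.
  if (source === 'public' && safe) {
    context.externalCorrelationId = safe
  }

  return context
}
```

Logging both values keeps the client's reference searchable without letting an outside caller choose the ID that internal services trust.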


How To Test Propagation

Correlation context should be tested like any other production contract.

A useful integration test does not need a real logging backend. It can verify propagation through the code path.

correlation.test.ts
it('propagates correlation context from checkout to payment and receipt job', async () => {
  const correlationId = 'test-correlation-123'

  await api.post('/checkout', {
    headers: {
      'x-correlation-id': correlationId,
    },
    body: {
      cartId: 'cart_123',
    },
  })

  expect(paymentApi.lastRequest.headers['x-correlation-id']).toBe(correlationId)

  expect(receiptQueue.lastMessage.attributes.correlationId).toBe(correlationId)
})

Add separate tests for failure paths:

Path | Test
no incoming ID | service generates one and includes it in logs/response
unsafe incoming ID | service rejects or replaces it
outbound HTTP call | header is forwarded
queue publish | message metadata contains the ID
worker failure | error log includes the same ID
retry | original ID is preserved and attempt count changes
DLQ replay | replay keeps original correlation context plus replay metadata

If you cannot test the logging backend directly, test the structured logger call or context object that feeds it.

The goal is to prevent silent context loss.
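Testing the structured logger call can be as simple as swapping in a logger that records entries. This is a sketch, assuming the code under test accepts a logger with info, warn, and error methods; RecordingLogger and uncorrelated are hypothetical names.

```typescript
// Sketch: an in-memory logger for tests that records every structured log call.
interface LogEntry {
  level: 'info' | 'warn' | 'error'
  message: string
  fields: Record<string, unknown>
}

export class RecordingLogger {
  readonly entries: LogEntry[] = []

  info(message: string, fields: Record<string, unknown> = {}): void {
    this.entries.push({ level: 'info', message, fields })
  }

  warn(message: string, fields: Record<string, unknown> = {}): void {
    this.entries.push({ level: 'warn', message, fields })
  }

  error(message: string, fields: Record<string, unknown> = {}): void {
    this.entries.push({ level: 'error', message, fields })
  }

  // Messages of entries whose correlationId is missing or differs from `expected`.
  uncorrelated(expected: string): string[] {
    return this.entries
      .filter((entry) => entry.fields.correlationId !== expected)
      .map((entry) => entry.message)
  }
}
```

A worker-failure test can then assert that uncorrelated('test-correlation-123') is empty after the handler throws, which catches an error path that logs without context.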


Rollout Checklist

Start with one critical workflow.

For example: checkout, signup, payment authorization, webhook processing, or order fulfillment.

  1. Pick the canonical field name: correlationId, traceId, or another standard already used by the platform.
  2. Decide where context is trusted, replaced, or sanitized.
  3. Add context at the public entry point.
  4. Forward it on internal HTTP/gRPC calls.
  5. Store it in queue message metadata or an event envelope.
  6. Restore it before worker logging.
  7. Add it to state-transition logs.
  8. Add tests for HTTP, queue, retry, and DLQ paths.
  9. Add a metric for missing context at boundaries.
  10. Use it during one real incident or staging failure before expanding.

Do not begin by adding correlation fields to every log in every service.

Begin with one workflow where debugging currently requires timestamp guessing. Prove that the investigation gets shorter. Then expand.


Final Takeaway

Correlation IDs in microservices are valuable because they make scattered evidence belong to one operation again.

They are not magic. They do not reconstruct causality, explain latency, or replace traces and metrics. They only give engineers a stable handle for grouping evidence across boundaries.

That handle becomes powerful when it is propagated consistently through HTTP calls, queue messages, workers, retries, dead-letter paths, and logs with real local meaning.

The practical goal is simple: when production behavior is unclear, an engineer should not have to guess which records belong together before they can start debugging.