OpenTelemetry for Backend Engineers

OpenTelemetry helps backend engineers when production debugging depends on following one request, job, or message across service boundaries.

The common mistake is adopting OpenTelemetry as a telemetry-volume project: add auto-instrumentation, send spans somewhere, celebrate that traces exist, and then discover during the next incident that engineers still cannot answer which dependency slowed down, which retry happened first, which job belonged to the user-visible request, or which service version produced the failing span.

That is not an OpenTelemetry failure. It is an instrumentation-design failure.

A useful OpenTelemetry rollout starts with the production questions engineers need to answer. Then it instruments the path that answers those questions with stable spans, low-cardinality attributes, trace context that survives async boundaries, metrics that show system shape, logs that correlate with the active span, and a sampling/export path the team understands.

For the broader reason logs alone struggle with this kind of work, see Observability vs Logging in Production.


What OpenTelemetry Is For

OpenTelemetry is a vendor-neutral observability framework for generating, collecting, and exporting telemetry such as traces, metrics, and logs. The official OpenTelemetry docs describe instrumentation as the work of making system components emit those signals, either through code-based APIs and SDKs or through zero-code instrumentation. See the OpenTelemetry docs on instrumentation.

That definition matters, but it is not the operational goal.

The operational goal is this:

A production path should become easier to navigate than to reconstruct.

Without connected telemetry, a slow checkout incident often turns into log archaeology:

checkout-api logs -> payment logs -> inventory logs -> queue logs -> database dashboard

Each system has evidence, but the relationships are weak.

With useful OpenTelemetry instrumentation, the investigation should look more like:

latency alert
  -> route metric shows checkout p95 rising
  -> representative trace shows payment-auth span consumes 1.8s
  -> span attributes show retry_attempt=2 and region=eu-west
  -> correlated logs show provider rate-limit response
  -> metrics show this is regional, not fleet-wide

The value is not that a trace exists. The value is that the trace reduces uncertainty.


Traces, Metrics, And Logs Have Different Jobs

OpenTelemetry organizes telemetry into signals. Its specification describes tracing, metrics, logs, resources, context propagation, semantic conventions, collectors, and instrumentation libraries as separate pieces of the client architecture. See the OpenTelemetry specification overview.

For backend engineers, the practical split is:

| Signal | Best question | Weak question |
| --- | --- | --- |
| Traces | How did one request, job, or message move through the system? | Is the whole service getting worse? |
| Metrics | How often, how slow, how many, and how widespread? | What exactly happened inside one failed request? |
| Logs | What detailed local event or error happened here? | What was the full cross-service path? |

OpenTelemetry does not make logs obsolete. It puts logs in a healthier role.

Logs are still the right place for detailed local evidence: provider error bodies, validation failures, domain state transitions, and debugging context that would be too noisy as span attributes.

Traces are the right place for causality. Metrics are the right place for scale and trend. Logs are the right place for local detail.

If your current debugging process starts by searching large log streams and manually rebuilding the path, that is the failure mode described in Too Much Logging in Production Breaks Debugging, and the one OpenTelemetry should help you escape.


Start With One Production Path

Do not start by instrumenting everything.

Start with one important backend path that currently takes too long to debug:

  • checkout request
  • invoice creation
  • webhook delivery
  • background job execution
  • search indexing job
  • payment authorization
  • user signup flow

Then write the questions the telemetry must answer:

| Production question | Signal that should answer it |
| --- | --- |
| Which dependency consumed most of the request time? | trace spans |
| Is this local to one route, tenant, queue, or region? | metrics with bounded attributes |
| Did retries hide a partial failure? | span attributes and retry metrics |
| Which job belongs to this request? | trace context propagation |
| What exact provider error happened? | correlated logs |
| Did a new deployment introduce it? | resource attributes and deployment metadata |

That table is the rollout plan.

If the first path does not become easier to debug, expanding to ten more paths only multiplies the noise.

This is the same habit as good debugging: define the symptom and evidence path before changing the system. The general workflow is covered in How to Debug Effectively: A Practical Guide.


Set Up The SDK Before Application Code

In Node.js, OpenTelemetry setup must run before the application code that should be instrumented. The official Node.js getting-started guide shows a separate instrumentation file, NodeSDK, and the auto-instrumentations-node package for automatic spans around supported libraries. See the OpenTelemetry JavaScript Node.js guide.

A production-shaped setup usually sends telemetry through OTLP and sets resource attributes explicitly:

instrumentation.ts
import { NodeSDK } from '@opentelemetry/sdk-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http'
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { resourceFromAttributes } from '@opentelemetry/resources'

const sdk = new NodeSDK({
  resource: resourceFromAttributes({
    'service.name': 'billing-api',
    'service.version': process.env.APP_VERSION ?? 'unknown',
    'deployment.environment.name': process.env.NODE_ENV ?? 'development',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT,
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_METRICS_ENDPOINT,
    }),
  }),
  instrumentations: [getNodeAutoInstrumentations()],
})

sdk.start()

The details vary by runtime and package versions, but the ownership model is stable:

  • initialize OpenTelemetry before app imports, for example by preloading the instrumentation file with Node's --require or --import flag
  • set service.name deliberately
  • attach deployment and version metadata
  • use auto-instrumentation for framework and client-library edges
  • add manual spans only where the business workflow is otherwise unclear

OpenTelemetry resources describe the entity producing telemetry. The docs specifically recommend setting service.name because, without it, the SDK falls back to a default such as unknown_service, which makes telemetry from different services hard to tell apart. See OpenTelemetry resources.

That one field sounds small. In practice, wrong or missing service names make every trace harder to trust.
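
One way to make that field hard to forget is to build the resource attributes in a small function that fails fast at startup. The sketch below is illustrative only: buildResourceAttributes is a hypothetical helper, not an OpenTelemetry API. It assumes the service name comes from the standard OTEL_SERVICE_NAME environment variable and reuses the APP_VERSION and NODE_ENV variables from the setup above.

```typescript
type Env = Record<string, string | undefined>

// Hypothetical helper: builds the resource attribute map used at SDK startup
// and refuses to continue without an explicit, non-default service name.
function buildResourceAttributes(env: Env): Record<string, string> {
  const serviceName = (env['OTEL_SERVICE_NAME'] ?? '').trim()

  if (serviceName === '' || serviceName.startsWith('unknown_service')) {
    throw new Error('OTEL_SERVICE_NAME must be set to a stable service name')
  }

  return {
    'service.name': serviceName,
    'service.version': env['APP_VERSION'] ?? 'unknown',
    'deployment.environment.name': env['NODE_ENV'] ?? 'development',
  }
}
```

In the setup file, the result could be passed to resourceFromAttributes(buildResourceAttributes(process.env)), so a missing name stops the deploy instead of quietly producing untrustworthy traces.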


Use Auto-Instrumentation For Edges

Auto-instrumentation is a good first move because it covers boring but important edges:

  • inbound HTTP requests
  • outbound HTTP calls
  • database clients
  • framework middleware
  • supported messaging clients

Those spans answer the first incident question: where did this operation spend time?

But auto-instrumentation does not know your business workflow. It may show a PostgreSQL query, an HTTP request, and a queue publish span, while still hiding that all three were part of createInvoice.

That is where manual spans help.

Use manual spans for business operations that engineers naturally talk about during incidents:

billing.ts
import { SpanStatusCode, trace } from '@opentelemetry/api'

const tracer = trace.getTracer('billing-workflow')

export async function createInvoice(orderId: string, accountId: string) {
  return tracer.startActiveSpan('invoice.create', async (span) => {
    span.setAttributes({
      'order.id': orderId,
      'account.id': accountId,
      'billing.operation': 'invoice.create',
    })

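    try {
      // loadOrder, invoiceGateway, and saveInvoice are application helpers defined elsewhere.
      // Auto-instrumented I/O inside them appears as child spans of invoice.create.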
      const order = await loadOrder(orderId)
      const invoice = await invoiceGateway.create(order)
      await saveInvoice(invoice)

      span.setAttributes({
        'invoice.id': invoice.id,
        'invoice.status': invoice.status,
      })

      return invoice
    } catch (error) {
      span.recordException(error as Error)
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : 'invoice create failed',
      })
      throw error
    } finally {
      span.end()
    }
  })
}

The span is useful because it marks a meaningful operation. The child spans from HTTP clients, database clients, or queue clients explain where time and failure happened inside that operation.

Avoid manual spans around every function. A trace full of tiny internal implementation spans is harder to read than a trace with fewer spans at the boundaries engineers care about.


Design Attributes Like A Query Interface

Span attributes decide whether traces remain useful after the first demo.

OpenTelemetry semantic conventions define common names for operations and data so telemetry is easier to standardize across services, libraries, and platforms. See OpenTelemetry semantic conventions.

Use those conventions when they fit. Then define a small local vocabulary for your domain.

Good attributes are:

  • stable
  • bounded in cardinality
  • safe to store
  • useful during incidents
  • named consistently across services

Weak attributes are:

  • raw URLs with IDs embedded in the path
  • full email addresses
  • unbounded payload fields
  • stack traces as attributes
  • different names for the same concept in each service
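
The first weak attribute, a raw URL with IDs in the path, is usually avoided by recording the framework's route template. Where no template is available, a fallback can collapse ID-like segments. The following is a minimal sketch under that assumption; normalizePath and the ":id" placeholder are hypothetical choices, and real paths may need more patterns than the two shown.

```typescript
// Hypothetical fallback for when the framework does not expose a route template.
// Replaces ID-like path segments with a placeholder so the attribute stays bounded.
const UUID_SEGMENT = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i
const NUMERIC_SEGMENT = /^\d+$/

function normalizePath(rawUrl: string): string {
  // Drop the query string and fragment; they often carry unbounded or sensitive values.
  const cut = rawUrl.search(/[?#]/)
  const path = cut === -1 ? rawUrl : rawUrl.slice(0, cut)

  return path
    .split('/')
    .map((segment) =>
      NUMERIC_SEGMENT.test(segment) || UUID_SEGMENT.test(segment) ? ':id' : segment,
    )
    .join('/')
}
```

With this sketch, /orders/12345/pay?token=abc becomes /orders/:id/pay, which groups all order payments under one attribute value instead of one value per order.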

A practical attribute table might look like this:

| Attribute | Example value | Note |
| --- | --- | --- |
| http.route | /orders/:orderId/pay | safer than raw URL |
| service.name | billing-api | must be explicit and stable |
| deployment.environment.name | production | useful for filtering |
| queue.name | invoice-events | good bounded dimension |
| retry.attempt | 0, 1, 2 | useful for retry analysis |
| account.tier | free, pro, enterprise | safer than account email |
| account.id | internal opaque ID | only if privacy and cardinality are acceptable |

This is where many rollouts decay. One service emits tenant_id, another emits accountId, another emits customer.id, and anyone querying the observability backend has to guess whether they mean the same thing.

Treat attribute names as part of the observability contract.
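
One way to enforce such a contract is a shared vocabulary that maps known aliases to a single canonical name and reports anything unrecognized. The sketch below assumes the team has already decided that tenant_id, accountId, and customer.id all mean account.id; the alias map and normalizeAttributes are hypothetical, not part of OpenTelemetry.

```typescript
type AttributeValue = string | number | boolean

// Hypothetical shared vocabulary: canonical names plus the aliases that
// individual services have been known to emit.
const ATTRIBUTE_ALIASES = new Map<string, string>([
  ['account.id', 'account.id'],
  ['tenant_id', 'account.id'],
  ['accountId', 'account.id'],
  ['customer.id', 'account.id'],
  ['queue.name', 'queue.name'],
  ['retry.attempt', 'retry.attempt'],
])

function normalizeAttributes(input: Record<string, AttributeValue>): {
  attributes: Record<string, AttributeValue>
  unknownKeys: string[]
} {
  const attributes: Record<string, AttributeValue> = {}
  const unknownKeys: string[] = []

  for (const [key, value] of Object.entries(input)) {
    const canonical = ATTRIBUTE_ALIASES.get(key)
    if (canonical === undefined) {
      // Surface vocabulary drift instead of silently exporting a new name.
      unknownKeys.push(key)
      continue
    }
    attributes[canonical] = value
  }

  return { attributes, unknownKeys }
}
```

The unknownKeys list is the useful part during review: it can fail a test or feed a counter, so a new attribute name becomes a deliberate decision rather than an accident.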


Propagate Context Across Async Boundaries

The first serious OpenTelemetry test is not an HTTP request. It is the boundary after the request returns.

Background jobs, message queues, scheduled work, webhook retries, and outbox relays are where traces often break.

OpenTelemetry's JavaScript propagation docs explain that context propagation lets signals correlate regardless of where they are generated and lets traces build causal information across services and process boundaries. They also note that supported libraries usually propagate context automatically, while custom protocols may need manual propagation. See OpenTelemetry JavaScript propagation.

For a queue payload, the shape often looks like this:

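enqueue-invoice-job.ts
import { context, propagation } from '@opentelemetry/api'

// Exported so the worker can import the same job shape.
export type InvoiceJob = {
  type: 'invoice.create'
  orderId: string
  accountId: string
  // Carrier for the propagated trace context headers.
  traceContext: Record<string, string>
}

export async function enqueueInvoiceJob(orderId: string, accountId: string) {
  const traceContext: Record<string, string> = {}

  // Serialize the active trace context into the carrier object.
  propagation.inject(context.active(), traceContext)

  const job: InvoiceJob = {
    type: 'invoice.create',
    orderId,
    accountId,
    traceContext,
  }

  // `queue` stands in for the application's queue client.
  await queue.publish(job)
}

Then the worker extracts the context before creating the consumer span:

invoice-worker.ts
import { SpanStatusCode, context, propagation, trace } from '@opentelemetry/api'
import { createInvoice } from './billing'
import type { InvoiceJob } from './enqueue-invoice-job'

const tracer = trace.getTracer('invoice-worker')

export async function handleInvoiceJob(job: InvoiceJob) {
  // Rebuild the producer's context from the carrier stored on the job.
  const parentContext = propagation.extract(context.active(), job.traceContext)

  return context.with(parentContext, async () => {
    return tracer.startActiveSpan('invoice.worker.process', async (span) => {
      span.setAttributes({
        'job.type': job.type,
        'order.id': job.orderId,
      })

      try {
        // createInvoice takes both IDs, so the job has to carry accountId too.
        await createInvoice(job.orderId, job.accountId)
      } catch (error) {
        // Mark the worker span as failed so the trace shows where the job broke.
        span.recordException(error as Error)
        span.setStatus({ code: SpanStatusCode.ERROR })
        throw error
      } finally {
        span.end()
      }
    })
  })
}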

The exact carrier may be HTTP headers, message headers, job metadata, or an outbox row. The goal is the same: the consumer should not start as an unrelated trace when it is causally related to the producer.
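
It helps to know what is actually inside the carrier. When the W3C Trace Context propagator is in use, the carrier holds a traceparent entry with four dash-separated fields: version, trace ID, parent span ID, and trace flags. The parser below is a sketch for inspecting or validating that value in tests and debugging tools. It handles version 00 only, and parseTraceparent is a hypothetical helper; application code should keep using propagation.extract.

```typescript
type ParsedTraceparent = {
  traceId: string
  parentSpanId: string
  sampled: boolean
}

// version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
const TRACEPARENT_V00 = /^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$/

function parseTraceparent(header: string): ParsedTraceparent | null {
  const value = header.trim()
  if (!TRACEPARENT_V00.test(value)) return null

  const traceId = value.slice(3, 35)
  const parentSpanId = value.slice(36, 52)
  const flags = parseInt(value.slice(53, 55), 16)

  // All-zero IDs are invalid in the W3C Trace Context format.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null

  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, parentSpanId, sampled: (flags & 0x01) === 0x01 }
}
```

A check like this is handy in a queue integration test: publish a job inside a span, read the stored carrier back, and assert that the trace ID matches the producer's trace.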

If your system only needs request grouping before full tracing, Correlation IDs in Microservices is the smaller pattern. OpenTelemetry gives you the stronger path model once grouping alone is not enough.


Add Metrics That Explain Trace Patterns

Traces help inspect one operation. Metrics tell you whether that operation represents a wider problem.

OpenTelemetry describes metrics as runtime measurements with associated metadata, useful for availability, performance, alerting, and scaling decisions. See OpenTelemetry metrics.

For a backend path, useful custom metrics might include:

| Metric | Why it matters |
| --- | --- |
| invoice_create_duration_ms | user-visible latency for the workflow |
| invoice_create_failures_total | failure volume by reason |
| invoice_gateway_retries_total | retry amplification |
| invoice_jobs_lag_ms | async delay after request return |
| invoice_jobs_in_flight | worker saturation |

Keep metric dimensions bounded:

good: route, region, queue_name, dependency, error_class
dangerous: user_email, raw_url, full_exception_message, request_id
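
The reason is multiplicative: a rough upper bound on the number of time series for one metric is the product of the distinct values each dimension can take. The sketch below works through that arithmetic with made-up numbers; estimateSeriesCount is a hypothetical helper and the counts are illustrative, not measurements.

```typescript
// Rough upper bound on time series for one metric: the product of the number
// of distinct values each attribute can take.
function estimateSeriesCount(distinctValuesPerAttribute: Record<string, number>): number {
  return Object.values(distinctValuesPerAttribute).reduce(
    (product, distinctValues) => product * distinctValues,
    1,
  )
}

// Illustrative numbers only.
const boundedSeries = estimateSeriesCount({ route: 40, region: 5, error_class: 8 })
const seriesWithRequestId = estimateSeriesCount({ route: 40, region: 5, request_id: 1000000 })
```

Here the bounded set stays at 1,600 possible series, while swapping error_class for a request_id with a million values raises the bound to 200,000,000. Real backends only store combinations that occur, but a per-request dimension still creates a new series for almost every request.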

A trace might show one checkout request waiting on payment authorization. Metrics tell you whether that is one unlucky request, one region, one provider, one deployment, or a fleet-wide dependency incident.

This is especially important for overload paths. When retries, queues, and backpressure interact, aggregate signals matter as much as individual traces. For that adjacent failure mode, see Adding Retries Can Make Outages Worse.


Keep Logs Correlated, Not Bloated

OpenTelemetry logging support exists to correlate logs with other telemetry and normalize log records, but for JavaScript the official status page currently lists traces and metrics as stable while logs are still in development. See the OpenTelemetry JavaScript status.

That means a pragmatic backend rollout should usually keep the existing logging library and make logs trace-aware.

Good logs near spans include:

  • trace_id
  • span_id
  • service name
  • operation name
  • stable domain identifiers
  • error class
  • short provider response code or reason

Avoid solving trace gaps by dumping payloads into logs.

OpenTelemetry's logging specification describes log correlation through time, execution context such as trace and span IDs, and resource context. See the OpenTelemetry logging specification.

The useful rule is:

Logs should explain a span, not replace the trace.

During an incident, an engineer should be able to open the failing span, jump to the relevant logs, and find the local detail that was too specific for attributes.
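
Making an existing logger trace-aware can be as small as merging two identifiers into each structured record. The sketch below assumes a JSON-lines logger and uses a minimal SpanContextLike type that mirrors the traceId and spanId fields of an OpenTelemetry span context. buildLogLine is a hypothetical helper; in a real service the span context would typically come from trace.getActiveSpan()?.spanContext() in @opentelemetry/api.

```typescript
// Minimal shape mirroring the traceId/spanId fields of a span context.
type SpanContextLike = { traceId: string; spanId: string }

type LogFields = Record<string, string | number | boolean>

// Hypothetical helper: merges trace identifiers into a structured log line.
function buildLogLine(
  level: 'info' | 'warn' | 'error',
  message: string,
  fields: LogFields,
  spanContext?: SpanContextLike,
): string {
  const record = {
    // Caller fields go first so they cannot overwrite the core keys below.
    ...fields,
    level,
    message,
    // Attach identifiers only when a span is active; never log placeholder IDs.
    ...(spanContext ? { trace_id: spanContext.traceId, span_id: spanContext.spanId } : {}),
  }
  return JSON.stringify(record)
}
```

The fields stay short and bounded, such as an error class and a provider status code, which keeps the log line useful as an explanation of the span rather than a copy of the payload.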


Use A Collector When Production Needs Control

In early development, exporting directly to an observability backend can be fine.

In production, the OpenTelemetry Collector is often the cleaner boundary. The official Collector docs describe it as a vendor-agnostic way to receive, process, and export telemetry data, and recommend it for offloading data quickly from services while the Collector handles batching, retries, encryption, filtering, and related processing. See the OpenTelemetry Collector.

That boundary is useful because telemetry pipelines become production systems too.

The Collector can help with:

  • batching exports
  • retrying backend delivery
  • filtering sensitive data
  • routing telemetry to different backends
  • centralizing exporter configuration
  • reducing per-service vendor coupling

Do not hide all observability policy inside service code. If every service owns its own exporter behavior, sampling, filtering, and destination logic, the telemetry system becomes hard to change safely.


Sampling Is A Product Decision, Not Only A Cost Setting

Sampling reduces telemetry volume. In JavaScript, OpenTelemetry documents that all spans are sampled by default and that TraceIdRatioBasedSampler can deterministically sample a percentage of traces. See OpenTelemetry JavaScript sampling.
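
"Deterministically" is the important word: the decision is derived from the trace ID, so the same trace always gets the same answer. The sketch below illustrates that idea only. It is not the SDK's TraceIdRatioBasedSampler implementation, shouldSample is a hypothetical function, and it assumes trace IDs are uniformly random hex strings.

```typescript
// Simplified illustration of deterministic ratio sampling: the decision
// depends only on the trace ID and the configured ratio.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio <= 0) return false
  if (ratio >= 1) return true

  // Treat the first 8 hex characters as an unsigned 32-bit number in [0, 2^32).
  const bucket = parseInt(traceId.slice(0, 8), 16)

  // Keep the trace when its bucket falls inside the first `ratio` share of the range.
  // A malformed ID parses to NaN, which compares false and is not sampled.
  return bucket < ratio * 0x100000000
}
```

Because no random number is drawn at decision time, two processes that evaluate the same trace ID with the same ratio reach the same decision, which is what keeps a sampled trace complete rather than full of holes.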

The trap is treating sampling only as cost control.

Sampling changes what incidents are easy to investigate:

| Sampling decision | Debugging consequence |
| --- | --- |
| Sample 100% on a low-volume critical path | expensive but easy to inspect |
| Sample 10% of all traces | cheaper, but rare failures may vanish |
| Keep errors preferentially | better incident debugging, more pipeline complexity |
| Sample background jobs separately | avoids losing async failures under HTTP volume |
| Tail-sample slow traces | better latency investigations if your pipeline supports it |
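
The "rare failures may vanish" consequence can be put in numbers. Assuming uniform head sampling where each trace is kept independently with the same probability, the chance that every occurrence of a failure is dropped is (1 - ratio) raised to the number of occurrences. The helper below is a worked sketch of that arithmetic; probabilityAllMissed is a hypothetical name.

```typescript
// Probability that none of `occurrences` independent traces is kept
// when each trace is sampled with probability `ratio`.
function probabilityAllMissed(ratio: number, occurrences: number): number {
  return Math.pow(1 - ratio, occurrences)
}

// Illustrative scenarios at 10% sampling.
const missedAfterThree = probabilityAllMissed(0.1, 3)
const missedAfterTwenty = probabilityAllMissed(0.1, 20)
```

At 10% sampling, a failure that happened three times has roughly a 73% chance of leaving no trace at all (0.9 cubed is 0.729), and even after twenty occurrences the chance is still about 12%. That is the argument for keeping errors preferentially or leaving low-volume critical paths unsampled.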

For a young backend system, the practical approach is usually:

  • keep important low-volume paths unsampled at first
  • sample high-volume success traffic when needed
  • verify that errors, slow traces, and rare job failures remain visible
  • document what sampling means before using trace counts as metrics

Metrics should be the source of truth for counts and rates. Sampled traces are evidence paths, not a complete statistical record unless the sampling design supports that use.


Review Instrumentation Like Production Code

Instrumentation changes the debugging contract of the system. Review it with the same seriousness as API or database changes.

Use this checklist:

  • Does the root span name describe a stable operation, not a raw URL?
  • Do child spans cover the important dependency boundaries?
  • Does context survive HTTP, queue, worker, and callback boundaries?
  • Are attributes stable, safe, and low-cardinality?
  • Are service name, version, and environment explicit?
  • Are errors recorded on spans with enough detail to triage?
  • Do metrics show whether a trace represents a broader pattern?
  • Do logs include trace and span identifiers where useful?
  • Is sampling understood by the team?
  • Can an engineer debug one real production path faster than before?

The last question is the real acceptance test.

If a rollout creates beautiful dashboards but does not shorten an actual investigation, keep the rollout narrow and improve the evidence path.


Common OpenTelemetry Mistakes

Instrumenting internals before boundaries

Function-level spans are rarely the first thing a backend team needs. Start with request, dependency, database, queue, and worker boundaries.

Using high-cardinality attributes casually

Raw user input, request IDs, emails, and unique payload values can make telemetry expensive and hard to query. Put them in logs only when safe and necessary.

Letting async work start a new unrelated trace

If queue workers, outbox relays, and scheduled callbacks lose context, traces will miss the hardest part of many incidents.

Treating traces as metrics

Sampled traces are not automatically reliable for rates. Use metrics for counts, saturation, and trends.

Forgetting that telemetry has failure modes

Exporters can block, collectors can fall behind, backend quotas can throttle ingestion, and verbose instrumentation can add overhead. The telemetry path needs its own operational thinking.
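
A common way to keep telemetry from blocking the request path is a bounded buffer that drops and counts new items when the export side falls behind. The class below is a sketch of that idea under simple assumptions (single process, in-memory, synchronous calls); BoundedTelemetryBuffer and its members are hypothetical names, and the SDK's own batch processors apply their own queue limits.

```typescript
// Sketch of a bounded buffer between application code and an exporter.
// When the buffer is full, new items are dropped and counted instead of
// making the caller wait.
class BoundedTelemetryBuffer<T> {
  private items: T[] = []
  private readonly maxSize: number
  droppedCount = 0

  constructor(maxSize: number) {
    this.maxSize = maxSize
  }

  // Returns false when the item was dropped because the buffer is full.
  offer(item: T): boolean {
    if (this.items.length >= this.maxSize) {
      this.droppedCount += 1
      return false
    }
    this.items.push(item)
    return true
  }

  // Removes and returns up to `batchSize` items for the exporter to send.
  drain(batchSize: number): T[] {
    return this.items.splice(0, batchSize)
  }

  get size(): number {
    return this.items.length
  }
}
```

The drop counter matters as much as the bound: exposing it as a metric turns silent telemetry loss into something the team can alert on.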


Production Rollout Plan

A reasonable first rollout looks like this:

  1. Pick one production path with real debugging pain.
  2. Define the questions telemetry must answer.
  3. Enable auto-instrumentation for inbound HTTP, outbound HTTP, database clients, and supported messaging clients.
  4. Add manual spans around the business operation.
  5. Set resource attributes: service, version, environment.
  6. Standardize the first attribute vocabulary.
  7. Propagate context across the first async boundary.
  8. Add a small metric set for latency, failures, retries, and queue lag.
  9. Correlate logs with trace and span IDs.
  10. Use the telemetry during a real incident, bug investigation, or load test.
  11. Fix the evidence path before expanding to the next workflow.

This keeps OpenTelemetry tied to production understanding instead of tool adoption.


Final Takeaway

OpenTelemetry is most valuable when it changes backend debugging from reconstruction to navigation.

A good rollout does not ask, "Do we have traces?" It asks whether a slow request, failed job, retry storm, or cross-service incident is easier to follow than it was before.

That requires more than installing the SDK. It requires stable operation names, useful attributes, context propagation across async boundaries, metrics that explain scale, logs that correlate with spans, a collector/export path the team can operate, and sampling decisions that preserve the failures engineers actually need to inspect.

When those pieces are in place, OpenTelemetry becomes less like an observability checkbox and more like a practical map of how backend work moves through the system.