
Observability vs Logging in Production
The difference between observability and logging becomes a real production problem when a team has plenty of logs but still cannot explain why latency rose, where a request stalled, or which service changed the outcome.
Logging records local events. Observability connects multiple signals so engineers can inspect system behavior from the right angle: metrics for patterns, traces for flow, logs for local detail, and correlation context for joining evidence across boundaries.
The distinction matters because many incidents are not missing-log problems. They are interaction problems.
For the implementation-oriented path after this article, see OpenTelemetry for Backend Engineers. For the narrower failure mode where log volume itself makes incidents harder, see Too Much Logging in Production Breaks Debugging.
The Practical Difference
Logging answers:
What happened here?
Observability helps answer:
What is happening across the system, and which evidence reduces uncertainty?
That sounds abstract until an incident starts.
Imagine p95 checkout latency rises from 350 ms to 3.8 seconds. Error rate is still low. Customer support reports intermittent slow checkouts. Every service writes structured logs. Search works. Dashboards exist, but they are not organized around the checkout path.
A logging-first investigation often starts like this:
- Search checkout logs for slow requests.
- Copy a request ID.
- Search payment logs around the same timestamp.
- Search inventory logs.
- Compare timestamps manually.
- Guess whether payment, inventory, the database, or retries caused the delay.
Nothing about that is foolish. It is often the only available path.
The problem is that the team is asking logs to answer every question.
An observability-first investigation uses each signal for the question it is good at:
| Question | Better first signal | Why |
|---|---|---|
| Is the issue broad or isolated? | metrics | shows rate, latency, errors, saturation, and trends |
| Which operation path is slow? | traces | shows one request or job across service boundaries |
| What local detail explains a span or error? | logs | shows payload shape, domain state, exception, or decision point |
| Which evidence belongs together? | correlation IDs or trace context | joins logs, traces, and async work |
| Did a mitigation work? | metrics plus traces | shows system-level change and path-level behavior |
Observability is not "more data." It is better routing from question to evidence.
OpenTelemetry describes signals as system outputs that show activity from different angles, and lists traces, metrics, logs, and baggage as supported signals. See the OpenTelemetry signals documentation.
Why Logging Alone Feels Sufficient
Logs are familiar because they are close to code.
When a developer adds a log line, they choose the message, fields, severity, and placement. That makes logging feel precise. It can capture domain language that generic metrics and traces do not know:
logger.warn('Payment authorization retried after provider timeout', {
  requestId,
  orderId,
  paymentProvider,
  retryAttempt,
  providerStatus,
})
That line is useful. During an incident, it can explain a local decision that a metric cannot.
Logs are strong for:
- exception details
- validation failures
- domain decisions
- audit-style events
- rare branch explanations
- one component's local state
- support investigations
OpenTelemetry defines a log as a recording of an event, usually a timestamped text record with optional metadata, and recommends structured logs where possible. See OpenTelemetry's logs concept page.
So the mistake is not using logs.
The mistake is expecting logs to preserve the whole incident shape after the system becomes concurrent, distributed, asynchronous, and partially failing.
Where Logging Starts To Break Down
Logs become weak when the important question is not "did this line run?"
They struggle with questions like:
- Which dependency caused the p95 spike?
- Did the retry happen before or after the provider accepted the request?
- Which queue delay belongs to this user-visible timeout?
- Did a database lock cause worker lag, or did worker lag cause lock pressure?
- Is this an isolated bad request or a system-wide saturation pattern?
- Did the cache return stale data before or after the write committed?
Logs can contain pieces of the answer. But the answer is a relationship, not a single event.
That is where observability begins to matter.
Logs Do Not Show Pattern Well
Logs can be counted, but they are not naturally a time-series model.
During an incident, the first question is usually about shape:
- how many requests are affected?
- when did it begin?
- is it region-specific?
- is p95 rising while p50 stays normal?
- are retries masking errors?
- is queue depth growing before latency?
Metrics answer those questions more directly.
A log query may eventually show the pattern, but it often starts from a guessed field, guessed severity, or guessed message. Metrics should make the pattern visible before the team knows which log line matters.
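As a rough illustration of what "shape" means here, the sketch below groups latency samples into fixed time windows and reports p50 and p95 per window. It is a minimal in-memory sketch, not how a metrics backend computes percentiles: the `LatencySample` shape, the function names, and the nearest-rank percentile method are all assumptions made for this example.

```typescript
// Hypothetical latency sample; field names are illustrative assumptions.
interface LatencySample {
  timestampMs: number; // when the request finished
  durationMs: number; // how long it took
}

interface WindowSummary {
  windowStartMs: number;
  count: number;
  p50: number;
  p95: number;
}

// Nearest-rank percentile over a list of values (p between 0 and 100).
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1] ?? NaN;
}

// Group samples into fixed windows and summarize each window.
function summarizeByWindow(samples: LatencySample[], windowMs: number): WindowSummary[] {
  const buckets = new Map<number, number[]>();
  for (const sample of samples) {
    const start = Math.floor(sample.timestampMs / windowMs) * windowMs;
    const bucket = buckets.get(start) ?? [];
    bucket.push(sample.durationMs);
    buckets.set(start, bucket);
  }
  const summaries: WindowSummary[] = [];
  buckets.forEach((durations, windowStartMs) => {
    summaries.push({
      windowStartMs,
      count: durations.length,
      p50: percentile(durations, 50),
      p95: percentile(durations, 95),
    });
  });
  // Oldest window first, so the onset of a change is easy to spot.
  return summaries.sort((a, b) => a.windowStartMs - b.windowStartMs);
}
```

In a window with 18 requests at 300 ms and 2 requests at 3800 ms, this reports a p50 of 300 and a p95 of 3800. That is the "p95 rising while p50 stays normal" pattern, visible without knowing which log line to search for.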
Logs Do Not Show Flow Well
A request may move through:
frontend -> checkout-api -> payment-api -> provider
                |
                -> inventory-api -> database
                |
                -> outbox -> receipt-worker
Each service can log honestly and still leave the team reconstructing flow by hand.
Traces are designed for that shape. They show how one logical operation moved through boundaries, where time accumulated, and where the operation failed or branched.
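To make "where time accumulated" concrete, here is a small sketch that takes the spans of one trace and ranks them by self time, meaning the span's duration minus the time spent in its direct children. The `SpanRecord` shape and the sample numbers are hypothetical, and the arithmetic assumes child spans run one after another rather than in parallel.

```typescript
// Hypothetical, simplified span shape; real tracing backends carry more fields.
interface SpanRecord {
  spanId: string;
  parentSpanId: string | null;
  name: string;
  startMs: number;
  endMs: number;
}

interface SpanSelfTime {
  name: string;
  selfMs: number;
}

// Self time = span duration minus the summed duration of its direct children.
// Assumes children do not overlap; parallel children would need interval merging.
function rankBySelfTime(spans: SpanRecord[]): SpanSelfTime[] {
  const childMs = new Map<string, number>();
  for (const span of spans) {
    if (span.parentSpanId !== null) {
      const soFar = childMs.get(span.parentSpanId) ?? 0;
      childMs.set(span.parentSpanId, soFar + (span.endMs - span.startMs));
    }
  }
  return spans
    .map((span) => ({
      name: span.name,
      selfMs: span.endMs - span.startMs - (childMs.get(span.spanId) ?? 0),
    }))
    .sort((a, b) => b.selfMs - a.selfMs);
}

// Made-up trace for one slow checkout.
const slowCheckoutTrace: SpanRecord[] = [
  { spanId: "a", parentSpanId: null, name: "checkout.create", startMs: 0, endMs: 3820 },
  { spanId: "b", parentSpanId: "a", name: "inventory.reserve", startMs: 10, endMs: 90 },
  { spanId: "c", parentSpanId: "a", name: "payment.authorize", startMs: 100, endMs: 3700 },
  { spanId: "d", parentSpanId: "c", name: "provider.http.request", startMs: 120, endMs: 3650 },
];
```

Calling `rankBySelfTime(slowCheckoutTrace)` puts `provider.http.request` first. Reaching the same conclusion from logs alone would mean comparing timestamps across three services by hand.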
If your current system cannot even keep one request handle across services and jobs, start with Correlation IDs in Microservices. Correlation is not full observability, but disconnected logs are a hard place to investigate from.
Logs Do Not Show Causality By Themselves
Log order is tempting.
It looks like a timeline:
10:03:01 checkout started
10:03:02 payment retry started
10:03:03 inventory reserved
10:03:04 checkout completed
In a distributed system, that sequence may be a query artifact. Events can be buffered, ingested late, emitted by different clocks, or generated by different attempts of the same logical operation.
The team still has to reason about causality.
OpenTelemetry's logging specification describes log correlation through time, trace context, and resource context. The important part is that correlation is explicit metadata, not a story inferred from raw line order. See the OpenTelemetry logging specification.
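The difference between an inferred timeline and explicit correlation can be sketched in a few lines. The example below assumes each log record already carries a trace ID and an attempt number. The `CorrelatedLog` shape, the field names, and the sample records are all hypothetical.

```typescript
// Hypothetical log record; the field names are assumptions for illustration.
interface CorrelatedLog {
  timestamp: string; // wall-clock time as reported by the emitting host
  traceId: string;
  attempt: number;
  message: string;
}

function sortByTimestamp(logs: CorrelatedLog[]): CorrelatedLog[] {
  return logs
    .slice()
    .sort((a, b) => (a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0));
}

// Naive view: one merged timeline ordered only by reported timestamp.
function byTimestamp(logs: CorrelatedLog[]): string[] {
  return sortByTimestamp(logs).map((entry) => entry.message);
}

// Correlated view: group by trace and attempt first, then order within each group.
function byTraceAndAttempt(logs: CorrelatedLog[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const entry of sortByTimestamp(logs)) {
    const key = `${entry.traceId}#${entry.attempt}`;
    const messages = groups.get(key) ?? [];
    messages.push(entry.message);
    groups.set(key, messages);
  }
  return groups;
}

// Made-up records: two checkouts, one of which retried.
const sampleLogs: CorrelatedLog[] = [
  { timestamp: "10:03:01", traceId: "trace-781", attempt: 1, message: "payment authorization started" },
  { timestamp: "10:03:02", traceId: "trace-902", attempt: 1, message: "payment authorization started" },
  { timestamp: "10:03:02", traceId: "trace-781", attempt: 1, message: "provider timeout" },
  { timestamp: "10:03:03", traceId: "trace-781", attempt: 2, message: "payment authorization started" },
  { timestamp: "10:03:03", traceId: "trace-902", attempt: 1, message: "provider accepted" },
  { timestamp: "10:03:04", traceId: "trace-781", attempt: 2, message: "provider accepted" },
];
```

The merged timeline interleaves three separate stories. Grouping by `traceId` and `attempt` separates them, so a "provider accepted" line is never credited to the wrong request or the wrong attempt. Ordering inside a group still relies on timestamps, which is easier to defend when one attempt's records come from a single process and clock.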
A Slow Checkout Example
Suppose checkout p95 latency rises after a deploy.
The log-heavy investigation finds messages like this:
checkout-api: started checkout request_id=req-781
payment-api: provider timeout request_id=req-781 attempt=1
payment-api: retrying authorization request_id=req-781 attempt=2
inventory-api: reservation succeeded request_id=req-781
checkout-api: checkout completed request_id=req-781 duration_ms=3820
This is useful local evidence. It is not enough.
The team still needs to know:
- Is this only one request or a broad latency shift?
- Did p95 rise before or after retries increased?
- Is the payment provider slow, or is the payment service queueing locally?
- Did inventory contribute meaningful time?
- Did the deploy change all regions or only one?
- Are successful requests slow too, or only retried requests?
An observability-driven investigation might look like this:
| Step | Signal | What it answers |
|---|---|---|
| 1 | checkout latency metric by region | p95 rose only in us-east |
| 2 | retry counter by provider | retries increased for one payment provider |
| 3 | trace for slow checkout | most latency is inside payment.authorize span |
| 4 | payment logs linked from the span | provider returned timeout after remote acceptance |
| 5 | database and queue metrics | inventory and receipt worker were stable |
| 6 | deploy marker | new payment timeout config started the pattern |
The logs still matter. They explain the local provider response.
But metrics and traces narrowed the search first.
That ordering is the practical difference between logging and observability.
What Each Signal Should Own
One way to reduce observability confusion is to assign each signal a clear responsibility.
Metrics Own Shape
Metrics should answer:
- how often is it happening?
- how slow is it?
- where is it happening?
- is it getting worse?
- is the system saturated?
- did the mitigation help?
Good production metrics include:
| Metric | Why it helps |
|---|---|
| request rate | separates traffic change from failure change |
| error rate by class | shows whether failures are broad or specific |
| latency percentiles | shows tail behavior hidden by averages |
| retry count | reveals amplification and masked dependency issues |
| queue depth and age | shows delayed work before users complain |
| saturation metrics | shows capacity limits and backpressure |
Metrics do not explain every individual request. They tell the team where to look.
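As a sketch of how a retry metric can point at a dependency, the example below keeps labeled counters in memory and derives a retry ratio per payment provider. `CounterRegistry` is a hand-rolled stand-in for a metrics library, and the metric names and the `provider` label are assumptions for this example.

```typescript
type Labels = Record<string, string>;

// Minimal in-memory counters keyed by metric name plus sorted labels.
// A stand-in for a real metrics library, for illustration only.
class CounterRegistry {
  private readonly counts = new Map<string, number>();

  private key(metric: string, labels: Labels): string {
    const parts = Object.keys(labels)
      .sort()
      .map((labelName) => `${labelName}=${labels[labelName]}`);
    return `${metric}{${parts.join(",")}}`;
  }

  increment(metric: string, labels: Labels, by: number = 1): void {
    const k = this.key(metric, labels);
    this.counts.set(k, (this.counts.get(k) ?? 0) + by);
  }

  get(metric: string, labels: Labels): number {
    return this.counts.get(this.key(metric, labels)) ?? 0;
  }
}

// Retries per authorization request for one provider; 0 when there is no traffic.
function retryRatio(registry: CounterRegistry, provider: string): number {
  const requests = registry.get("payment_authorize_requests_total", { provider });
  const retries = registry.get("payment_authorize_retries_total", { provider });
  return requests === 0 ? 0 : retries / requests;
}
```

If one provider shows a retry ratio of 0.4 while another sits at 0.02, the team knows which traces and logs to open next, even while the overall error rate still looks healthy.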
Traces Own Flow
Traces should answer:
- which service handled this operation?
- where did it spend time?
- which downstream call failed?
- which retry or async branch was involved?
- what attributes describe the operation?
Useful spans are not just automatic HTTP spans. They often include domain operations:
checkout.create
inventory.reserve
payment.authorize
provider.http.request
outbox.enqueue_receipt
That shape lets engineers see the operation, not just the process that logged it.
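A minimal sketch of how named domain spans might be recorded follows. `SketchTracer` is a hypothetical in-memory recorder, not the OpenTelemetry API. It is synchronous and takes an injected clock so the example stays deterministic, whereas a real service would wrap asynchronous calls with a tracing SDK.

```typescript
interface RecordedSpan {
  name: string;
  parentName: string | null;
  startMs: number;
  endMs: number;
  failed: boolean;
}

// Hypothetical in-memory tracer used only to illustrate span naming and nesting.
class SketchTracer {
  readonly finished: RecordedSpan[] = [];
  private readonly stack: string[] = [];
  private readonly now: () => number;

  constructor(now: () => number) {
    this.now = now;
  }

  // Runs fn inside a named span and records timing even when fn throws.
  withSpan<T>(name: string, fn: () => T): T {
    const parentName = this.stack[this.stack.length - 1] ?? null;
    const startMs = this.now();
    this.stack.push(name);
    let failed = false;
    try {
      return fn();
    } catch (err) {
      failed = true;
      throw err;
    } finally {
      this.stack.pop();
      this.finished.push({ name, parentName, startMs, endMs: this.now(), failed });
    }
  }
}

// Simulated checkout: the clock is advanced by hand instead of doing real work.
function runCheckoutSketch(): RecordedSpan[] {
  let clockMs = 0;
  const tracer = new SketchTracer(() => clockMs);
  tracer.withSpan("checkout.create", () => {
    tracer.withSpan("inventory.reserve", () => {
      clockMs += 80;
    });
    tracer.withSpan("payment.authorize", () => {
      tracer.withSpan("provider.http.request", () => {
        clockMs += 3500;
      });
    });
  });
  return tracer.finished;
}
```

Spans are recorded as they finish, so children appear before their parents. Each record keeps its parent's name, which is enough to rebuild the operation tree rather than a per-process list of events.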
Logs Own Local Detail
Logs should answer:
- what exact provider response did we receive?
- which validation rule rejected this input?
- which branch did a domain decision take?
- what local exception was thrown?
- what context should support or audit staff see later?
A good log line is not "more text." It is a local fact connected to the rest of the investigation.
Correlation Owns Joining
Correlation IDs, trace IDs, span IDs, resource attributes, and consistent request context help connect evidence.
OpenTelemetry's logs documentation notes that application logs can be correlated with active traces and spans when instrumentation is active. That matters because the log is no longer a disconnected line; it becomes evidence attached to a path.
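The idea of a log line attached to a path can be sketched as follows. The example stamps each structured log line with a trace ID and a span ID, then looks up the lines that belong to one span. The `TraceContext` shape, the `trace_id` and `span_id` field names, and the explicit context passing are assumptions. With real instrumentation, the IDs would usually come from the active span rather than from a hand-built object.

```typescript
// Hypothetical context shape, passed explicitly to keep the sketch self-contained.
interface TraceContext {
  traceId: string;
  spanId: string;
}

type Severity = "info" | "warn" | "error";

// Builds one structured log line (JSON) stamped with trace and span IDs.
// Caller fields are spread first so they cannot overwrite the correlation fields.
function formatLog(
  ctx: TraceContext,
  severity: Severity,
  message: string,
  fields: Record<string, unknown> = {},
): string {
  return JSON.stringify({
    ...fields,
    severity,
    message,
    trace_id: ctx.traceId,
    span_id: ctx.spanId,
  });
}

// Returns the parsed log lines that belong to one span.
function logsForSpan(lines: string[], spanId: string): Record<string, unknown>[] {
  return lines
    .map((line) => JSON.parse(line) as Record<string, unknown>)
    .filter((entry) => entry["span_id"] === spanId);
}
```

With lines shaped like this, the jump from a slow `payment.authorize` span to its logs becomes a lookup by ID instead of a search by timestamp and service name.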
A Decision Table For Instrumentation
When adding instrumentation, choose the signal from the question.
| Need | Use | Avoid |
|---|---|---|
| see whether checkout latency is getting worse | metric | scanning logs for slow messages |
| find where one checkout spent time | trace | comparing timestamps across services by hand |
| explain why payment retry happened | log | adding more spans for every branch detail |
| connect worker logs to originating request | correlation context | relying on wall-clock proximity |
| prove a deploy improved p95 | metric plus deploy marker | a few successful example logs |
| debug one malformed webhook payload | log | broad new dashboards |
| inspect an async workflow across services | trace plus correlation ID | one service's logs only |
This table also prevents instrumentation sprawl.
If every signal tries to answer every question, the system becomes noisy and expensive without becoming easier to debug.
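The "connect worker logs to originating request" row is the one most often skipped, so here is a rough sketch of it. The originating request's correlation ID travels inside the queued message, and the worker logs it together with the queue age. The `OutboxMessage` envelope, the function names, and the log field names are hypothetical, and the queue is a plain in-memory array.

```typescript
// Hypothetical outbox message envelope; field names are assumptions.
interface OutboxMessage<T> {
  correlationId: string; // ID of the request that created the work
  enqueuedAtMs: number;
  payload: T;
}

// Producer side: carry the correlation ID with the payload.
function enqueue<T>(
  queue: OutboxMessage<T>[],
  correlationId: string,
  payload: T,
  nowMs: number,
): void {
  queue.push({ correlationId, enqueuedAtMs: nowMs, payload });
}

// Worker side: log with the originating correlation ID and the queue age.
function processNext<T>(
  queue: OutboxMessage<T>[],
  nowMs: number,
  log: (line: string) => void,
): T | undefined {
  const message = queue.shift();
  if (message === undefined) return undefined;
  log(
    JSON.stringify({
      message: "receipt job started",
      correlation_id: message.correlationId,
      queue_age_ms: nowMs - message.enqueuedAtMs,
    }),
  );
  return message.payload;
}
```

Because the ID rides in the envelope, the worker's log line can be joined to the checkout request by value rather than by wall-clock proximity.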
Common Migration Mistakes
Teams often make the logging-to-observability shift harder than necessary.
Mistake 1: Treating Observability As A Vendor Switch
Changing tools does not automatically produce observability.
If spans are missing business names, metrics do not match service-level symptoms, and logs cannot be correlated with requests, the new platform may just display disconnected evidence in a nicer interface.
The operating model matters more than the logo.
Mistake 2: Adding Traces Without Useful Attributes
Traces without useful names and attributes become screenshots of latency, not explanations.
Prefer attributes that help incident questions:
checkout.region
payment.provider
payment.retry_attempt
queue.name
job.kind
feature_flag.variant
Avoid attributes with unbounded values unless the backend and sampling plan can handle them.
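One way to catch unbounded attributes before they reach the backend is to count distinct values per attribute key, for example in a test or a staging environment. The sketch below is a hypothetical `CardinalityGuard`. It keeps every value in memory, so it suits a quick check, not production traffic. The limit and the `order.id` attribute are assumptions chosen for the example.

```typescript
// Tracks distinct values seen per attribute key and flags keys above a limit.
class CardinalityGuard {
  private readonly limit: number;
  private readonly seen = new Map<string, Set<string>>();

  constructor(limit: number) {
    this.limit = limit;
  }

  // Records one attribute value; returns false once the key has exceeded the limit.
  observe(key: string, value: string): boolean {
    const values = this.seen.get(key) ?? new Set<string>();
    values.add(value);
    this.seen.set(key, values);
    return values.size <= this.limit;
  }

  // Keys whose distinct-value count is above the limit, in alphabetical order.
  overLimit(): string[] {
    const keys: string[] = [];
    this.seen.forEach((values, key) => {
      if (values.size > this.limit) keys.push(key);
    });
    return keys.sort();
  }
}
```

An attribute like `payment.provider` stays within a small limit, while something like a per-order identifier exceeds it almost immediately. That is the signal to drop the attribute, bucket it, or confirm that the backend and sampling plan can absorb it.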
Mistake 3: Keeping Logs As The Main Incident Interface
If every incident still begins with "search all logs for this time window," observability has not changed the workflow yet.
A better default:
- Use metrics to define the problem shape.
- Use traces to find the affected path.
- Use logs to explain local decisions inside that path.
- Use hypotheses and experiments to verify the cause.
That investigation discipline is the same one described in How to Debug Effectively.
When Logging Alone Is Enough
Not every system needs a full tracing rollout immediately.
Logging plus basic metrics may be enough when:
- the application is small
- most work happens in one process
- failures are local and reproducible
- request paths are short
- async processing is minimal
- manual timeline reconstruction is rare
In that situation, disciplined structured logging can be the right level of investment. Google Cloud's structured logging documentation describes how JSON log payloads make individual fields queryable and indexable, which is often a useful step before broader observability work. See Google Cloud's structured logging documentation.
The trigger for more observability is not team size or fashionable tooling.
It is repeated uncertainty during incidents:
- "We have logs, but we cannot tell where the time went."
- "We can see errors, but not which dependency caused them."
- "We know a job failed, but not which request created it."
- "We cannot tell whether this is one bad request or a system trend."
- "Every incident starts with a giant log search."
When those sentences become normal, logging is carrying too much of the investigation.
Practical Upgrade Path
A useful observability rollout can be small.
Start with one critical path, not the whole system.
- Pick one user-visible operation, such as checkout, signup, payment, or search.
- Define the questions the team struggles to answer during incidents.
- Add metrics for rate, errors, latency, retries, queue age, and saturation.
- Add traces around the operation path and the most important downstream calls.
- Make logs structured and correlate them with trace or request context.
- Use the signals during a real investigation.
- Remove or downgrade logs that no longer answer useful questions.
This is also a good way to keep cost under control. Instrumentation should be expanded because it helps answer real questions, not because every service needs every possible signal on day one.
Final Takeaway
Logging is part of observability, but it is not observability by itself.
Logs are strongest as local evidence. Metrics show the shape of production behavior. Traces show the path of an operation. Correlation context connects the evidence.
When a production incident is about one local error, logs may be enough. When the incident is about timing, causality, retries, queues, saturation, or cross-service interaction, logs alone usually force engineers to reconstruct relationships by hand.
The practical goal is not to collect more telemetry.
The goal is to make the next debugging question cheaper to answer.