
Transactional Outbox Pattern in Microservices
Publishing an event after a database write sounds straightforward right up until one succeeds and the other does not. That gap is where otherwise reasonable microservice workflows start producing lost events, duplicate processing, and state that no longer matches what downstream systems believe.
Where the Dual-Write Problem Comes From
A service handles a request:
- create an order
- save it in the database
- publish `order_created` to a message broker
At first, this looks like one logical action. In code, though, it is usually two separate writes to two separate systems:
- write to the database
- write to the broker
Those writes do not share one atomic transaction.
That means several bad outcomes are possible:
- the database write succeeds but event publishing fails
- the event is published but the database transaction rolls back
- the service crashes between the two steps
- the publish call times out after the broker may already have accepted the message
Once this happens, downstream services stop seeing the same reality as the source service.
That is the dual-write problem.
Why "Save Then Publish" Feels Safe
Most implementations begin with a simple assumption:
If the service writes to the database first and publishes immediately after, the two steps are close enough to be treated as one operation.
That assumption is understandable.
The code is short. The message broker is healthy. Failures seem unlikely. Retries appear available if publishing fails.
But "unlikely" is not the same thing as "impossible."
In production, process crashes, transient broker failures, deploy interruptions, connection resets, and timeout ambiguity all happen eventually. Once they do, a service that looked correct in code review can start losing messages in exactly the places the architecture depends on them.
This is the same broader pattern behind Why Tests Pass but Production Still Breaks: the system boundary is wider than the code path that looked safe in isolation.
What the Transactional Outbox Pattern Does
The transactional outbox pattern changes the workflow:
- write business data to the database
- write an outbox record to the same database transaction
- commit once
- publish outbox records to the broker asynchronously
The critical change is step 2.
Instead of trying to atomically write to the database and the broker together, the service makes the database the source of truth for both:
- the business state change
- the fact that an event must be published
If the transaction commits, both facts exist. If the transaction rolls back, neither fact exists.
That removes the most dangerous inconsistency: state changed, but no durable record exists that the event still needs to be sent.
How the Flow Works in Practice
Consider an order service.
Without an outbox:
```typescript
await db.order.create({
  data: {
    id: orderId,
    customerId,
    status: 'placed',
  },
});

await broker.publish('order.created', {
  orderId,
  customerId,
});
```
If the process crashes after `db.order.create(...)` and before `broker.publish(...)`, the order exists but no event is emitted.
With an outbox:
```typescript
await db.$transaction(async (tx) => {
  await tx.order.create({
    data: {
      id: orderId,
      customerId,
      status: 'placed',
    },
  });

  await tx.outbox.create({
    data: {
      topic: 'order.created',
      aggregateType: 'order',
      aggregateId: orderId,
      payload: {
        orderId,
        customerId,
      },
      status: 'pending',
      createdAt: new Date(),
    },
  });
});
```
Later, a publisher process reads pending outbox rows and sends them to the broker.
The service no longer depends on one fragile moment between two systems.
Example Outbox Table Design
A practical outbox table often includes:
```sql
CREATE TABLE outbox_events (
  id             BIGSERIAL PRIMARY KEY,
  topic          TEXT NOT NULL,
  aggregate_type TEXT NOT NULL,
  aggregate_id   TEXT NOT NULL,
  payload        JSONB NOT NULL,
  status         TEXT NOT NULL DEFAULT 'pending',
  published_at   TIMESTAMPTZ,
  available_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
  created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
  last_error     TEXT
);

CREATE INDEX idx_outbox_pending
  ON outbox_events (status, available_at, id);
```
Useful fields:
- `topic` for routing to the correct destination
- `aggregate_type` and `aggregate_id` for traceability
- `payload` for the event body
- `status` for the publisher workflow
- `available_at` for retry backoff
- `last_error` for debugging failed delivery
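The `(status, available_at, id)` index exists to serve the publisher's claim query. A minimal in-memory model of that selection is sketched below; the row shape mirrors the columns above, and the SQL in the comment is an illustrative shape rather than a prescribed query:

```typescript
// Models the publisher's claim query against the table above: pending
// rows whose retry time has arrived, oldest first, up to a batch limit.
// In SQL this is roughly:
//   SELECT * FROM outbox_events
//   WHERE status = 'pending' AND available_at <= now()
//   ORDER BY id LIMIT $1 FOR UPDATE SKIP LOCKED;
interface OutboxRow {
  id: number;
  status: "pending" | "published" | "failed";
  availableAt: Date;
}

function claimBatch(rows: OutboxRow[], now: Date, limit: number): OutboxRow[] {
  return rows
    .filter((r) => r.status === "pending" && r.availableAt <= now)
    .sort((a, b) => a.id - b.id)
    .slice(0, limit);
}
```

The `FOR UPDATE SKIP LOCKED` clause matters once you run more than one publisher instance: each worker claims a disjoint batch instead of blocking on rows another worker already holds.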
If you expect large event volume, retention and cleanup policy matter from the beginning.
Why This Solves the Hardest Failure Boundary
The outbox pattern does not guarantee that publishing never fails. It guarantees something more useful:
the intent to publish is stored durably in the same transaction as the business change.
That means a temporary broker outage no longer causes silent message loss. Publishing may be delayed, but the work is still recoverable.
This makes the failure mode operational rather than invisible:
- outbox backlog grows
- retry counts rise
- publisher alerts fire
Those are painful problems, but they are much safer than "event disappeared and nobody knows."
This is closely related to Background Jobs in Production, where the real requirement is not just retrying work, but retrying it from a state the system can still reason about.
What the Publisher Process Must Handle
The relay or publisher component is where many teams reintroduce correctness problems.
Its responsibilities usually include:
- reading pending records in batches
- publishing them to the broker
- marking them published only after confirmed send
- retrying transient failures with backoff
- surfacing poison events for investigation
A simplified flow:
```typescript
const batch = await repo.getPendingEvents({ limit: 100 });

for (const event of batch) {
  try {
    await broker.publish(event.topic, event.payload);
    await repo.markPublished(event.id);
  } catch (error) {
    await repo.markFailedAttempt(event.id, error);
  }
}
```
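One piece that flow leaves implicit is the retry delay. A common approach is for the failure handler to push `available_at` further out on each failed attempt, doubling up to a cap. The function name and parameters here are hypothetical, a sketch of one way to compute that schedule:

```typescript
// Hypothetical backoff schedule for a failed publish attempt: the delay
// doubles per attempt (1s, 2s, 4s, ...) and is capped so a long outage
// does not push retries out indefinitely.
function nextAvailableAt(
  now: Date,
  attempt: number, // 1-based count of failed publish attempts so far
  baseMs = 1_000,
  capMs = 5 * 60_000,
): Date {
  const delayMs = Math.min(baseMs * 2 ** (attempt - 1), capMs);
  return new Date(now.getTime() + delayMs);
}
```

Adding a small random jitter on top of this is usually worthwhile so that a batch of rows that failed together does not retry in lockstep.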
This means your publisher should be treated like any other production worker:
- idempotent where possible
- observable
- backpressure-aware
- safe under restart
The same overload patterns described in Rate Limiting and Backpressure in Microservices still apply if your publisher drains work faster or slower than the rest of the system can tolerate.
Duplicates Still Exist, So Consumers Must Be Idempotent
The transactional outbox pattern solves lost-event risk much better than naive dual writes. It does not create exactly-once delivery across your whole architecture.
Duplicates can still happen if:
- the broker accepts a message but the publisher crashes before marking the row published
- retry logic republishes after an ambiguous timeout
- downstream consumers retry their own processing
That is why outbox producers and event consumers must be designed together.
Consumers should assume:
- the same event may arrive more than once
- ordering may not be perfect
- processing may resume after partial completion
This is the same correctness boundary discussed in Idempotency Keys for Duplicate API Requests, but moved from synchronous APIs into async messaging.
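A minimal sketch of that consumer-side discipline, assuming each event carries a stable id: record processed ids and skip redeliveries. In production the processed-id store and the side effect would share one transaction; here an in-memory Set stands in for that store.

```typescript
// Wraps a handler so that redelivered events are processed at most once.
// The Set is a stand-in for a durable processed_events table that would
// be written in the same transaction as the handler's side effects.
function makeIdempotentConsumer(handle: (payload: unknown) => void) {
  const processed = new Set<string>();
  return (eventId: string, payload: unknown): boolean => {
    if (processed.has(eventId)) return false; // duplicate delivery: skip
    handle(payload);
    processed.add(eventId);
    return true;
  };
}
```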
Polling Publisher vs CDC
There are two common ways to publish outbox records.
Polling publisher
A worker periodically queries the outbox table for pending rows.
Advantages:
- simple to implement
- easy to reason about
- works with normal application code
Tradeoffs:
- adds read load to the database
- introduces polling delay
- needs careful batching and indexing
Change data capture (CDC)
A CDC tool reads database changes from the transaction log and forwards outbox events to the broker.
Advantages:
- lower polling overhead
- lower latency
- good fit for high-throughput systems
Tradeoffs:
- more operational complexity
- more infrastructure to own
- harder local development story
For many teams, polling is the right first implementation. CDC becomes attractive when throughput or latency requirements justify the additional operational surface area.
Common Implementation Mistakes
Teams often adopt the outbox pattern and still leave important gaps:
Writing to the outbox outside the transaction
If the business row and outbox row are not committed together, the core guarantee is gone.
Deleting rows immediately after publish
Fast deletion removes useful audit and debugging context. A retention window is usually safer than instant cleanup.
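A retention pass is a small amount of code. This sketch (names are illustrative, not from the original) selects rows that are safe to prune: published, and older than the retention window. Pending or failed rows are never touched.

```typescript
// Retention instead of delete-on-publish: keep published rows for a
// window (audit, debugging, replay), drop only the ones past it.
interface RetainedRow {
  id: number;
  status: string;
  publishedAt: Date | null;
}

function rowsToPrune(rows: RetainedRow[], now: Date, retentionMs: number): number[] {
  return rows
    .filter(
      (r) =>
        r.status === "published" &&
        r.publishedAt !== null &&
        now.getTime() - r.publishedAt.getTime() > retentionMs,
    )
    .map((r) => r.id);
}
```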
Forgetting consumer idempotency
The producer became safer, but downstream processing still assumes exactly once. That is where duplicates become incidents.
Publishing huge payloads by default
Large payloads inflate storage, relay cost, and replay complexity. Events should communicate useful facts, not entire mutable object graphs.
No monitoring on backlog age
Queue length alone is not enough. The age of the oldest unpublished outbox row is often a better signal of user-facing delay.
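Computing that signal is straightforward. A sketch of the metric, using the `status` and `created_at` fields from the table design above:

```typescript
// Backlog age: how long the oldest unpublished row has been waiting.
// A growing count with a small age is a burst; a growing age means
// delivery has stalled and downstream delay is user-visible.
interface BacklogRow {
  createdAt: Date;
  status: string;
}

function oldestPendingAgeMs(rows: BacklogRow[], now: Date): number {
  const pending = rows.filter((r) => r.status === "pending");
  if (pending.length === 0) return 0;
  const oldest = Math.min(...pending.map((r) => r.createdAt.getTime()));
  return now.getTime() - oldest;
}
```

In SQL the same metric is a one-row aggregate over `outbox_events` (`min(created_at)` where `status = 'pending'`), which makes it cheap to export on every scrape.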
Treating the outbox as a performance feature
It is a correctness pattern first. If the publisher falls behind, you now have operational work to do rather than silent inconsistency. That is an improvement, but only if the system is observed well.
When the Transactional Outbox Pattern Is Worth It
Use it when:
- domain events drive downstream workflows
- missed events create data inconsistency or product risk
- you cannot tolerate "DB committed, message lost"
- your service owns both the business write and the event publication decision
It is especially useful in workflows like:
- orders and payments
- billing and invoicing
- user lifecycle events
- search indexing triggers
- webhook dispatch pipelines
If losing an event would require manual repair or leave other services permanently stale, the outbox pattern is usually justified.
When It May Be Too Much
Not every system needs an outbox.
It may be unnecessary when:
- the event is informational and easy to rebuild later
- occasional loss is acceptable
- the system is still simple enough that synchronous coupling is intentional
- an existing platform already guarantees the workflow another way
The cost is real:
- another table to manage
- publisher infrastructure
- retry and cleanup policy
- consumer idempotency work
The point is not to add ceremony. The point is to remove an ambiguity that becomes expensive once multiple services depend on the same event stream.
The Core Principle
The transactional outbox pattern works because it stops pretending the database write and broker publish are one atomic action.
They are not.
Instead, it makes one thing atomic:
the business change and the durable record that the event still needs to be published.
That is usually the difference between a system that silently loses cross-service truth and one that degrades in a visible, recoverable way.
In microservices, that difference matters more than elegance. It determines whether failures become operational backlog or hidden inconsistency.