
Transactional Outbox Pattern in Microservices
Publishing an event after a database write sounds straightforward right up until one succeeds and the other does not. That gap is where otherwise reasonable microservice workflows start losing events, duplicating downstream work, or leaving different services with different ideas of what happened.
That is the dual-write problem.
For the broader backend reliability cluster around durable recovery, queues, retries, and failure containment, see the Backend Reliability hub.
Where the Dual-Write Problem Actually Comes From
A service handles a request, creates an order, stores it in the database, and then publishes order.created to a broker. In the code, it feels like one logical operation. In reality, it is two writes to two different systems.
Those writes do not share one atomic transaction.
That means several bad outcomes are always possible: the database write succeeds but event publishing fails, the event is published but the database transaction rolls back, the process crashes between the two steps, or the publish call times out after the broker may already have accepted the message.
Once this happens, downstream services stop seeing the same reality as the source service.
This is the same broader pattern behind Why Tests Pass but Production Still Breaks: the safe-looking local code path is narrower than the real system boundary.
A Failure Timeline Teams Actually See
The dual-write bug usually does not show up as an exception with a neat stack trace. It shows up as two systems disagreeing hours later.
Imagine this sequence:
- the order row commits successfully
- the service calls publish(order.created)
- the broker accepts the message, but the network drops the response
- the application logs a timeout and retries or crashes during deploy
- downstream inventory or billing systems never see a clean, trusted event flow
From the source service's perspective, the request may have looked partially successful or uncertain. From downstream systems, the state now looks stale or duplicated.
That is why the outbox pattern matters so much. It turns an invisible disagreement between systems into a durable piece of work that can still be retried and inspected later.
Why "Save Then Publish" Feels Safer Than It Is
Most implementations begin with a very understandable assumption:
If we write to the database first and publish immediately after, the gap is small enough to ignore.
That assumption survives local testing because the code is short, the broker is healthy, and failures are rare enough that the bad interleavings do not show up often.
But "rare" is not the same thing as "acceptable." Process crashes, deploy interruptions, connection resets, timeout ambiguity, and broker failures do happen. Once they do, a service that looked correct in code review can silently lose the event that other services were depending on.
What the Transactional Outbox Pattern Changes
The transactional outbox pattern changes the workflow:
- write business data to the database
- write an outbox record in the same database transaction
- commit once
- publish outbox records asynchronously afterward
The critical move is the second step.
Instead of pretending the service can atomically write to the database and the broker together, the design makes the database the source of truth for both the business change and the fact that an event still needs to be published.
If the transaction commits, both facts exist durably. If the transaction rolls back, neither fact exists.
That removes the most dangerous inconsistency: state changed, but no durable record exists that the event still needs to be sent.
If the same aggregate can be updated concurrently by multiple requests, this pattern still needs a safe concurrency strategy around the underlying rows. That is where Optimistic vs Pessimistic Locking in SQL becomes relevant.
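As a minimal sketch, assuming the order row carries a version column and a Prisma-style updateMany call (neither is part of the outbox pattern itself), an optimistic check can run inside the same transaction that later writes the outbox row:

// only update the row if nobody changed it since we read version = expectedVersion
const updated = await tx.order.updateMany({
  where: { id: orderId, version: expectedVersion },
  data: { status: 'paid', version: { increment: 1 } },
});

if (updated.count === 0) {
  // another request won the race; abort so neither the order change nor the outbox row commits
  throw new Error('concurrent update detected');
}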
The Difference In Code
Without an outbox:
await db.order.create({
  data: {
    id: orderId,
    customerId,
    status: 'placed',
  },
});

await broker.publish('order.created', {
  orderId,
  customerId,
});
If the process crashes after db.order.create(...) and before broker.publish(...), the order exists but no event is emitted.
With an outbox:
await db.$transaction(async (tx) => {
  await tx.order.create({
    data: {
      id: orderId,
      customerId,
      status: 'placed',
    },
  });

  await tx.outbox.create({
    data: {
      topic: 'order.created',
      aggregateType: 'order',
      aggregateId: orderId,
      payload: {
        orderId,
        customerId,
      },
      status: 'pending',
      createdAt: new Date(),
    },
  });
});
Later, a relay process reads pending outbox rows and sends them to the broker.
The service no longer depends on one fragile moment between two systems.
A Practical Outbox Table
A useful outbox table often looks like this:
CREATE TABLE outbox_events (
  id BIGSERIAL PRIMARY KEY,
  topic TEXT NOT NULL,
  aggregate_type TEXT NOT NULL,
  aggregate_id TEXT NOT NULL,
  payload JSONB NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending',
  published_at TIMESTAMPTZ,
  available_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  last_error TEXT
);

CREATE INDEX idx_outbox_pending
  ON outbox_events (status, available_at, id);
The important fields are not glamorous, but they are what make the pattern operable. topic routes the event. aggregate_type and aggregate_id make it traceable. payload stores the publishable fact. status, available_at, and last_error support the relay workflow.
If you expect meaningful event volume, retention and cleanup strategy matter from the beginning.
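As a rough example, a periodic cleanup job can trim rows that were published longer ago than a retention window; the seven-day window here is an arbitrary placeholder, not a recommendation:

-- keep published rows briefly for debugging, then delete them
DELETE FROM outbox_events
WHERE published_at IS NOT NULL
  AND published_at < now() - INTERVAL '7 days';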
Why the Pattern Is Valuable Even Though Publish Can Still Fail
The transactional outbox pattern does not guarantee that publishing never fails. It guarantees something more useful:
the intent to publish is stored durably in the same transaction as the business change.
That means a temporary broker outage no longer causes silent message loss. Publishing may be delayed, but the work is still visible and recoverable.
This changes the failure mode from hidden inconsistency to visible operational pressure. Outbox backlog grows, retry counts rise, publisher alerts fire, and the age of the oldest unpublished event starts to climb.
Those are painful problems, but they are much safer than "the event disappeared and nobody knows."
This is closely related to Background Jobs in Production, where the real requirement is not just retrying work, but retrying it from a state the system can still reason about.
The Relay Process Is Part of the Correctness Model
Many teams adopt the outbox pattern and then reintroduce correctness problems in the relay itself.
The relay usually needs to read pending events in batches, publish them to the broker, mark them published only after confirmed send, retry transient failures with backoff, and surface poison events for investigation.
A simplified relay loop looks like this:
const batch = await repo.getPendingEvents({ limit: 100 });

for (const event of batch) {
  try {
    await broker.publish(event.topic, event.payload);
    await repo.markPublished(event.id);
  } catch (error) {
    await repo.markFailedAttempt(event.id, error);
  }
}
That relay should be treated like any other production worker: observable, restart-safe, and backpressure-aware. The same overload patterns described in Rate Limiting and Backpressure in Microservices still apply if the publisher falls behind.
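One hedged way to get there is to wrap the loop above in a long-running worker that polls on an interval and backs off when publishing is struggling; the interval values, the sleep helper, and the publishBatch function (the loop above refactored to return a failure count) are illustrative assumptions, not a prescribed implementation:

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function runRelay() {
  let delayMs = 1_000;

  while (true) {
    const batch = await repo.getPendingEvents({ limit: 100 });

    // publishBatch is assumed to be the relay loop above, wrapped in a
    // function that returns how many events failed to publish
    const failures = await publishBatch(batch);

    // an empty batch or repeated failures slows the relay down
    // instead of hammering the database and the broker
    const shouldBackOff = batch.length === 0 || failures > 0;
    delayMs = shouldBackOff ? Math.min(delayMs * 2, 60_000) : 1_000;
    await sleep(delayMs);
  }
}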
Relay Claiming Needs Its Own Concurrency Discipline
If you run more than one relay process, they still need a safe way to divide work. Otherwise the system can reintroduce duplication and contention in the very component meant to make publishing safer.
One common pattern is to claim publishable rows in batches with FOR UPDATE SKIP LOCKED:
WITH next_events AS (
  SELECT id
  FROM outbox_events
  WHERE status = 'pending'
    AND available_at <= now()
  ORDER BY id
  FOR UPDATE SKIP LOCKED
  LIMIT 100
)
UPDATE outbox_events
SET status = 'processing'
WHERE id IN (SELECT id FROM next_events)
RETURNING id, topic, payload;
The exact state model can vary, but the principle is the same as with any queue-like worker system: claim rows atomically, publish only what you actually claimed, and keep the claim transaction short.
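One consequence of a processing state is that a crashed relay can leave rows claimed but never published. A hedged recovery sketch, assuming the claim query above also sets an extra claimed_at column (not shown in the table earlier) to now():

-- release claims older than a timeout so another relay can pick them up again
UPDATE outbox_events
SET status = 'pending',
    available_at = now()
WHERE status = 'processing'
  AND claimed_at < now() - INTERVAL '5 minutes';

The timeout here is arbitrary; the point is that processing should never become a terminal state by accident.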
Duplicates Still Exist, So Consumers Must Be Idempotent
The outbox pattern dramatically reduces lost-event risk. It does not create exactly-once delivery across the whole architecture.
Duplicates can still happen if the broker accepts a message but the relay crashes before marking the row published, if retry logic republishes after an ambiguous timeout, or if downstream consumers retry their own work.
That is why outbox producers and event consumers must be designed together. Consumers should assume the same event may arrive more than once, ordering may not be perfect, and processing may resume after partial completion.
This is the same correctness boundary discussed in API Idempotency Keys, just moved from synchronous APIs into async messaging.
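A minimal consumer-side sketch of that assumption, using a Prisma-style client and a processed_events table keyed by event id; the processedEvent model and reserveInventory helper are illustrative, and it assumes the relay includes the outbox row id in the published message, since the payload alone may not identify a delivery:

async function handleOrderCreated(event: { id: string; payload: { orderId: string } }) {
  await db.$transaction(async (tx) => {
    // insert-or-skip on the event id: a redelivery becomes a no-op
    const inserted = await tx.processedEvent.createMany({
      data: [{ eventId: event.id }],
      skipDuplicates: true,
    });

    if (inserted.count === 0) {
      return; // this event was already handled
    }

    // the consumer's real work runs in the same transaction as the dedupe record
    await reserveInventory(tx, event.payload.orderId);
  });
}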
Polling Publisher vs CDC
There are two common ways to move outbox rows to the broker.
Polling publisher
A worker periodically queries the outbox table for pending rows. If that worker claims rows directly from PostgreSQL, PostgreSQL Job Queues with SKIP LOCKED is the practical companion pattern for avoiding worker contention.
Polling is attractive because it is simple to implement, easy to reason about, and fits normal application code. The tradeoff is added read load on the database, some polling delay, and a need for careful batching and indexing.
Change data capture (CDC)
A CDC tool reads database changes from the transaction log and forwards outbox records to the broker.
CDC is attractive because it reduces polling overhead, lowers latency, and often fits high-throughput systems better. The tradeoff is more operational complexity, more infrastructure to own, and a harder local development story.
For many teams, polling is the right first implementation. CDC becomes attractive when throughput or latency requirements justify the extra operational surface area.
What To Monitor In Production
The outbox pattern is only as reassuring as the observability around it.
The most useful signals are often the age of the oldest unpublished event, publish success and retry rates, rows stuck in processing, per-topic backlog growth, and how long business changes take to become visible downstream.
Those metrics tell you whether the system is merely delaying work safely or whether the relay is quietly becoming the next reliability bottleneck.
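Two of those signals can come straight from the outbox table itself; a rough sketch of the queries, assuming the schema shown earlier:

-- age of the oldest event that has not been published yet
SELECT now() - min(created_at) AS oldest_unpublished_age
FROM outbox_events
WHERE published_at IS NULL;

-- per-topic backlog of pending events
SELECT topic, count(*) AS pending_events
FROM outbox_events
WHERE status = 'pending'
GROUP BY topic
ORDER BY pending_events DESC;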
Common Implementation Mistakes
The outbox pattern helps only if the surrounding details stay disciplined.
Common mistakes include writing the outbox row outside the transaction, deleting rows immediately after publish, forgetting consumer idempotency, publishing oversized payloads by default, ignoring backlog age, and treating the outbox as a performance feature rather than a correctness feature.
Most failures here are not exotic. They come from weakening the exact guarantees the pattern was meant to introduce.
When the Pattern Is Worth It
Use the transactional outbox pattern when domain events drive downstream workflows, missed events create product or data risk, you cannot tolerate "database committed, message lost," and the service itself owns both the business write and the decision to publish.
It is especially useful in workflows like orders, payments, billing, user lifecycle events, search indexing triggers, and webhook dispatch pipelines.
If losing an event would require manual repair or leave other services permanently stale, the outbox pattern is usually justified.
When It May Be Too Much
Not every system needs an outbox.
It may be unnecessary when the event is informational and easy to rebuild later, occasional loss is acceptable, the system is still simple enough that synchronous coupling is intentional, or an existing platform already guarantees the workflow another way.
The cost is real: another table, another relay, cleanup and retry policy, and consumer idempotency work. The point is not to add ceremony. The point is to remove an ambiguity that becomes expensive once multiple services depend on the same event stream.
Final Thoughts
The transactional outbox pattern works because it stops pretending the database write and broker publish are one atomic action.
They are not.
Instead, it makes one thing atomic: the business change and the durable record that the event still needs to be published.
That is usually the difference between a system that silently loses cross-service truth and one that degrades in a visible, recoverable way.