
Transactional Outbox Pattern in Microservices
The transactional outbox pattern solves the dual-write problem in microservices: one operation needs to update a database and publish a message, but those two writes usually cannot commit atomically.
Without an outbox, a service can save an order and then crash before publishing order.created. Or it can publish the event and then roll back the database transaction. Both outcomes leave other services with a different version of reality than the service that owns the data.
The outbox pattern fixes the most dangerous part of that gap by storing the business change and the intent to publish in the same database transaction. A separate relay then publishes the outbox rows asynchronously.
This article is part of the Backend Reliability hub. It connects closely to Background Jobs in Production, PostgreSQL Job Queues with SKIP LOCKED, and Webhook Idempotency and Retries in Production, because the hard part is not adding a table. The hard part is making delayed work replayable, observable, and safe under duplicate delivery.
The Dual-Write Problem
Imagine an order service handling POST /orders.
The service needs to:
- insert an orders row
- reserve inventory
- publish order.created so billing, fulfillment, search indexing, email, and analytics can react
The naive implementation looks clean:
await db.order.create({
data: {
id: orderId,
customerId,
status: 'placed',
totalCents: 4200,
},
})
await broker.publish('order.created', {
orderId,
customerId,
totalCents: 4200,
})
The code has two writes:
| Write | System |
|---|---|
| Save the order | Database |
| Publish the event | Message broker |
Those writes do not share one local transaction.
That means these failure paths are real:
| Failure | Result |
|---|---|
| Database commits, process crashes before publish | Order exists, downstream services never hear about it |
| Database commits, broker is unavailable | Order exists, event is lost unless manually repaired |
| Broker accepts event, response times out | Service does not know whether to retry |
| Event publishes inside a transaction that later rolls back | Downstream services react to data that does not exist |
| Publish succeeds, marking local state fails | Relay may publish the same event again |
This is why the problem is not "how do we call the broker after saving?" The problem is "how do we make the business state and the publish intent move together?"
AWS Prescriptive Guidance describes this as the dual-write issue: a single operation writes to two systems, and a failure in either operation can create inconsistent data.
What The Transactional Outbox Guarantees
The transactional outbox pattern changes the workflow:
- Start a database transaction.
- Write the business data.
- Write an outbox event row in the same transaction.
- Commit once.
- Let a relay publish committed outbox rows later.
- Mark rows as published only after the broker acknowledges them.
The key guarantee is narrow and powerful:
If the business transaction commits, a durable record exists that says which event still needs to be published. If the transaction rolls back, the event record rolls back too.
That is the part you can make atomic.
The pattern does not guarantee exactly-once end-to-end delivery. The relay can publish a message and crash before marking it published. A broker can redeliver. A consumer can retry. The outbox makes message loss much less likely, but consumers still need idempotency.
Microservices.io's transactional outbox pattern frames the solution the same way: store the message in the database as part of the transaction that updates the business entities, then use a separate relay to send it to the broker.
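The guarantee can be sketched with an in-memory stand-in for the database. This is only a model under stated assumptions: FakeDb, placeOrder, and the failBeforeCommit flag are hypothetical names invented for illustration, and the "transaction" simply stages copies and applies them when the callback returns normally.

```typescript
// In-memory stand-in for a database with one all-or-nothing transaction.
// Every name here is illustrative; a real service would use its database client.
type Order = { id: string; customerId: string; totalCents: number }
type OutboxRow = { id: string; eventType: string; aggregateId: string; status: 'pending' | 'published' }

class FakeDb {
  orders: Order[] = []
  outbox: OutboxRow[] = []

  // Runs `work` against staged copies and applies them only if it returns normally.
  transaction(work: (tx: { orders: Order[]; outbox: OutboxRow[] }) => void): void {
    const staged = { orders: [...this.orders], outbox: [...this.outbox] }
    work(staged) // a throw here leaves this.orders and this.outbox untouched
    this.orders = staged.orders
    this.outbox = staged.outbox
  }
}

function placeOrder(db: FakeDb, order: Order, eventId: string, failBeforeCommit = false): void {
  db.transaction((tx) => {
    tx.orders.push(order)
    tx.outbox.push({ id: eventId, eventType: 'order.created', aggregateId: order.id, status: 'pending' })
    if (failBeforeCommit) throw new Error('simulated failure before commit')
  })
}

const db = new FakeDb()
placeOrder(db, { id: 'ord_1', customerId: 'cus_1', totalCents: 4200 }, 'evt_1')
try {
  placeOrder(db, { id: 'ord_2', customerId: 'cus_2', totalCents: 900 }, 'evt_2', true)
} catch {
  // The rolled-back attempt leaves neither an order nor an outbox row behind.
}
console.log(db.orders.length, db.outbox.length) // prints "1 1"
```

The committed order has exactly one pending outbox row, and the failed attempt has neither. That is the whole promise: the two records exist together or not at all.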
A Failure Timeline The Outbox Prevents
Without an outbox:
| Time | Action | System state |
|---|---|---|
| 10:00:01 | POST /orders starts | No order |
| 10:00:02 | Order transaction commits | Order exists |
| 10:00:03 | Service starts publishing order.created | Broker call in flight |
| 10:00:04 | Process restarts during deploy | No event is durably recorded |
| 10:20:00 | Customer asks why no email arrived | Order exists, email service never saw event |
With an outbox:
| Time | Action | System state |
|---|---|---|
| 10:00:01 | POST /orders starts | No order |
| 10:00:02 | Order row and outbox row commit together | Order exists, publish intent exists |
| 10:00:03 | Service restarts during deploy | Outbox row is still pending |
| 10:00:30 | Relay starts again | Relay sees pending event |
| 10:00:31 | Relay publishes order.created | Downstream services receive event |
The second version can still have delay. It can still have retries. It can still have duplicate delivery.
But it does not silently lose the fact that something needs to be published.
That is the reliability improvement.
Write The Outbox Row In The Same Transaction
The outbox row must be written inside the same transaction as the business change.
This is the boundary that matters:
await db.transaction(async (tx) => {
const order = await tx.orders.insert({
id: orderId,
customer_id: customerId,
status: 'placed',
total_cents: totalCents,
})
await tx.outbox_events.insert({
id: eventId,
aggregate_type: 'order',
aggregate_id: order.id,
aggregate_version: order.version,
event_type: 'order.created',
event_version: 1,
topic: 'orders',
partition_key: `order:${order.id}`,
payload: {
orderId: order.id,
customerId,
totalCents,
status: 'placed',
},
headers: {
correlationId,
causationId: requestId,
},
})
})
There is no broker call inside the transaction.
That is intentional. Calling the broker inside the database transaction does not make the broker part of the transaction. It only holds database locks while the service waits on a network call that may succeed, fail, time out, or become ambiguous.
If the broker accepts the event and the database transaction later rolls back, downstream services can observe an event for state that was never committed.
If the database commits and the broker call fails, downstream services miss the event.
The outbox avoids both by anchoring the event intent in the database transaction first.
A Practical Outbox Table
A production outbox table needs enough information for routing, ordering, tracing, retrying, deduplicating, and cleanup.
For PostgreSQL, a practical starting point is:
CREATE TABLE outbox_events (
id UUID PRIMARY KEY,
aggregate_type TEXT NOT NULL,
aggregate_id TEXT NOT NULL,
aggregate_version BIGINT,
event_type TEXT NOT NULL,
event_version INT NOT NULL DEFAULT 1,
topic TEXT NOT NULL,
partition_key TEXT NOT NULL,
payload JSONB NOT NULL,
headers JSONB NOT NULL DEFAULT '{}',
status TEXT NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending', 'processing', 'published', 'dead')),
attempts INT NOT NULL DEFAULT 0,
available_at TIMESTAMPTZ NOT NULL DEFAULT now(),
claimed_at TIMESTAMPTZ,
claimed_by TEXT,
published_at TIMESTAMPTZ,
last_error TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX outbox_events_ready_idx
ON outbox_events (available_at, id)
WHERE status = 'pending';
CREATE INDEX outbox_events_processing_idx
ON outbox_events (claimed_at)
WHERE status = 'processing';
CREATE INDEX outbox_events_aggregate_idx
ON outbox_events (aggregate_type, aggregate_id, aggregate_version);
The fields have jobs:
| Field | Why it exists |
|---|---|
| id | Stable event ID for deduplication and tracing |
| aggregate_type and aggregate_id | Connect the event to the business object |
| aggregate_version | Helps preserve or validate per-aggregate ordering |
| event_type and event_version | Let consumers understand the event contract |
| topic | Tells the relay where to publish |
| partition_key | Keeps related events together when the broker supports partitioning |
| payload | Stores the event body |
| headers | Carries correlation IDs, trace context, tenant IDs, or schema metadata |
| status | Drives the relay state machine |
| attempts, available_at, last_error | Support retry and backoff |
| claimed_at, claimed_by | Make stuck relay work visible |
| published_at | Supports cleanup, auditing, and latency metrics |
Keep the payload intentionally small. An outbox row is not a data warehouse row. It should contain the event facts consumers need, not a full snapshot of every related record by default.
If consumers need to fetch more detail, publish an event that includes stable IDs and enough context to decide what to do next.
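One way to keep rows consistent and payloads small is to build every outbox row through a single helper. The sketch below is an assumption-heavy illustration: NewOutboxEvent, newOutboxEvent, and MAX_PAYLOAD_CHARS are hypothetical names, the size limit is an arbitrary example value, and deriving the partition key from the aggregate type and ID is just one possible convention.

```typescript
// Hypothetical row shape for inserting into outbox_events. Columns that have
// database defaults (status, attempts, timestamps) are left for the database.
type NewOutboxEvent = {
  id: string
  aggregate_type: string
  aggregate_id: string
  aggregate_version: number | null
  event_type: string
  event_version: number
  topic: string
  partition_key: string
  payload: Record<string, unknown>
  headers: Record<string, string>
}

// Assumed limit for illustration only; pick a value that fits your broker and consumers.
const MAX_PAYLOAD_CHARS = 8192

function newOutboxEvent(input: {
  id: string
  aggregateType: string
  aggregateId: string
  aggregateVersion?: number
  eventType: string
  eventVersion?: number
  topic: string
  payload: Record<string, unknown>
  headers?: Record<string, string>
}): NewOutboxEvent {
  // Reject oversized payloads at write time instead of discovering them in the relay.
  const serialized = JSON.stringify(input.payload)
  if (serialized.length > MAX_PAYLOAD_CHARS) {
    throw new Error(`payload for ${input.eventType} is ${serialized.length} chars; limit is ${MAX_PAYLOAD_CHARS}`)
  }
  return {
    id: input.id,
    aggregate_type: input.aggregateType,
    aggregate_id: input.aggregateId,
    aggregate_version: input.aggregateVersion ?? null,
    event_type: input.eventType,
    event_version: input.eventVersion ?? 1,
    topic: input.topic,
    // Assumed convention: "<aggregate type>:<aggregate id>", for example "order:ord_123".
    partition_key: `${input.aggregateType}:${input.aggregateId}`,
    payload: input.payload,
    headers: input.headers ?? {},
  }
}
```

Failing inside the business transaction is deliberate in this sketch: an event that is too large to publish is easier to handle as a rejected write than as a row the relay can never deliver.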
Event Shape Is A Contract
An outbox event is not just an implementation detail inside the producer. Once another service consumes it, the event shape becomes a contract.
For example:
{
"eventId": "0d2b8913-d3a6-4f7e-81b5-2977ad99d471",
"eventType": "order.created",
"eventVersion": 1,
"occurredAt": "2026-05-10T12:24:11Z",
"aggregate": {
"type": "order",
"id": "ord_123",
"version": 7
},
"data": {
"customerId": "cus_456",
"totalCents": 4200,
"currency": "USD"
}
}
That gives consumers four important things:
- a stable event ID for deduplication
- an event type and version for compatibility
- an aggregate identity for routing and ordering
- a payload that represents the committed fact
Do not publish "whatever the current ORM object happens to serialize." That couples consumers to internal storage shape and makes later refactoring dangerous.
If an event changes meaning, version it like an API contract. The same principle from API Contract Testing: Prevent Breaking Clients Before Release applies here: consumers depend on fields, types, enum values, and error-handling behavior even when the producer team thinks the change is small.
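Treating the event as a contract usually means consumers validate it at the boundary instead of trusting whatever arrives. The following is a minimal hand-written validator for the example envelope above, assuming no schema library; OrderCreatedV1, isRecord, and parseOrderCreatedV1 are illustrative names, and a real project might generate this from a shared schema.

```typescript
type OrderCreatedV1 = {
  eventId: string
  eventType: 'order.created'
  eventVersion: 1
  occurredAt: string
  aggregate: { type: 'order'; id: string; version: number }
  data: { customerId: string; totalCents: number; currency: string }
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value)
}

// Returns the typed event, or null when the input does not match the v1 contract.
function parseOrderCreatedV1(input: unknown): OrderCreatedV1 | null {
  if (!isRecord(input)) return null
  const { eventId, eventType, eventVersion, occurredAt, aggregate, data } = input
  if (typeof eventId !== 'string') return null
  if (eventType !== 'order.created' || eventVersion !== 1) return null
  if (typeof occurredAt !== 'string' || Number.isNaN(Date.parse(occurredAt))) return null
  if (!isRecord(aggregate) || !isRecord(data)) return null
  const { type, id, version } = aggregate
  const { customerId, totalCents, currency } = data
  if (type !== 'order' || typeof id !== 'string' || typeof version !== 'number') return null
  if (typeof customerId !== 'string' || typeof currency !== 'string') return null
  if (typeof totalCents !== 'number' || !Number.isInteger(totalCents)) return null
  // Rebuild the object field by field so unknown extra fields are not passed along.
  return {
    eventId,
    eventType: 'order.created',
    eventVersion: 1,
    occurredAt,
    aggregate: { type: 'order', id, version },
    data: { customerId, totalCents, currency },
  }
}
```

A consumer that receives null can route the message to a dead-letter path or log it with the event ID, rather than failing halfway through a side effect.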
The Relay Is A Production Worker
The relay is the process that turns committed outbox rows into broker messages.
Treat it like a real background worker, not a tiny helper script.
A healthy relay does this:
- Claims a small batch of pending rows.
- Commits the claim quickly.
- Publishes each claimed event to the broker.
- Marks successful events as published.
- Schedules failed events for retry with backoff.
- Moves poison events to dead after a clear threshold.
- Emits metrics about backlog, age, attempts, and publish latency.
A simplified relay loop:
async function runOutboxRelay(workerId: string) {
while (true) {
const events = await outbox.claimBatch({ workerId, limit: 100 })
if (events.length === 0) {
await sleep(500)
continue
}
for (const event of events) {
try {
await broker.publish({
topic: event.topic,
key: event.partitionKey,
value: event.payload,
headers: {
eventId: event.id,
eventType: event.eventType,
...event.headers,
},
})
await outbox.markPublished(event.id)
} catch (error) {
await outbox.scheduleRetry(event.id, {
error,
nextAttemptAt: calculateBackoff(event.attempts),
})
}
}
}
}
Notice the relay marks an event as published only after the broker call returns successfully.
That still leaves an unavoidable ambiguity: the broker may accept the message, then the relay may crash before markPublished. When the relay restarts, it can publish the event again.
That is why the outbox pattern and consumer idempotency are a pair.
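The relay loop above calls calculateBackoff without defining it. One possible shape, offered as a sketch rather than the article's implementation, is capped exponential backoff with jitter plus a separate dead-letter threshold. The constants, the shouldMoveToDead helper, and the injectable now and random parameters are all assumptions added for illustration and testability.

```typescript
// Assumed tuning values for illustration; choose them from your own broker behavior.
const BASE_DELAY_MS = 1000
const MAX_DELAY_MS = 5 * 60 * 1000
const MAX_ATTEMPTS = 10

// Returns the earliest time the event should be retried.
// `attempts` is the number of publish attempts made so far (1 after the first failure).
function calculateBackoff(
  attempts: number,
  now: Date = new Date(),
  random: () => number = Math.random,
): Date {
  const exponent = Math.max(0, attempts - 1)
  // Doubles per attempt (1s, 2s, 4s, ...) and never exceeds MAX_DELAY_MS.
  const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** exponent)
  // Jitter: wait at least half the ceiling plus a random share of the other half,
  // so many failed events do not all retry at the same instant.
  const delayMs = ceiling / 2 + random() * (ceiling / 2)
  return new Date(now.getTime() + delayMs)
}

// True once an event has used up its retry budget and should be marked dead.
function shouldMoveToDead(attempts: number): boolean {
  return attempts >= MAX_ATTEMPTS
}
```

Because the claim query shown later in this article increments attempts when a row is claimed, the first failed publish reaches this function with attempts equal to 1, which yields a delay between 0.5 and 1 second under these assumed constants.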
Claim Rows Safely With SKIP LOCKED
If you run more than one relay worker, workers need to divide pending rows without blocking each other or publishing the same row at the same time.
In PostgreSQL, a common polling-relay claim query uses FOR UPDATE SKIP LOCKED:
WITH next_events AS (
SELECT id
FROM outbox_events
WHERE status = 'pending'
AND available_at <= now()
ORDER BY available_at, id
FOR UPDATE SKIP LOCKED
LIMIT 100
)
UPDATE outbox_events AS event
SET
status = 'processing',
claimed_at = now(),
claimed_by = $1,
attempts = attempts + 1,
updated_at = now()
FROM next_events
WHERE event.id = next_events.id
RETURNING
event.id,
event.topic,
event.partition_key,
event.event_type,
event.payload,
event.headers,
event.attempts;
Run this inside a short transaction. Do not hold row locks while publishing to the broker.
The PostgreSQL SELECT documentation notes that SKIP LOCKED skips rows that cannot be locked immediately, which can be useful to avoid lock contention when multiple consumers access a queue-like table. It also warns that this gives an inconsistent view of the data, so it is not for general-purpose reads.
That caveat is exactly why it fits a relay claim query: the relay does not need a complete analytical view of the table. It needs a safe way for several workers to claim different rows.
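The effect of SKIP LOCKED on two concurrent relays can be modeled in memory. This is a toy model, not PostgreSQL: FakeOutboxTable, lockPending, and commitClaim are hypothetical names, and the "lock" is just a set of row IDs. It only illustrates why two overlapping claims end up with different rows instead of waiting on each other.

```typescript
type Row = { id: number; status: 'pending' | 'processing'; claimedBy: string | null }

class FakeOutboxTable {
  rows: Row[]
  private lockedIds = new Set<number>()

  constructor(rowCount: number) {
    this.rows = Array.from({ length: rowCount }, (_, i): Row => ({
      id: i + 1,
      status: 'pending',
      claimedBy: null,
    }))
  }

  // Models SELECT ... FOR UPDATE SKIP LOCKED LIMIT n:
  // rows locked by another open transaction are skipped, not waited on.
  lockPending(limit: number): Row[] {
    const picked = this.rows
      .filter((row) => row.status === 'pending' && !this.lockedIds.has(row.id))
      .slice(0, limit)
    for (const row of picked) this.lockedIds.add(row.id)
    return picked
  }

  // Models the UPDATE plus COMMIT: rows become processing and the locks are released.
  commitClaim(picked: Row[], workerId: string): void {
    for (const row of picked) {
      row.status = 'processing'
      row.claimedBy = workerId
      this.lockedIds.delete(row.id)
    }
  }
}

const table = new FakeOutboxTable(5)
// Both relays start their claim before either one commits.
const lockedByA = table.lockPending(2)
const lockedByB = table.lockPending(2)
table.commitClaim(lockedByA, 'relay-a')
table.commitClaim(lockedByB, 'relay-b')

const claimedByA = lockedByA.map((row) => row.id) // [1, 2]
const claimedByB = lockedByB.map((row) => row.id) // [3, 4]
```

Relay B never sees rows 1 and 2 during its claim, which is the "inconsistent view" the PostgreSQL documentation warns about and exactly the behavior a queue-style claim wants.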
Recover Stuck Processing Rows
The relay can crash after claiming rows and before publishing or retrying them.
If claimed rows stay in processing forever, the outbox becomes a slow message loss mechanism.
Add a lease recovery step:
UPDATE outbox_events
SET
status = 'pending',
available_at = now(),
claimed_at = NULL,
claimed_by = NULL,
updated_at = now(),
last_error = 'relay lease expired before publish completed'
WHERE status = 'processing'
AND claimed_at < now() - interval '2 minutes';
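The lease interval should be longer than the time a worker needs to publish everything it claimed, plus a reasonable margin. Rows in a batch share one claimed_at, so what matters is the whole batch, not a single publish. If normal publishes take 200 ms and a worker publishes a batch of 100 events one at a time, the batch takes about 20 seconds, so a two-minute lease is conservative. If publishes can legitimately take one minute, use a longer lease or reduce batch size.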
The goal is not to guess perfectly. The goal is to make stuck work visible and recoverable.
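As a rough sizing aid, the lease can be derived from batch size and a slow-case publish latency. The sketch assumes the relay publishes a claimed batch sequentially, as the loop earlier in this article does; minimumLeaseMs, leaseIsSafe, and the default safety factor of 3 are hypothetical choices, not values from any library.

```typescript
// Smallest lease that covers one sequentially published batch with headroom.
// slowPublishMs should be a pessimistic per-event latency (for example p99), not the average.
function minimumLeaseMs(batchSize: number, slowPublishMs: number, safetyFactor = 3): number {
  return Math.ceil(batchSize * slowPublishMs * safetyFactor)
}

// True when the configured lease leaves enough room for a slow batch.
function leaseIsSafe(leaseMs: number, batchSize: number, slowPublishMs: number, safetyFactor = 3): boolean {
  return leaseMs >= minimumLeaseMs(batchSize, slowPublishMs, safetyFactor)
}

// 100 events at 200 ms each is 20 s of work; with 3x headroom the lease floor is 60 s.
const floorForFastBroker = minimumLeaseMs(100, 200) // 60000
// A two-minute lease clears that floor.
const twoMinuteLeaseOk = leaseIsSafe(2 * 60 * 1000, 100, 200) // true
// If a publish can take a full minute, the same batch size no longer fits in two minutes.
const twoMinuteLeaseWithSlowBroker = leaseIsSafe(2 * 60 * 1000, 100, 60 * 1000) // false
```

When the check fails, the two levers are the ones named above: lengthen the lease, or claim fewer rows per batch.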
Duplicates Still Happen
The transactional outbox prevents one class of event loss. It does not make distributed messaging exactly once.
Duplicates can still happen when:
- the relay publishes successfully but crashes before marking the row published
- the broker accepts a publish but the relay receives a timeout
- the relay retry policy republishes after an ambiguous result
- the broker redelivers a message
- the consumer processes work and crashes before recording completion
AWS documents the same consumer requirement for SQS Standard queues: messages can be delivered more than once, and applications should be idempotent.
A typical consumer deduplication table looks like this:
CREATE TABLE processed_messages (
consumer_name TEXT NOT NULL,
event_id UUID NOT NULL,
processed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (consumer_name, event_id)
);
Then the consumer records the event ID in the same transaction as its side effect:
await db.transaction(async (tx) => {
const inserted = await tx.processedMessages.insertIfNotExists({
consumerName: 'receipt-email-worker',
eventId: event.id,
})
if (!inserted) {
return
}
await tx.receiptEmails.insert({
orderId: event.data.orderId,
customerId: event.data.customerId,
status: 'pending',
})
})
If the same event arrives again, the unique key prevents the business side effect from running twice.
That is the same shape as API idempotency, only at the messaging boundary. The synchronous version is covered in API Idempotency Keys: Prevent Duplicate Requests Safely.
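The deduplication flow can be exercised without a database. The sketch below is an in-memory model with hypothetical names (IncomingEvent, ReceiptEmailConsumer, handle); the Set plays the role of the processed_messages primary key, and the two in-memory writes stand in for writes that a real consumer would place in one database transaction.

```typescript
type IncomingEvent = { eventId: string; orderId: string; customerId: string }
type ReceiptEmail = { orderId: string; customerId: string; status: 'pending' }

class ReceiptEmailConsumer {
  private readonly consumerName = 'receipt-email-worker'
  // Stands in for the (consumer_name, event_id) primary key.
  private processed = new Set<string>()
  receiptEmails: ReceiptEmail[] = []

  handle(event: IncomingEvent): 'applied' | 'duplicate' {
    const key = `${this.consumerName}:${event.eventId}`
    if (this.processed.has(key)) {
      return 'duplicate' // already handled: acknowledge and do nothing
    }
    // In a real consumer these two writes must commit in the same transaction.
    this.processed.add(key)
    this.receiptEmails.push({ orderId: event.orderId, customerId: event.customerId, status: 'pending' })
    return 'applied'
  }
}

const consumer = new ReceiptEmailConsumer()
const delivery: IncomingEvent = { eventId: 'evt_1', orderId: 'ord_123', customerId: 'cus_456' }
const first = consumer.handle(delivery)  // 'applied'
const second = consumer.handle(delivery) // 'duplicate' (same event delivered again)
```

The second delivery is acknowledged without creating another receipt email, which is the behavior a relay or broker retry requires from every consumer.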
Ordering Needs A Real Policy
Outbox rows often need ordering by aggregate.
For example:
- order.created
- order.paid
- order.cancelled
If consumers see order.cancelled before order.created, they may fail or produce nonsense.
The outbox table can help, but it does not solve ordering automatically.
Decide what ordering means:
| Requirement | Practical approach |
|---|---|
| Order only matters within one aggregate | Use aggregate_id as the broker partition key |
| Consumers must reject stale transitions | Include aggregate_version and apply only expected next versions |
| A topic has many independent aggregates | Preserve order per aggregate, not globally |
| Strict global order is required | Reconsider the architecture; global order is expensive and fragile |
When publishing to a partitioned broker, use a stable key such as order:<orderId> so all events for the same aggregate land in the same partition or message group.
When using a polling relay, be careful with multiple workers. Claiming rows in a fixed ORDER BY order does not guarantee that publish acknowledgements complete in the same order, because two workers can each hold one event for the same aggregate and finish in either sequence. If strict per-aggregate order matters, constrain relay concurrency per aggregate or let the broker partition key preserve order after publish.
Ordering is a product and data-correctness question, not just a relay implementation detail.
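The "apply only expected next versions" approach from the table can be reduced to a small decision function on the consumer side. This sketch assumes aggregate versions increase by exactly one per event, which the article does not state and which your producer would have to guarantee; VersionDecision and decideByVersion are illustrative names.

```typescript
type VersionDecision = 'apply' | 'ignore-stale' | 'wait-for-gap'

// lastAppliedVersion: the highest aggregate version this consumer has already applied
// (0 when it has seen nothing for the aggregate yet).
function decideByVersion(lastAppliedVersion: number, incomingVersion: number): VersionDecision {
  if (incomingVersion <= lastAppliedVersion) {
    return 'ignore-stale' // a duplicate or an older event that arrived late
  }
  if (incomingVersion === lastAppliedVersion + 1) {
    return 'apply' // exactly the next expected transition
  }
  return 'wait-for-gap' // an earlier event is still missing; retry or park this one
}
```

What a consumer does with wait-for-gap is a policy choice: it can retry later, park the event, or fetch current state from the producer. The function only makes the out-of-order case explicit instead of letting it fail implicitly.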
Polling Publisher Or CDC
There are two common ways to move committed outbox rows to the broker.
| Approach | How it works | Strengths | Trade-offs |
|---|---|---|---|
| Polling publisher | Application worker queries pending rows and publishes them | Simple, explicit, easy to debug, works with normal app code | Adds database read load, needs careful indexing, has polling delay |
| Change data capture | CDC tool reads committed outbox inserts from the database log | Lower polling load, good throughput, natural fit for Kafka-style pipelines | More infrastructure, harder local development, operationally deeper |
Polling is often the right first implementation.
It is easy to inspect:
SELECT status, count(*), min(created_at), max(attempts)
FROM outbox_events
GROUP BY status;
It is also easy to repair:
UPDATE outbox_events
SET status = 'pending',
available_at = now(),
last_error = 'manual retry after broker incident'
WHERE status = 'dead'
AND topic = 'orders'
AND created_at >= now() - interval '1 hour';
CDC becomes attractive when outbox volume is high, polling delay matters, or your organization already runs a reliable CDC platform.
Debezium's Outbox Event Router documentation describes a CDC-based implementation where a connector captures changes in an outbox table and applies an outbox router transformation. Its default table shape includes an event ID, aggregate type, aggregate ID, event type, and payload, which maps closely to the schema concerns above.
Do not choose CDC just because it sounds more advanced. Choose it when the operational team is ready to own connector lag, schema changes, topic routing, connector restarts, and replay behavior.
What To Monitor
The outbox pattern turns hidden inconsistency into visible backlog. That only helps if you monitor the backlog.
Useful signals:
| Signal | Why it matters |
|---|---|
| Oldest pending event age | Shows how long committed business changes wait before publication |
| Pending count by topic | Shows which workflow is falling behind |
| Processing rows older than the lease | Finds crashed or stuck relays |
| Publish success and failure rate | Shows broker or relay health |
| Attempts per event | Finds poison messages and repeated ambiguous publishes |
| Publish latency p50/p95/p99 | Shows the user-visible delay between commit and downstream awareness |
| Dead event count | Shows events that need manual triage |
| Consumer duplicate count | Shows whether relay or broker retries are increasing |
An outbox alert should usually fire on age, not just count.
A backlog of 10,000 events may be fine if the relay is draining a planned replay. A backlog of 20 events may be serious if the oldest event is 45 minutes old and blocks paid orders from reaching fulfillment.
For observability across the request, relay, broker, and consumer, propagate correlation IDs or trace context through outbox headers. That workflow is covered in Correlation IDs in Microservices.
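The age-over-count rule can be expressed as a small check over pending rows. The sketch uses created_at as the commit time, as the inspection query earlier in this article does; PendingEvent, oldestPendingAgeMs, shouldAlertOnAge, and the five-minute threshold in the example are assumptions for illustration.

```typescript
type PendingEvent = { topic: string; createdAt: Date }

// Age of the oldest pending event in milliseconds, or 0 when nothing is pending.
function oldestPendingAgeMs(pending: PendingEvent[], now: Date): number {
  let oldest = now.getTime()
  for (const event of pending) {
    oldest = Math.min(oldest, event.createdAt.getTime())
  }
  return now.getTime() - oldest
}

// Alert on how long the oldest event has waited, regardless of how many rows are pending.
function shouldAlertOnAge(pending: PendingEvent[], now: Date, maxAgeMs: number): boolean {
  return oldestPendingAgeMs(pending, now) > maxAgeMs
}

const now = new Date('2026-05-10T12:00:00Z')
const minutes = (n: number) => n * 60 * 1000

// Large but fresh backlog: 10,000 events that are all ten seconds old.
const replayBacklog: PendingEvent[] = Array.from({ length: 10000 }, () => ({
  topic: 'orders',
  createdAt: new Date(now.getTime() - 10 * 1000),
}))
// Small but stale backlog: 20 events, the oldest waiting 45 minutes.
const stuckBacklog: PendingEvent[] = Array.from({ length: 20 }, (_, i) => ({
  topic: 'orders',
  createdAt: new Date(now.getTime() - minutes(45 - i)),
}))

const replayAlerts = shouldAlertOnAge(replayBacklog, now, minutes(5)) // false
const stuckAlerts = shouldAlertOnAge(stuckBacklog, now, minutes(5))   // true
```

In practice the same number would come from a query such as the minimum created_at of pending rows per topic, exported as a gauge.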
Cleanup And Retention
Do not delete outbox rows immediately after publish unless you are certain you never need them for debugging, replay, audit, or deduplication.
A common policy:
| Row state | Retention |
|---|---|
| published | Keep 7-30 days, then archive or delete |
| dead | Keep until manually reviewed |
| pending | Keep until published or explicitly cancelled |
| processing | Recover after lease expiry |
For high-volume tables, cleanup needs to be gentle:
WITH old_rows AS (
SELECT id
FROM outbox_events
WHERE status = 'published'
AND published_at < now() - interval '14 days'
ORDER BY published_at
LIMIT 1000
)
DELETE FROM outbox_events
WHERE id IN (SELECT id FROM old_rows);
Large deletes can create database pressure, lock contention, or replication lag. If volume is high, consider partitioning by time or moving old published rows into archive storage.
The cleanup job is part of the pattern. Ignoring it turns a reliability table into a slow database-growth incident.
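A cleanup job typically repeats the batched delete until a batch comes back short. The loop below is a synchronous sketch with hypothetical names: cleanupInBatches and its deleteBatch callback are invented for illustration, and a real job would await the database call and pause between batches to limit pressure on the primary and its replicas.

```typescript
// deleteBatch(limit) is assumed to run one batched DELETE and return how many rows it removed.
function cleanupInBatches(
  deleteBatch: (limit: number) => number,
  batchSize = 1000,
  maxBatches = 100,
): number {
  let totalDeleted = 0
  // maxBatches bounds a single run so one cleanup pass cannot monopolize the database.
  for (let batch = 0; batch < maxBatches; batch++) {
    const deleted = deleteBatch(batchSize)
    totalDeleted += deleted
    if (deleted < batchSize) {
      break // a short batch means no more eligible rows remain
    }
  }
  return totalDeleted
}

// In-memory stand-in for the outbox table: 2,500 old published rows.
let remainingOldRows = 2500
let batchCalls = 0
const totalDeleted = cleanupInBatches((limit) => {
  batchCalls += 1
  const deleted = Math.min(limit, remainingOldRows)
  remainingOldRows -= deleted
  return deleted
})
// totalDeleted is 2500 after three batches of 1000, 1000, and 500.
```

Bounding both the batch size and the number of batches per run keeps each pass predictable; anything left over is picked up by the next scheduled run.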
What To Test
The outbox pattern deserves tests at the failure boundaries, not only on the success path.
Test at least these cases:
| Test | What it proves |
|---|---|
| Business transaction commits | Business row and outbox row both exist |
| Business transaction rolls back | Neither business row nor outbox row exists |
| Relay publishes successfully | Event is marked published after broker acknowledgement |
| Broker publish fails | Event returns to pending with backoff and error details |
| Relay crashes after claim | Lease recovery makes the row publishable again |
| Relay publishes twice | Consumer deduplication prevents duplicate side effects |
| Two relay workers run | Workers claim different rows without blocking each other |
| Event schema changes | Consumer contract tests catch incompatible payload changes |
For the relay itself, integration tests are more useful than isolated unit tests because the behavior depends on database locks, transactions, and row state. Use the same testing mindset from How to Write API Integration Tests: prove the boundary where the bug would actually happen.
Common Mistakes
The outbox pattern fails most often when teams weaken the exact guarantee they introduced it for.
Common mistakes:
| Mistake | Why it breaks the pattern |
|---|---|
| Writing the outbox row after the transaction | Reintroduces the commit-then-crash gap |
| Publishing inside the transaction | Holds database locks around ambiguous network I/O |
| Marking published before broker acknowledgement | Can lose messages on relay crash |
| Assuming no duplicates | Breaks consumers during relay or broker retries |
| Using no event ID | Makes deduplication and tracing much harder |
| Ignoring event versioning | Makes consumer compatibility accidental |
| No stuck-row recovery | Leaves rows claimed by a crashed relay stuck in processing, so their events are never published |
| No backlog-age alert | Hides delayed downstream workflows |
| Deleting published rows immediately | Removes evidence needed for incidents and replay |
| Treating the outbox as a performance feature | The primary value is correctness and recoverability |
Most of these are not exotic distributed-systems failures. They are ordinary production interleavings that happen during deploys, broker incidents, slow queries, worker crashes, and retry storms.
When To Use The Pattern
Use the transactional outbox pattern when:
- a committed database change must be reflected in a message or event
- losing the event creates product, financial, or data-consistency risk
- downstream services rely on events for business workflow
- a message can be retried safely if consumers are idempotent
- the service owns both the business write and the decision to publish
Good fits:
- order lifecycle events
- payment state changes
- billing and invoice workflows
- search indexing triggers
- notification jobs
- webhook dispatch
- fulfillment or provisioning workflows
Do not use it only because an architecture diagram looks more "event-driven."
It may be unnecessary when:
- the event is informational and easy to rebuild later
- occasional event loss is acceptable
- synchronous coupling is intentional and simpler
- an existing platform already provides the durable workflow
- the team cannot yet operate the relay, monitoring, and consumer idempotency
The cost is real: another table, relay code, retry policy, cleanup job, monitoring, and consumer deduplication.
The value is also real: when the database says something happened, the system has a durable path for telling the rest of the architecture.
Implementation Checklist
Before trusting an outbox in production, confirm:
- The business row and outbox row are written in the same database transaction.
- No broker publish call happens inside that transaction.
- Every event has a stable event ID.
- The event payload is treated as a consumer-facing contract.
- The relay claims rows atomically and keeps claim transactions short.
- Stuck processing rows recover after a lease expires.
- Publish failures use bounded retry with backoff.
- Poison events move to a visible dead state.
- Consumers deduplicate by event ID in the same transaction as their side effect.
- Ordering requirements are written down per aggregate or per topic.
- Oldest pending event age is monitored and alerted.
- Published rows have a retention and cleanup policy.
- CDC is chosen only when the team can operate the extra infrastructure.
If those checks are missing, the table may exist, but the reliability boundary is still incomplete.
Final Takeaway
The transactional outbox pattern works because it stops pretending that a database write and a broker publish are one atomic operation.
They are not.
Instead, it makes one smaller promise true: the business change and the durable publish intent commit together.
After that, the relay, broker, and consumers still need retries, idempotency, ordering policy, monitoring, and cleanup. That may sound like more work, but it is much better work than discovering later that the event disappeared and every downstream service quietly moved on without it.