Transactional Outbox Pattern in Microservices

The transactional outbox pattern solves the dual-write problem in microservices: one operation needs to update a database and publish a message, but those two writes usually cannot commit atomically.

Without an outbox, a service can save an order and then crash before publishing order.created. Or it can publish the event and then roll back the database transaction. Both outcomes leave other services with a different version of reality than the service that owns the data.

The outbox pattern fixes the most dangerous part of that gap by storing the business change and the intent to publish in the same database transaction. A separate relay then publishes the outbox rows asynchronously.

This article is part of the Backend Reliability hub. It connects closely to Background Jobs in Production, PostgreSQL Job Queues with SKIP LOCKED, and Webhook Idempotency and Retries in Production, because the hard part is not adding a table. The hard part is making delayed work replayable, observable, and safe under duplicate delivery.


The Dual-Write Problem

Imagine an order service handling POST /orders.

The service needs to:

  1. insert an orders row
  2. reserve inventory
  3. publish order.created so billing, fulfillment, search indexing, email, and analytics can react

The naive implementation looks clean:

await db.order.create({
  data: {
    id: orderId,
    customerId,
    status: 'placed',
    totalCents: 4200,
  },
})

await broker.publish('order.created', {
  orderId,
  customerId,
  totalCents: 4200,
})

The code has two writes:

Write | System
Save the order | Database
Publish the event | Message broker

Those writes do not share one local transaction.

That means these failure paths are real:

Failure | Result
Database commits, process crashes before publish | Order exists, downstream services never hear about it
Database commits, broker is unavailable | Order exists, event is lost unless manually repaired
Broker accepts event, response times out | Service does not know whether to retry
Event publishes inside a transaction that later rolls back | Downstream services react to data that does not exist
Publish succeeds, marking local state fails | Relay may publish the same event again

This is why the problem is not "how do we call the broker after saving?" The problem is "how do we make the business state and the publish intent move together?"

AWS Prescriptive Guidance describes this as the dual-write issue: a single operation writes to two systems, and a failure in either operation can create inconsistent data.


What The Transactional Outbox Guarantees

The transactional outbox pattern changes the workflow:

  1. Start a database transaction.
  2. Write the business data.
  3. Write an outbox event row in the same transaction.
  4. Commit once.
  5. Let a relay publish committed outbox rows later.
  6. Mark rows as published only after the broker acknowledges them.

The key guarantee is narrow and powerful:

If the business transaction commits, a durable record exists that says which event still needs to be published. If the transaction rolls back, the event record rolls back too.

That is the part you can make atomic.

The pattern does not guarantee exactly-once end-to-end delivery. The relay can publish a message and crash before marking it published. A broker can redeliver. A consumer can retry. The outbox makes message loss much less likely, but consumers still need idempotency.

Microservices.io's transactional outbox pattern frames the solution the same way: store the message in the database as part of the transaction that updates the business entities, then use a separate relay to send it to the broker.


A Failure Timeline The Outbox Prevents

Without an outbox:

Time | Action | System state
10:00:01 | POST /orders starts | No order
10:00:02 | Order transaction commits | Order exists
10:00:03 | Service starts publishing order.created | Broker call in flight
10:00:04 | Process restarts during deploy | No event is durably recorded
10:20:00 | Customer asks why no email arrived | Order exists, email service never saw event

With an outbox:

Time | Action | System state
10:00:01 | POST /orders starts | No order
10:00:02 | Order row and outbox row commit together | Order exists, publish intent exists
10:00:03 | Service restarts during deploy | Outbox row is still pending
10:00:30 | Relay starts again | Relay sees pending event
10:00:31 | Relay publishes order.created | Downstream services receive event

The second version can still have delay. It can still have retries. It can still have duplicate delivery.

But it does not silently lose the fact that something needs to be published.

That is the reliability improvement.


Write The Outbox Row In The Same Transaction

The outbox row must be written inside the same transaction as the business change.

This is the boundary that matters:

await db.transaction(async (tx) => {
  const order = await tx.orders.insert({
    id: orderId,
    customer_id: customerId,
    status: 'placed',
    total_cents: totalCents,
  })

  await tx.outbox_events.insert({
    id: eventId,
    aggregate_type: 'order',
    aggregate_id: order.id,
    aggregate_version: order.version,
    event_type: 'order.created',
    event_version: 1,
    topic: 'orders',
    partition_key: `order:${order.id}`,
    payload: {
      orderId: order.id,
      customerId,
      totalCents,
      status: 'placed',
    },
    headers: {
      correlationId,
      causationId: requestId,
    },
  })
})

There is no broker call inside the transaction.

That is intentional. Calling the broker inside the database transaction does not make the broker part of the transaction. It only holds database locks while the service waits on a network call that may succeed, fail, time out, or become ambiguous.

If the broker accepts the event and the database transaction later rolls back, downstream services can observe an event for state that was never committed.

If the database commits and the broker call fails, downstream services miss the event.

The outbox avoids both by anchoring the event intent in the database transaction first.


A Practical Outbox Table

A production outbox table needs enough information for routing, ordering, tracing, retrying, deduplicating, and cleanup.

For PostgreSQL, a practical starting point is:

CREATE TABLE outbox_events (
  id UUID PRIMARY KEY,
  aggregate_type TEXT NOT NULL,
  aggregate_id TEXT NOT NULL,
  aggregate_version BIGINT,
  event_type TEXT NOT NULL,
  event_version INT NOT NULL DEFAULT 1,
  topic TEXT NOT NULL,
  partition_key TEXT NOT NULL,
  payload JSONB NOT NULL,
  headers JSONB NOT NULL DEFAULT '{}',
  status TEXT NOT NULL DEFAULT 'pending'
    CHECK (status IN ('pending', 'processing', 'published', 'dead')),
  attempts INT NOT NULL DEFAULT 0,
  available_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  claimed_at TIMESTAMPTZ,
  claimed_by TEXT,
  published_at TIMESTAMPTZ,
  last_error TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX outbox_events_ready_idx
  ON outbox_events (available_at, id)
  WHERE status = 'pending';

CREATE INDEX outbox_events_processing_idx
  ON outbox_events (claimed_at)
  WHERE status = 'processing';

CREATE INDEX outbox_events_aggregate_idx
  ON outbox_events (aggregate_type, aggregate_id, aggregate_version);

The fields have jobs:

Field | Why it exists
id | Stable event ID for deduplication and tracing
aggregate_type and aggregate_id | Connect the event to the business object
aggregate_version | Helps preserve or validate per-aggregate ordering
event_type and event_version | Let consumers understand the event contract
topic | Tells the relay where to publish
partition_key | Keeps related events together when the broker supports partitioning
payload | Stores the event body
headers | Carries correlation IDs, trace context, tenant IDs, or schema metadata
status | Drives the relay state machine
attempts, available_at, last_error | Support retry and backoff
claimed_at, claimed_by | Make stuck relay work visible
published_at | Supports cleanup, auditing, and latency metrics

Keep the payload intentionally small. An outbox row is not a data warehouse row. It should contain the event facts consumers need, not a full snapshot of every related record by default.

If consumers need to fetch more detail, publish an event that includes stable IDs and enough context to decide what to do next.


Event Shape Is A Contract

An outbox event is not just an implementation detail inside the producer. Once another service consumes it, the event shape becomes a contract.

For example:

{
  "eventId": "0d2b8913-d3a6-4f7e-81b5-2977ad99d471",
  "eventType": "order.created",
  "eventVersion": 1,
  "occurredAt": "2026-05-10T12:24:11Z",
  "aggregate": {
    "type": "order",
    "id": "ord_123",
    "version": 7
  },
  "data": {
    "customerId": "cus_456",
    "totalCents": 4200,
    "currency": "USD"
  }
}

That gives consumers four important things:

  • a stable event ID for deduplication
  • an event type and version for compatibility
  • an aggregate identity for routing and ordering
  • a payload that represents the committed fact

Do not publish "whatever the current ORM object happens to serialize." That couples consumers to internal storage shape and makes later refactoring dangerous.

If an event changes meaning, version it like an API contract. The same principle from API Contract Testing: Prevent Breaking Clients Before Release applies here: consumers depend on fields, types, enum values, and error-handling behavior even when the producer team thinks the change is small.
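
One way to keep that contract explicit is to build the published envelope through a typed function instead of serializing a storage object. The following is a minimal sketch, not a definitive implementation. The names OutboxRow, EventEnvelope, and toEnvelope are illustrative, the row shape assumes a subset of the outbox_events columns, and using created_at as occurredAt is an assumption.

```typescript
// Hypothetical row shape, mirroring a subset of the outbox_events columns.
interface OutboxRow {
  id: string
  aggregate_type: string
  aggregate_id: string
  aggregate_version: number | null
  event_type: string
  event_version: number
  payload: Record<string, unknown>
  created_at: Date
}

// Consumer-facing envelope, following the JSON example above.
interface EventEnvelope {
  eventId: string
  eventType: string
  eventVersion: number
  occurredAt: string // ISO 8601 timestamp in UTC
  aggregate: { type: string; id: string; version: number | null }
  data: Record<string, unknown>
}

// Maps storage names to contract names in exactly one place, so a column
// rename does not silently change what consumers receive.
function toEnvelope(row: OutboxRow): EventEnvelope {
  return {
    eventId: row.id,
    eventType: row.event_type,
    eventVersion: row.event_version,
    occurredAt: row.created_at.toISOString(), // assumption: created_at is the event time
    aggregate: {
      type: row.aggregate_type,
      id: row.aggregate_id,
      version: row.aggregate_version,
    },
    data: row.payload,
  }
}
```

Because the mapping lives in one function, a contract test can assert on its output without touching the database or the broker.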


The Relay Is A Production Worker

The relay is the process that turns committed outbox rows into broker messages.

Treat it like a real background worker, not a tiny helper script.

A healthy relay does this:

  1. Claims a small batch of pending rows.
  2. Commits the claim quickly.
  3. Publishes each claimed event to the broker.
  4. Marks successful events as published.
  5. Schedules failed events for retry with backoff.
  6. Moves poison events to dead after a clear threshold.
  7. Emits metrics about backlog, age, attempts, and publish latency.

A simplified relay loop:

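async function runOutboxRelay(workerId: string) {
  while (true) {
    // Claim in its own short transaction; no row locks are held while publishing.
    const events = await outbox.claimBatch({ workerId, limit: 100 })

    if (events.length === 0) {
      // Nothing is ready: pause briefly instead of polling in a tight loop.
      await sleep(500)
      continue
    }

    for (const event of events) {
      try {
        await broker.publish({
          topic: event.topic,
          key: event.partitionKey,
          value: event.payload,
          headers: {
            // Spread stored headers first so they cannot overwrite eventId or eventType.
            ...event.headers,
            eventId: event.id,
            eventType: event.eventType,
          },
        })

        // Mark published only after the broker has acknowledged the message.
        await outbox.markPublished(event.id)
      } catch (error) {
        // Return the row to pending with a later available_at.
        await outbox.scheduleRetry(event.id, {
          error,
          nextAttemptAt: calculateBackoff(event.attempts),
        })
      }
    }
  }
}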

Notice the relay marks an event as published only after the broker call returns successfully.

That still leaves an unavoidable ambiguity: the broker may accept the message, then the relay may crash before markPublished. When the relay restarts, it can publish the event again.

That is why the outbox pattern and consumer idempotency are a pair.
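
The relay loop calls calculateBackoff without showing it, and the list above mentions a threshold for poison events. A minimal sketch of both follows, assuming exponential backoff with a cap and full jitter. The constants and the decideAfterFailure helper are illustrative assumptions, not values from any particular library, and attempts is assumed to be at least 1 because the claim step increments it.

```typescript
const BASE_DELAY_MS = 1000 // assumed delay ceiling for the first retry
const MAX_DELAY_MS = 5 * 60 * 1000 // assumed cap: five minutes
const MAX_ATTEMPTS = 10 // assumed poison threshold

// Exponential backoff with a cap and full jitter: the delay is drawn
// uniformly from [0, min(cap, base * 2^(attempts - 1))).
function backoffDelayMs(attempts: number, random: () => number = Math.random): number {
  const exponent = Math.max(0, attempts - 1)
  const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** exponent)
  return Math.floor(random() * ceiling)
}

// Returns the timestamp to store in available_at for the next attempt.
function calculateBackoff(
  attempts: number,
  now: Date = new Date(),
  random: () => number = Math.random,
): Date {
  return new Date(now.getTime() + backoffDelayMs(attempts, random))
}

type FailureOutcome =
  | { status: 'pending'; availableAt: Date }
  | { status: 'dead' }

// Decides whether a failed event is retried later or parked as dead.
function decideAfterFailure(
  attempts: number,
  now: Date = new Date(),
  random: () => number = Math.random,
): FailureOutcome {
  if (attempts >= MAX_ATTEMPTS) {
    return { status: 'dead' }
  }
  return { status: 'pending', availableAt: calculateBackoff(attempts, now, random) }
}
```

Jitter spreads retries out so that a broker recovering from an incident is not hit by every failed event at the same instant. Passing now and random as parameters keeps the policy deterministic in tests.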


Claim Rows Safely With SKIP LOCKED

If you run more than one relay worker, workers need to divide pending rows without blocking each other or publishing the same row at the same time.

In PostgreSQL, a common polling-relay claim query uses FOR UPDATE SKIP LOCKED:

WITH next_events AS (
  SELECT id
  FROM outbox_events
  WHERE status = 'pending'
    AND available_at <= now()
  ORDER BY available_at, id
  FOR UPDATE SKIP LOCKED
  LIMIT 100
)
UPDATE outbox_events AS event
SET
  status = 'processing',
  claimed_at = now(),
  claimed_by = $1,
  attempts = attempts + 1,
  updated_at = now()
FROM next_events
WHERE event.id = next_events.id
RETURNING
  event.id,
  event.topic,
  event.partition_key,
  event.event_type,
  event.payload,
  event.headers,
  event.attempts;

Run this inside a short transaction. Do not hold row locks while publishing to the broker.

The PostgreSQL SELECT documentation notes that SKIP LOCKED skips rows that cannot be locked immediately, which can be useful to avoid lock contention when multiple consumers access a queue-like table. It also warns that this gives an inconsistent view of the data, so it is not for general-purpose reads.

That caveat is exactly why it fits a relay claim query: the relay does not need a complete analytical view of the table. It needs a safe way for several workers to claim different rows.
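
To connect the claim query to the relay, the query can be wrapped in a small function that maps snake_case columns to the camelCase fields the relay loop reads. This is a sketch under stated assumptions: Queryable is a hypothetical minimal interface, loosely modelled on typical PostgreSQL clients, and the batch size is passed as a second parameter rather than hard-coded.

```typescript
// Hypothetical minimal database interface; adapt it to the client in use.
interface Queryable {
  query(text: string, values: unknown[]): Promise<{ rows: Record<string, unknown>[] }>
}

interface ClaimedEvent {
  id: string
  topic: string
  partitionKey: string
  eventType: string
  payload: unknown
  headers: Record<string, unknown>
  attempts: number
}

// Same claim logic as the query above, with the batch size as $2.
const CLAIM_SQL = `
  WITH next_events AS (
    SELECT id
    FROM outbox_events
    WHERE status = 'pending'
      AND available_at <= now()
    ORDER BY available_at, id
    LIMIT $2
    FOR UPDATE SKIP LOCKED
  )
  UPDATE outbox_events AS event
  SET status = 'processing', claimed_at = now(), claimed_by = $1,
      attempts = attempts + 1, updated_at = now()
  FROM next_events
  WHERE event.id = next_events.id
  RETURNING event.id, event.topic, event.partition_key, event.event_type,
            event.payload, event.headers, event.attempts`

async function claimBatch(db: Queryable, workerId: string, limit: number): Promise<ClaimedEvent[]> {
  const result = await db.query(CLAIM_SQL, [workerId, limit])
  return result.rows.map((row) => ({
    id: String(row['id']),
    topic: String(row['topic']),
    partitionKey: String(row['partition_key']),
    eventType: String(row['event_type']),
    payload: row['payload'],
    headers: (row['headers'] ?? {}) as Record<string, unknown>,
    attempts: Number(row['attempts']),
  }))
}
```

Sent as a single statement outside an explicit transaction, PostgreSQL runs the claim in its own transaction, so the row locks are released as soon as the statement returns and before any broker call starts.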


Recover Stuck Processing Rows

The relay can crash after claiming rows and before publishing or retrying them.

If claimed rows stay in processing forever, the outbox becomes a slow message loss mechanism.

Add a lease recovery step:

UPDATE outbox_events
SET
  status = 'pending',
  available_at = now(),
  claimed_at = NULL,
  claimed_by = NULL,
  updated_at = now(),
  last_error = 'relay lease expired before publish completed'
WHERE status = 'processing'
  AND claimed_at < now() - interval '2 minutes';

The lease interval should be longer than normal publish latency plus a reasonable margin. If normal publishes take 200 ms, a two-minute lease is conservative. If publishes can legitimately take one minute, use a longer lease or reduce batch size.

The goal is not to guess perfectly. The goal is to make stuck work visible and recoverable.
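
The simplified relay loop shown earlier publishes a claimed batch one event at a time, so the last row in a batch can wait roughly batch size times publish latency after it was claimed. A rough sizing sketch under that assumption follows. The function names and the safety factor are illustrative, and a relay that publishes concurrently would need a different estimate.

```typescript
interface LeaseInputs {
  batchSize: number // rows claimed per batch
  slowPublishMs: number // pessimistic per-publish latency, for example a client timeout
  safetyFactor: number // extra margin multiplier, for example 2
}

// Lower bound for the lease when a batch is published sequentially.
function minimumLeaseMs({ batchSize, slowPublishMs, safetyFactor }: LeaseInputs): number {
  return Math.ceil(batchSize * slowPublishMs * safetyFactor)
}

// Mirrors the WHERE clause of the recovery query: claimed_at < now() - lease.
function isLeaseExpired(claimedAt: Date, now: Date, leaseMs: number): boolean {
  return now.getTime() - claimedAt.getTime() > leaseMs
}

// 100 rows at 200 ms each with a 2x margin needs 40 seconds, well inside two minutes.
const fastBroker = minimumLeaseMs({ batchSize: 100, slowPublishMs: 200, safetyFactor: 2 })

// One-minute publishes fit a two-minute lease only if the batch size drops to 1.
const slowBroker = minimumLeaseMs({ batchSize: 1, slowPublishMs: 60000, safetyFactor: 2 })
```

A lease that is too short does not lose events. It returns rows to pending while the original worker may still publish them, which produces duplicates that consumers must already tolerate.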


Duplicates Still Happen

The transactional outbox prevents one class of event loss. It does not make distributed messaging exactly once.

Duplicates can still happen when:

  • the relay publishes successfully but crashes before marking the row published
  • the broker accepts a publish but the relay receives a timeout
  • the relay retry policy republishes after an ambiguous result
  • the broker redelivers a message
  • the consumer processes work and crashes before recording completion

AWS documents the same consumer requirement for SQS Standard queues: messages can be delivered more than once, and applications should be idempotent.

A typical consumer deduplication table looks like this:

CREATE TABLE processed_messages (
  consumer_name TEXT NOT NULL,
  event_id UUID NOT NULL,
  processed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (consumer_name, event_id)
);

Then the consumer records the event ID in the same transaction as its side effect:

await db.transaction(async (tx) => {
  const inserted = await tx.processedMessages.insertIfNotExists({
    consumerName: 'receipt-email-worker',
    eventId: event.id,
  })

  if (!inserted) {
    return
  }

  await tx.receiptEmails.insert({
    orderId: event.data.orderId,
    customerId: event.data.customerId,
    status: 'pending',
  })
})

If the same event arrives again, the unique key prevents the business side effect from running twice.

That is the same shape as API idempotency, only at the messaging boundary. The synchronous version is covered in API Idempotency Keys: Prevent Duplicate Requests Safely.
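
The insertIfNotExists call above is left abstract. In PostgreSQL it could be backed by INSERT ... ON CONFLICT DO NOTHING against the processed_messages primary key. A minimal sketch follows, assuming a hypothetical Tx interface whose query result exposes the number of affected rows.

```typescript
// Hypothetical transaction handle; rowCount is the number of rows affected.
interface Tx {
  query(text: string, values: unknown[]): Promise<{ rowCount: number | null }>
}

// Returns true when this consumer sees the event for the first time,
// and false when the (consumer_name, event_id) pair already exists.
async function insertIfNotExists(tx: Tx, consumerName: string, eventId: string): Promise<boolean> {
  const result = await tx.query(
    `INSERT INTO processed_messages (consumer_name, event_id)
     VALUES ($1, $2)
     ON CONFLICT (consumer_name, event_id) DO NOTHING`,
    [consumerName, eventId],
  )
  return result.rowCount === 1
}
```

The insert and the side effect must share one transaction. If the side effect fails and the transaction rolls back, the deduplication row rolls back too, so a redelivery can try again.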


Ordering Needs A Real Policy

Outbox rows often need ordering by aggregate.

For example:

  1. order.created
  2. order.paid
  3. order.cancelled

If consumers see order.cancelled before order.created, they may fail or produce nonsense.

The outbox table can help, but it does not solve ordering automatically.

Decide what ordering means:

Requirement | Practical approach
Order only matters within one aggregate | Use aggregate_id as the broker partition key
Consumers must reject stale transitions | Include aggregate_version and apply only expected next versions
A topic has many independent aggregates | Preserve order per aggregate, not globally
Strict global order is required | Reconsider the architecture; global order is expensive and fragile

When publishing to a partitioned broker, use a stable key such as order:<orderId> so all events for the same aggregate land in the same partition or message group.

When using a polling relay, be careful with multiple workers. Claiming rows with an ORDER BY clause does not guarantee that publish acknowledgements complete in that order, because each worker publishes its own batch independently. If strict per-aggregate order matters, constrain relay concurrency per aggregate or let the broker partition key preserve order after publish.

Ordering is a product and data-correctness question, not just a relay implementation detail.
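
The "apply only expected next versions" approach from the table can be sketched as a small consumer-side gate. This assumes aggregate_version increases by exactly one per event for a given aggregate and that a consumer stores the last version it applied, with 0 meaning nothing applied yet. Both are assumptions of this sketch, and the names are illustrative.

```typescript
type VersionDecision = 'apply' | 'stale' | 'gap'

// Compares the incoming event version with the last version this consumer applied.
function decideByVersion(lastAppliedVersion: number, incomingVersion: number): VersionDecision {
  if (incomingVersion <= lastAppliedVersion) {
    return 'stale' // duplicate or out-of-date event: safe to acknowledge and skip
  }
  if (incomingVersion === lastAppliedVersion + 1) {
    return 'apply' // exactly the next expected transition
  }
  return 'gap' // an earlier event has not arrived yet: retry later or park the event
}
```

What a consumer does on a gap is a policy choice. Retrying later preserves strict order, while some consumers can apply the newest state and ignore the missing steps.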


Polling Publisher Or CDC

There are two common ways to move committed outbox rows to the broker.

Approach | How it works | Strengths | Trade-offs
Polling publisher | Application worker queries pending rows and publishes them | Simple, explicit, easy to debug, works with normal app code | Adds database read load, needs careful indexing, has polling delay
Change data capture | CDC tool reads committed outbox inserts from the database log | Lower polling load, good throughput, natural fit for Kafka-style pipelines | More infrastructure, harder local development, operationally deeper

Polling is often the right first implementation.

It is easy to inspect:

SELECT status, count(*), min(created_at), max(attempts)
FROM outbox_events
GROUP BY status;

It is also easy to repair:

UPDATE outbox_events
SET status = 'pending',
    available_at = now(),
    last_error = 'manual retry after broker incident'
WHERE status = 'dead'
  AND topic = 'orders'
  AND created_at >= now() - interval '1 hour';

CDC becomes attractive when outbox volume is high, polling delay matters, or your organization already runs a reliable CDC platform.

Debezium's Outbox Event Router documentation describes a CDC-based implementation where a connector captures changes in an outbox table and applies an outbox router transformation. Its default table shape includes an event ID, aggregate type, aggregate ID, event type, and payload, which maps closely to the schema concerns above.

Do not choose CDC just because it sounds more advanced. Choose it when the operational team is ready to own connector lag, schema changes, topic routing, connector restarts, and replay behavior.


What To Monitor

The outbox pattern turns hidden inconsistency into visible backlog. That only helps if you monitor the backlog.

Useful signals:

Signal | Why it matters
Oldest pending event age | Shows how long committed business changes wait before publication
Pending count by topic | Shows which workflow is falling behind
Processing rows older than the lease | Finds crashed or stuck relays
Publish success and failure rate | Shows broker or relay health
Attempts per event | Finds poison messages and repeated ambiguous publishes
Published latency p50/p95/p99 | Shows the user-visible delay between commit and downstream awareness
Dead event count | Shows events that need manual triage
Consumer duplicate count | Shows whether relay or broker retries are increasing

An outbox alert should usually fire on age, not just count.

A backlog of 10,000 events may be fine if the relay is draining a planned replay. A backlog of 20 events may be serious if the oldest event is 45 minutes old and blocks paid orders from reaching fulfillment.

For observability across the request, relay, broker, and consumer, propagate correlation IDs or trace context through outbox headers. That workflow is covered in Correlation IDs in Microservices.
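
The age-over-count rule can be expressed as a small check. This sketch assumes a per-topic summary, for example from a query that groups pending rows by topic and takes min(created_at). The PendingSummary shape and the five-minute threshold are illustrative assumptions.

```typescript
interface PendingSummary {
  topic: string
  pendingCount: number // reported for context, deliberately not used for alerting
  oldestCreatedAt: Date | null // null when nothing is pending
}

function oldestPendingAgeMs(summary: PendingSummary, now: Date): number {
  if (summary.oldestCreatedAt === null) {
    return 0
  }
  return Math.max(0, now.getTime() - summary.oldestCreatedAt.getTime())
}

// Fires on the age of the oldest pending event, whatever the backlog size.
function shouldAlert(summary: PendingSummary, now: Date, maxAgeMs: number): boolean {
  return oldestPendingAgeMs(summary, now) > maxAgeMs
}

const now = new Date('2026-05-10T12:00:00Z')
const fiveMinutes = 5 * 60 * 1000

// A large backlog that is draining: the oldest event is only 30 seconds old.
const plannedReplay: PendingSummary = {
  topic: 'orders',
  pendingCount: 10000,
  oldestCreatedAt: new Date(now.getTime() - 30 * 1000),
}

// A small backlog that is stuck: the oldest event is 45 minutes old.
const stuckOrders: PendingSummary = {
  topic: 'orders',
  pendingCount: 20,
  oldestCreatedAt: new Date(now.getTime() - 45 * 60 * 1000),
}
```

With these inputs, shouldAlert returns false for the planned replay and true for the stuck orders, which matches the two scenarios described above.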


Cleanup And Retention

Do not delete outbox rows immediately after publish unless you are certain you never need them for debugging, replay, audit, or deduplication.

A common policy:

Row state | Retention
published | Keep 7-30 days, then archive or delete
dead | Keep until manually reviewed
pending | Keep until published or explicitly cancelled
processing | Recover after lease expiry

For high-volume tables, cleanup needs to be gentle:

WITH old_rows AS (
  SELECT id
  FROM outbox_events
  WHERE status = 'published'
    AND published_at < now() - interval '14 days'
  ORDER BY published_at
  LIMIT 1000
)
DELETE FROM outbox_events
WHERE id IN (SELECT id FROM old_rows);

Large deletes can create database pressure, lock contention, or replication lag. If volume is high, consider partitioning by time or moving old published rows into archive storage.

The cleanup job is part of the pattern. Ignoring it turns a reliability table into a slow database-growth incident.
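
A gentle cleanup job usually runs the batched delete in a loop with a pause between batches. The sketch below assumes a hypothetical deleteBatch callback that runs a delete like the one above and resolves to the number of rows removed. The batch size, pause, and batch limit are illustrative defaults.

```typescript
// Hypothetical callback: deletes up to `limit` old published rows and
// resolves to how many rows were actually deleted.
type DeleteBatch = (limit: number) => Promise<number>

async function cleanupPublished(
  deleteBatch: DeleteBatch,
  batchSize = 1000,
  pauseMs = 200,
  maxBatches = 100,
): Promise<number> {
  let total = 0
  for (let batch = 0; batch < maxBatches; batch++) {
    const deleted = await deleteBatch(batchSize)
    total += deleted
    if (deleted < batchSize) {
      break // a short batch means no more rows are old enough to delete
    }
    // Pause so other queries and replication can catch up between batches.
    await new Promise<void>((resolve) => setTimeout(resolve, pauseMs))
  }
  return total
}
```

The maxBatches limit bounds a single run, so a very large backlog is worked off across several scheduled runs instead of one long one.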


What To Test

The outbox pattern deserves tests at the failure boundaries, not only on the success path.

Test at least these cases:

Test | What it proves
Business transaction commits | Business row and outbox row both exist
Business transaction rolls back | Neither business row nor outbox row exists
Relay publishes successfully | Event is marked published after broker acknowledgement
Broker publish fails | Event returns to pending with backoff and error details
Relay crashes after claim | Lease recovery makes the row publishable again
Relay publishes twice | Consumer deduplication prevents duplicate side effects
Two relay workers run | Workers claim different rows without blocking each other
Event schema changes | Consumer contract tests catch incompatible payload changes

For the relay itself, integration tests are more useful than isolated unit tests because the behavior depends on database locks, transactions, and row state. Use the same testing mindset from How to Write API Integration Tests: prove the boundary where the bug would actually happen.


Common Mistakes

The outbox pattern fails most often when teams weaken the exact guarantee they introduced it for.

Common mistakes:

Mistake | Why it breaks the pattern
Writing the outbox row after the transaction | Reintroduces the commit-then-crash gap
Publishing inside the transaction | Holds database locks around ambiguous network I/O
Marking published before broker acknowledgement | Can lose messages on relay crash
Assuming no duplicates | Breaks consumers during relay or broker retries
Using no event ID | Makes deduplication and tracing much harder
Ignoring event versioning | Makes consumer compatibility accidental
No stuck-row recovery | Leaves rows claimed by a crashed relay stuck in processing forever
No backlog-age alert | Hides delayed downstream workflows
Deleting published rows immediately | Removes evidence needed for incidents and replay
Treating the outbox as a performance feature | The primary value is correctness and recoverability

Most of these are not exotic distributed-systems failures. They are ordinary production interleavings that happen during deploys, broker incidents, slow queries, worker crashes, and retry storms.


When To Use The Pattern

Use the transactional outbox pattern when:

  • a committed database change must be reflected in a message or event
  • losing the event creates product, financial, or data-consistency risk
  • downstream services rely on events for business workflow
  • a message can be retried safely if consumers are idempotent
  • the service owns both the business write and the decision to publish

Good fits:

  • order lifecycle events
  • payment state changes
  • billing and invoice workflows
  • search indexing triggers
  • notification jobs
  • webhook dispatch
  • fulfillment or provisioning workflows

Do not use it only because an architecture diagram looks more "event-driven."

It may be unnecessary when:

  • the event is informational and easy to rebuild later
  • occasional event loss is acceptable
  • synchronous coupling is intentional and simpler
  • an existing platform already provides the durable workflow
  • the team cannot yet operate the relay, monitoring, and consumer idempotency

The cost is real: another table, relay code, retry policy, cleanup job, monitoring, and consumer deduplication.

The value is also real: when the database says something happened, the system has a durable path for telling the rest of the architecture.


Implementation Checklist

Before trusting an outbox in production, confirm:

  • The business row and outbox row are written in the same database transaction.
  • No broker publish call happens inside that transaction.
  • Every event has a stable event ID.
  • The event payload is treated as a consumer-facing contract.
  • The relay claims rows atomically and keeps claim transactions short.
  • Stuck processing rows recover after a lease expires.
  • Publish failures use bounded retry with backoff.
  • Poison events move to a visible dead state.
  • Consumers deduplicate by event ID in the same transaction as their side effect.
  • Ordering requirements are written down per aggregate or per topic.
  • Oldest pending event age is monitored and alerted.
  • Published rows have a retention and cleanup policy.
  • CDC is chosen only when the team can operate the extra infrastructure.

If those checks are missing, the table may exist, but the reliability boundary is still incomplete.


Final Takeaway

The transactional outbox pattern works because it stops pretending that a database write and a broker publish are one atomic operation.

They are not.

Instead, it makes one smaller promise true: the business change and the durable publish intent commit together.

After that, the relay, broker, and consumers still need retries, idempotency, ordering policy, monitoring, and cleanup. That may sound like more work, but it is much better work than discovering later that the event disappeared and every downstream service quietly moved on without it.