Database Migration Rollback Strategy in Production

Database Migration Rollback Strategy in Production

A database migration rollback strategy is not just a down migration file. In production, application code can roll back quickly, but data often cannot return to its previous shape without losing writes, breaking compatibility, or creating a second incident.

The safest rollback plan starts before the migration runs. It says which phases are reversible, which phases require a forward fix, which metrics stop the rollout, and when the old schema must remain in place until the system has proven the new path is stable.

This article focuses on rollback decisions for production database migrations. For the broader rollout pattern, read Safe Database Migrations in Production. For the surrounding database cluster, see SQL And Data Correctness.


Why Database Rollback Is Different

Application rollback usually means serving the previous build again. That is useful when the new code has a bug and the previous build can still run against the current environment.

Database rollback is harder because production data has moved. After a migration starts, several things may be true at the same time:

  • new rows were written in the new shape
  • old rows were partially backfilled
  • some application instances still expect the old shape
  • workers or scheduled jobs are behind the web deploy
  • constraints or indexes changed write behavior
  • downstream systems already observed migrated data

A normal application deploy can often be treated as a switch. A database migration is more like a temporary mixed state. Rollback means returning traffic to a known-compatible behavior, not pretending the mixed state never happened.

The practical question is:

Can the old code, new code, old schema, new schema, and partially migrated rows coexist without corrupting data?

If the answer is yes, rollback is usually manageable. If the answer is no, the next safest action is often a forward fix, not a reverse migration.


Keep Rollback Inside A Compatibility Window

The safest migration rollback window is the period after the schema has been expanded but before old compatibility has been removed.

For a typical column migration, the rollback-friendly sequence looks like this:

PhaseExample changeRollback posture
ExpandAdd full_name while keeping nameOld code still works
CompatibilityNew code writes both columns and can read fallbackCode can roll back safely
BackfillHistorical rows copy name into full_namePause and resume; do not drop old source
Switch readsNew code prefers full_nameReads can switch back to fallback
Stop old writesNew writes only update full_nameRollback needs a deliberate compatibility check
ContractDrop name and remove fallback codeRollback is no longer cheap

This is why destructive work belongs at the end. Dropping a column, deleting a table, removing an old enum value, or deleting fallback code usually ends the easy rollback window.

The migration may still be recoverable after that point, but recovery is a different operation. It may need backup restore, data repair, emergency compatibility code, or a new migration.


Classify Each Migration Step Before Shipping

A useful rollback plan classifies each step by what can be safely undone.

Migration stepUsually safe rollbackHidden risk
Add nullable columnLeave the column in place and roll back codeNew code may already write values old code ignores
Add new tableLeave the table in place and roll back codeSide effects may have been recorded only in the new table
Add indexDrop the index later if neededBuild load, lock waits, invalid index cleanup, write overhead
Add dual writesRoll back code while old writes continueNew values may be incomplete or inconsistent
Backfill historical rowsPause the job and keep compatibility code activeReversing the backfill can erase legitimate new writes
Switch readsSwitch back to fallback or old sourceNew-only rows may no longer have an old-source value
Add constraintDrop or disable the constraint if safeOld code may now produce rows the constraint rejects
Drop old column or tableUsually not a simple rollbackData may be lost or only recoverable from backup

The plan does not need to make every step reversible. It needs to make irreversible steps explicit.

If a migration step cannot be rolled back safely, label it that way in the release notes and require stronger proof before it runs. That proof might be zero fallback reads, no old writers, completed backfill validation, successful shadow reads, or a specific manual approval.


Prefer Forward Fixes After Data Has Moved

Teams often say "rollback" when they mean "recover." Those are not always the same thing.

Rollback is useful when returning to the previous application behavior prevents more damage. Forward fix is usually safer when the new data shape has already become authoritative.

For example, suppose a migration adds users.full_name, dual-writes name and full_name, backfills old rows, and then switches reads to full_name. After the read switch, a bug appears: some new signup path writes full_name as an empty string.

There are two possible responses:

ResponseWhen it helpsWhen it is risky
Roll reads back to nameOld writes are still correct and every affected row has nameNew-only rows or newer write paths no longer maintain name
Forward fix the writer and repair bad rowsThe new column is already the intended source of truthThe repair query is uncertain or would overwrite valid user edits

The safer choice depends on what remains compatible. If old and new values are still maintained together, rollback can be clean. If the old value stopped receiving updates, rolling back reads can show stale data.

This is why the migration plan should define a source of truth for each phase. Without that, rollback decisions become improvised during an incident.


A Concrete Rollback Matrix

Before a migration runs, write a small matrix that answers what rollback means at each phase.

For the users.name to users.full_name migration, it could look like this:

Current phaseStop conditionRollback actionData repair
Schema expandedDDL caused unexpected lock waitsStop rollout; leave full_name in placeNone
Dual write deployedError rate rises in user updatesRoll back application; keep schema expandedInspect rows written during failed deploy
Backfill runningReplication lag or p99 latency crosses thresholdPause backfill; keep dual writes activeResume from checkpoint after capacity recovers
Read switch deployedDisplay names missing or wrongSwitch reads back to fallback pathCompare name and full_name for affected rows
Old writes disabledOld clients or workers still write nameRe-enable compatibility write or deploy forward fixReconcile rows changed during gap
Old column droppedOld code path still existsEmergency compatibility migration or restore pathRecover from backup or event history if needed

This matrix is more useful than a generic down migration because it connects rollback to live system signals. It also makes the dangerous point obvious: after the old column is dropped, rollback is no longer a normal deploy action.

The same discipline applies when the migration touches API behavior. If a request can partially commit database state, the rollback plan should be tested through the API boundary, not only through SQL. That is why Database Transaction Boundaries in Backend APIs and How to Write API Integration Tests are part of the same risk area.


Do Not Blindly Reverse Backfills

Backfills create one of the easiest rollback traps.

The tempting idea is:

UPDATE users
SET full_name = NULL
WHERE migrated_by = 'users_full_name_backfill';

That might be safe only if the backfill is the only writer to full_name. In real systems, application writes, admin edits, imports, and repair jobs may run while the backfill is active.

A reverse update can erase legitimate new values.

Safer backfill recovery usually uses these rules:

  1. keep the old source column or table until the migration is fully proven
  2. make the backfill idempotent so it can run again without changing correct rows
  3. checkpoint progress so the job can pause and resume
  4. record enough metadata to audit what the job touched
  5. repair only rows that match a narrow, verified bad state

For example, a repair query should usually be shaped like this:

UPDATE users
SET full_name = name
WHERE full_name IS NULL
  AND name IS NOT NULL;

That query fills a missing value from a still-valid source. It does not overwrite every row the migration ever touched.

For higher-risk changes, add a migration audit table:

CREATE TABLE migration_row_audit (
  migration_name TEXT NOT NULL,
  table_name TEXT NOT NULL,
  row_id BIGINT NOT NULL,
  old_value TEXT,
  new_value TEXT,
  changed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (migration_name, table_name, row_id)
);

This is not needed for every small migration. It becomes valuable when a backfill transforms values, changes identifiers, normalizes text, deduplicates rows, or changes money, permissions, ownership, billing, or audit data.


Treat Constraints And Indexes As Rollback Decisions

Schema rollback is not only about columns. Constraints and indexes can change production behavior too.

A new constraint may reject writes that old code still sends. A new unique index may expose duplicate historical data. A new regular index may slow writes enough to move the bottleneck elsewhere. A failed index build may leave cleanup work behind.

For PostgreSQL, exact behavior depends on the command and version, so use the official docs for the specific operation:

Before adding a constraint or index, answer:

QuestionWhy it matters
Can old code still write valid rows?Prevents rollback from immediately failing writes
Is historical data already clean?Prevents validation failure or surprise duplicates
Can this be added without blocking hot traffic?Avoids turning a correctness change into an availability incident
How do we remove or disable it if needed?Makes rollback action concrete
Which metric proves it helped?Prevents keeping speculative indexes or constraints forever

The rollback action may be simple, such as dropping a new index later. The decision is still production-sensitive because dropping or building database objects can consume resources and take locks.


Application Rollback Is Not Enough

A common migration failure starts with a harmless-looking code rollback.

The previous application version comes back. Traffic recovers briefly. Then workers fail because the schema was already contracted, or old code writes stale values because new code stopped maintaining the old column.

Treat application rollback as safe only when these conditions are true:

ConditionWhy it matters
Old code can read the current schemaPrevents missing-column and missing-table failures
Old code can write valid rowsPrevents constraint and enum failures
Old code sees a valid source of truthPrevents stale or contradictory reads
Workers and scripts follow the same compatibility rulePrevents non-web writers from corrupting data
Queued messages from both versions are understoodPrevents delayed jobs from replaying old assumptions

This is especially important in systems with background work. A web deploy may roll back in minutes, while job queues still contain messages created by the failed version. The operational patterns in Background Jobs in Production matter here because delayed work is often where old assumptions survive longest.


Build The Rollback Runbook

A good rollback runbook is short enough to use under pressure. It should name the decision points, not describe the entire migration in prose.

Use a structure like this:

Runbook itemExample
OwnerDatabase migration lead or on-call engineer
Impacted pathsSignup, profile update, admin user edit, receipt worker
Current phaseDual write, backfill, read switch, or contract
Stop signalsError rate, p99 latency, lock waits, replication lag, fallback reads
Safe rollback actionRoll back app only, pause backfill, switch reads back, re-enable dual writes
Unsafe rollback actionDrop new data, reverse backfill, remove constraint without validation
Verification queryCount mismatched rows or rows still using fallback
CommunicationWho must approve contract or destructive repair

The runbook should also include a "do not do this" section. That sounds obvious, but it prevents the most dangerous shortcut during an incident: running a reverse migration that was generated by a framework but never reviewed for production data.

For example:

Do not run the generated down migration after the read switch.
It drops `users.full_name` and would lose writes created after 09:30 UTC.
If the read switch fails, set `USER_NAME_READ_SOURCE=fallback` and page the migration owner.

That one sentence is often more useful than a beautifully generated rollback file.


Verification Queries Make Rollback Safer

Rollback decisions are easier when the migration has direct evidence.

For the users.full_name migration, useful checks might include:

-- Rows still depending on fallback reads.
SELECT count(*)
FROM users
WHERE full_name IS NULL
  AND name IS NOT NULL;
-- Rows where dual writes disagree.
SELECT id, name, full_name, updated_at
FROM users
WHERE name IS DISTINCT FROM full_name
ORDER BY updated_at DESC
LIMIT 100;
-- Long-running transactions that may block schema work.
SELECT pid, state, now() - xact_start AS age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY age DESC
LIMIT 20;

These queries are examples, not a universal checklist. The important pattern is that each rollout phase should have a query or metric that proves whether it is safe to move forward, pause, or roll back.

Without verification, teams tend to use time as proof: "It has been running for a day, so it is probably fine." Time is not proof. Quiet metrics and clean validation are proof.


When Rollback Should Stop Being Available

Keeping compatibility forever creates its own risk. Temporary dual writes, fallback reads, migration flags, and old columns make the code harder to reason about. At some point the migration needs to finish.

The key is to make the end of rollback explicit.

Before the contract phase, require evidence such as:

  • fallback reads stayed at zero for a full business cycle
  • all application instances, workers, cron jobs, and scripts have deployed the new path
  • new writes no longer depend on the old schema
  • backfill validation queries return zero mismatches
  • dashboards and alerts use the new field or table
  • backup and restore expectations are understood

After that, the plan should say:

After `users.name` is dropped, rollback to code that reads `name` is not supported.
Recovery requires restoring the column from backup or deploying an emergency compatibility migration.

That statement is not pessimistic. It is honest. It prevents a future incident from discovering the rollback boundary too late.


Practical Checklist

Before the migration:

  1. classify every step as reversible, pauseable, forward-fixable, or destructive
  2. define the compatibility window and the point of no easy rollback
  3. keep old data sources until the new path is proven
  4. write phase-specific rollback actions, not one generic down migration
  5. prepare verification queries for fallback reads, mismatches, locks, and backfill progress
  6. include workers, scripts, queues, imports, and admin tools in the compatibility review
  7. decide when a forward fix is safer than rollback

During rollout:

  1. expand before switching behavior
  2. keep dual writes or fallback reads until rollback risk is low
  3. pause backfills when production health crosses the stop threshold
  4. avoid reverse updates unless the affected row set is narrow and verified
  5. switch reads back before changing data destructively
  6. do not contract while old usage signals are still non-zero

After rollout:

  1. remove temporary compatibility code deliberately
  2. remove old schema only after the contract gate passes
  3. update dashboards, alerts, tests, and runbooks
  4. keep notes about when normal rollback stopped being supported
  5. review the migration for future patterns the team should repeat or avoid

The Short Version

A production database migration rollback strategy should protect compatibility, not blindly reverse SQL.

Keep old and new shapes alive while the system is mixed. Pause backfills instead of undoing them. Use forward fixes after the new data shape becomes authoritative. Move destructive cleanup to the end. Most importantly, write down the exact point where ordinary rollback stops being safe.