
Database Migration Rollback Strategy in Production
A database migration rollback strategy is not just a down migration file. In production, application code can roll back quickly, but data often cannot return to its previous shape without losing writes, breaking compatibility, or creating a second incident.
The safest rollback plan starts before the migration runs. It says which phases are reversible, which phases require a forward fix, which metrics stop the rollout, and when the old schema must remain in place until the system has proven the new path is stable.
This article focuses on rollback decisions for production database migrations. For the broader rollout pattern, read Safe Database Migrations in Production. For the surrounding database cluster, see SQL And Data Correctness.
Why Database Rollback Is Different
Application rollback usually means serving the previous build again. That is useful when the new code has a bug and the previous build can still run against the current environment.
Database rollback is harder because production data has moved. After a migration starts, several things may be true at the same time:
- new rows were written in the new shape
- old rows were partially backfilled
- some application instances still expect the old shape
- workers or scheduled jobs are behind the web deploy
- constraints or indexes changed write behavior
- downstream systems already observed migrated data
A normal application deploy can often be treated as a switch. A database migration is more like a temporary mixed state. Rollback means returning traffic to a known-compatible behavior, not pretending the mixed state never happened.
The practical question is:
Can the old code, new code, old schema, new schema, and partially migrated rows coexist without corrupting data?
If the answer is yes, rollback is usually manageable. If the answer is no, the next safest action is often a forward fix, not a reverse migration.
Keep Rollback Inside A Compatibility Window
The safest migration rollback window is the period after the schema has been expanded but before old compatibility has been removed.
For a typical column migration, the rollback-friendly sequence looks like this:
| Phase | Example change | Rollback posture |
|---|---|---|
| Expand | Add full_name while keeping name | Old code still works |
| Compatibility | New code writes both columns and can read fallback | Code can roll back safely |
| Backfill | Historical rows copy name into full_name | Pause and resume; do not drop old source |
| Switch reads | New code prefers full_name | Reads can switch back to fallback |
| Stop old writes | New writes only update full_name | Rollback needs a deliberate compatibility check |
| Contract | Drop name and remove fallback code | Rollback is no longer cheap |
This is why destructive work belongs at the end. Dropping a column, deleting a table, removing an old enum value, or deleting fallback code usually ends the easy rollback window.
The migration may still be recoverable after that point, but recovery is a different operation. It may need backup restore, data repair, emergency compatibility code, or a new migration.
Classify Each Migration Step Before Shipping
A useful rollback plan classifies each step by what can be safely undone.
| Migration step | Usually safe rollback | Hidden risk |
|---|---|---|
| Add nullable column | Leave the column in place and roll back code | New code may already write values old code ignores |
| Add new table | Leave the table in place and roll back code | Side effects may have been recorded only in the new table |
| Add index | Drop the index later if needed | Build load, lock waits, invalid index cleanup, write overhead |
| Add dual writes | Roll back code while old writes continue | New values may be incomplete or inconsistent |
| Backfill historical rows | Pause the job and keep compatibility code active | Reversing the backfill can erase legitimate new writes |
| Switch reads | Switch back to fallback or old source | New-only rows may no longer have an old-source value |
| Add constraint | Drop or disable the constraint if safe | Old code may now produce rows the constraint rejects |
| Drop old column or table | Usually not a simple rollback | Data may be lost or only recoverable from backup |
The plan does not need to make every step reversible. It needs to make irreversible steps explicit.
If a migration step cannot be rolled back safely, label it that way in the release notes and require stronger proof before it runs. That proof might be zero fallback reads, no old writers, completed backfill validation, successful shadow reads, or a specific manual approval.
Prefer Forward Fixes After Data Has Moved
Teams often say "rollback" when they mean "recover." Those are not always the same thing.
Rollback is useful when returning to the previous application behavior prevents more damage. Forward fix is usually safer when the new data shape has already become authoritative.
For example, suppose a migration adds users.full_name, dual-writes name and full_name, backfills old rows, and then switches reads to full_name. After the read switch, a bug appears: some new signup path writes full_name as an empty string.
There are two possible responses:
| Response | When it helps | When it is risky |
|---|---|---|
Roll reads back to name | Old writes are still correct and every affected row has name | New-only rows or newer write paths no longer maintain name |
| Forward fix the writer and repair bad rows | The new column is already the intended source of truth | The repair query is uncertain or would overwrite valid user edits |
The safer choice depends on what remains compatible. If old and new values are still maintained together, rollback can be clean. If the old value stopped receiving updates, rolling back reads can show stale data.
This is why the migration plan should define a source of truth for each phase. Without that, rollback decisions become improvised during an incident.
A Concrete Rollback Matrix
Before a migration runs, write a small matrix that answers what rollback means at each phase.
For the users.name to users.full_name migration, it could look like this:
| Current phase | Stop condition | Rollback action | Data repair |
|---|---|---|---|
| Schema expanded | DDL caused unexpected lock waits | Stop rollout; leave full_name in place | None |
| Dual write deployed | Error rate rises in user updates | Roll back application; keep schema expanded | Inspect rows written during failed deploy |
| Backfill running | Replication lag or p99 latency crosses threshold | Pause backfill; keep dual writes active | Resume from checkpoint after capacity recovers |
| Read switch deployed | Display names missing or wrong | Switch reads back to fallback path | Compare name and full_name for affected rows |
| Old writes disabled | Old clients or workers still write name | Re-enable compatibility write or deploy forward fix | Reconcile rows changed during gap |
| Old column dropped | Old code path still exists | Emergency compatibility migration or restore path | Recover from backup or event history if needed |
This matrix is more useful than a generic down migration because it connects rollback to live system signals. It also makes the dangerous point obvious: after the old column is dropped, rollback is no longer a normal deploy action.
The same discipline applies when the migration touches API behavior. If a request can partially commit database state, the rollback plan should be tested through the API boundary, not only through SQL. That is why Database Transaction Boundaries in Backend APIs and How to Write API Integration Tests are part of the same risk area.
Do Not Blindly Reverse Backfills
Backfills create one of the easiest rollback traps.
The tempting idea is:
UPDATE users
SET full_name = NULL
WHERE migrated_by = 'users_full_name_backfill';
That might be safe only if the backfill is the only writer to full_name. In real systems, application writes, admin edits, imports, and repair jobs may run while the backfill is active.
A reverse update can erase legitimate new values.
Safer backfill recovery usually uses these rules:
- keep the old source column or table until the migration is fully proven
- make the backfill idempotent so it can run again without changing correct rows
- checkpoint progress so the job can pause and resume
- record enough metadata to audit what the job touched
- repair only rows that match a narrow, verified bad state
For example, a repair query should usually be shaped like this:
UPDATE users
SET full_name = name
WHERE full_name IS NULL
AND name IS NOT NULL;
That query fills a missing value from a still-valid source. It does not overwrite every row the migration ever touched.
For higher-risk changes, add a migration audit table:
CREATE TABLE migration_row_audit (
migration_name TEXT NOT NULL,
table_name TEXT NOT NULL,
row_id BIGINT NOT NULL,
old_value TEXT,
new_value TEXT,
changed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (migration_name, table_name, row_id)
);
This is not needed for every small migration. It becomes valuable when a backfill transforms values, changes identifiers, normalizes text, deduplicates rows, or changes money, permissions, ownership, billing, or audit data.
Treat Constraints And Indexes As Rollback Decisions
Schema rollback is not only about columns. Constraints and indexes can change production behavior too.
A new constraint may reject writes that old code still sends. A new unique index may expose duplicate historical data. A new regular index may slow writes enough to move the bottleneck elsewhere. A failed index build may leave cleanup work behind.
For PostgreSQL, exact behavior depends on the command and version, so use the official docs for the specific operation:
- PostgreSQL ALTER TABLE for constraints, column changes, and locking behavior
- PostgreSQL CREATE INDEX for
CONCURRENTLYbehavior and limitations - PostgreSQL explicit locking for lock modes and blocking risk
Before adding a constraint or index, answer:
| Question | Why it matters |
|---|---|
| Can old code still write valid rows? | Prevents rollback from immediately failing writes |
| Is historical data already clean? | Prevents validation failure or surprise duplicates |
| Can this be added without blocking hot traffic? | Avoids turning a correctness change into an availability incident |
| How do we remove or disable it if needed? | Makes rollback action concrete |
| Which metric proves it helped? | Prevents keeping speculative indexes or constraints forever |
The rollback action may be simple, such as dropping a new index later. The decision is still production-sensitive because dropping or building database objects can consume resources and take locks.
Application Rollback Is Not Enough
A common migration failure starts with a harmless-looking code rollback.
The previous application version comes back. Traffic recovers briefly. Then workers fail because the schema was already contracted, or old code writes stale values because new code stopped maintaining the old column.
Treat application rollback as safe only when these conditions are true:
| Condition | Why it matters |
|---|---|
| Old code can read the current schema | Prevents missing-column and missing-table failures |
| Old code can write valid rows | Prevents constraint and enum failures |
| Old code sees a valid source of truth | Prevents stale or contradictory reads |
| Workers and scripts follow the same compatibility rule | Prevents non-web writers from corrupting data |
| Queued messages from both versions are understood | Prevents delayed jobs from replaying old assumptions |
This is especially important in systems with background work. A web deploy may roll back in minutes, while job queues still contain messages created by the failed version. The operational patterns in Background Jobs in Production matter here because delayed work is often where old assumptions survive longest.
Build The Rollback Runbook
A good rollback runbook is short enough to use under pressure. It should name the decision points, not describe the entire migration in prose.
Use a structure like this:
| Runbook item | Example |
|---|---|
| Owner | Database migration lead or on-call engineer |
| Impacted paths | Signup, profile update, admin user edit, receipt worker |
| Current phase | Dual write, backfill, read switch, or contract |
| Stop signals | Error rate, p99 latency, lock waits, replication lag, fallback reads |
| Safe rollback action | Roll back app only, pause backfill, switch reads back, re-enable dual writes |
| Unsafe rollback action | Drop new data, reverse backfill, remove constraint without validation |
| Verification query | Count mismatched rows or rows still using fallback |
| Communication | Who must approve contract or destructive repair |
The runbook should also include a "do not do this" section. That sounds obvious, but it prevents the most dangerous shortcut during an incident: running a reverse migration that was generated by a framework but never reviewed for production data.
For example:
Do not run the generated down migration after the read switch.
It drops `users.full_name` and would lose writes created after 09:30 UTC.
If the read switch fails, set `USER_NAME_READ_SOURCE=fallback` and page the migration owner.
That one sentence is often more useful than a beautifully generated rollback file.
Verification Queries Make Rollback Safer
Rollback decisions are easier when the migration has direct evidence.
For the users.full_name migration, useful checks might include:
-- Rows still depending on fallback reads.
SELECT count(*)
FROM users
WHERE full_name IS NULL
AND name IS NOT NULL;
-- Rows where dual writes disagree.
SELECT id, name, full_name, updated_at
FROM users
WHERE name IS DISTINCT FROM full_name
ORDER BY updated_at DESC
LIMIT 100;
-- Long-running transactions that may block schema work.
SELECT pid, state, now() - xact_start AS age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY age DESC
LIMIT 20;
These queries are examples, not a universal checklist. The important pattern is that each rollout phase should have a query or metric that proves whether it is safe to move forward, pause, or roll back.
Without verification, teams tend to use time as proof: "It has been running for a day, so it is probably fine." Time is not proof. Quiet metrics and clean validation are proof.
When Rollback Should Stop Being Available
Keeping compatibility forever creates its own risk. Temporary dual writes, fallback reads, migration flags, and old columns make the code harder to reason about. At some point the migration needs to finish.
The key is to make the end of rollback explicit.
Before the contract phase, require evidence such as:
- fallback reads stayed at zero for a full business cycle
- all application instances, workers, cron jobs, and scripts have deployed the new path
- new writes no longer depend on the old schema
- backfill validation queries return zero mismatches
- dashboards and alerts use the new field or table
- backup and restore expectations are understood
After that, the plan should say:
After `users.name` is dropped, rollback to code that reads `name` is not supported.
Recovery requires restoring the column from backup or deploying an emergency compatibility migration.
That statement is not pessimistic. It is honest. It prevents a future incident from discovering the rollback boundary too late.
Practical Checklist
Before the migration:
- classify every step as reversible, pauseable, forward-fixable, or destructive
- define the compatibility window and the point of no easy rollback
- keep old data sources until the new path is proven
- write phase-specific rollback actions, not one generic down migration
- prepare verification queries for fallback reads, mismatches, locks, and backfill progress
- include workers, scripts, queues, imports, and admin tools in the compatibility review
- decide when a forward fix is safer than rollback
During rollout:
- expand before switching behavior
- keep dual writes or fallback reads until rollback risk is low
- pause backfills when production health crosses the stop threshold
- avoid reverse updates unless the affected row set is narrow and verified
- switch reads back before changing data destructively
- do not contract while old usage signals are still non-zero
After rollout:
- remove temporary compatibility code deliberately
- remove old schema only after the contract gate passes
- update dashboards, alerts, tests, and runbooks
- keep notes about when normal rollback stopped being supported
- review the migration for future patterns the team should repeat or avoid
The Short Version
A production database migration rollback strategy should protect compatibility, not blindly reverse SQL.
Keep old and new shapes alive while the system is mixed. Pause backfills instead of undoing them. Use forward fixes after the new data shape becomes authoritative. Move destructive cleanup to the end. Most importantly, write down the exact point where ordinary rollback stops being safe.