Safe Database Migrations in Production

Database migrations become dangerous when they are treated like isolated code edits.

The SQL may be simple. The production rollout usually is not.

What makes migrations risky is not just DDL. It is that the system temporarily has to survive multiple realities at once: old code and new code, old schema and new schema, backfilled and not-yet-backfilled rows, and request traffic alongside long-running background jobs.

That is why safe migrations are usually not "run SQL, then deploy code." They are compatibility rollouts.


The Core Principle

The safest default for most production schema changes is expand and contract.

That means:

  1. expand the schema in a backward-compatible way
  2. deploy code that can tolerate both shapes
  3. backfill data gradually
  4. switch reads and writes deliberately
  5. remove the old path only after it is truly unused

This is slower than one-step replacement. It is also how you avoid turning schema evolution into an outage.


Why Straightforward Migrations Break In Production

Consider a seemingly simple change: rename a column, make a field non-null, replace one enum representation with another, split one table into two, or add an index to support a hot query.

In development, this often feels easy: run the migration, run the app, everything works.

In production, failure modes show up around the rollout boundary: old instances still expect the previous schema, workers or cron jobs deploy later than web nodes, backfill has not reached all rows yet, one migration step rewrites or locks a hot table, or rollback restores code but not data shape.

The problem is not that migrations are special. The problem is that production rarely switches state all at once.


A Concrete Example: Renaming A Column Safely

Suppose users.name needs to become users.full_name.

The risky rollout:

  1. rename the column
  2. deploy new code

That breaks if:

  • any old instance still reads name
  • a background job deploys later
  • another service was missed
  • rollback becomes necessary

A safer rollout:

  1. add full_name as a new nullable column
  2. deploy code that writes both name and full_name
  3. backfill historical rows
  4. switch reads to full_name
  5. stop writing name
  6. remove name only after the old path is truly gone

That is the essence of expand and contract:

add before switching, switch before removing
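The dual-write step of that rollout can be sketched as a tiny helper, assuming a Prisma-style client like the later examples in this article; `buildUserWrite` is a hypothetical name:

```javascript
// Hypothetical helper for the dual-write phase of a column rename.
// While both columns exist, every write populates both, so old readers
// (still on `name`) and new readers (on `full_name`) both stay correct.
function buildUserWrite(fullName) {
  return {
    name: fullName,      // old column: still read by not-yet-updated code
    full_name: fullName, // new column: the future source of truth
  };
}

// Usage with a Prisma-style client:
// await db.user.update({ where: { id: userId }, data: buildUserWrite('Ada Lovelace') });
```

Because both columns are written on every update, any instance running either version of the code sees consistent data for rows touched during the rollout.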


Rollback Is Not A Time Machine

This is one of the most expensive migration lessons teams learn late.

Rolling back application code does not automatically roll back data shape, partial backfills, new columns, new indexes, or mixed-state rows created during the rollout. If the migration introduced a new nullable column and the new code has already started writing to it, the old code may still be able to run. But if the rollout changed assumptions more deeply, a code rollback can leave the system in an awkward in-between state rather than a clean previous state.

That is why rollback planning for migrations should answer two separate questions:

  1. can the previous application version still run safely against the expanded schema?
  2. what durable data changes made during the rollout must now be tolerated or repaired?

The safest migrations are the ones where rollback is boring because compatibility was preserved, not the ones where rollback is heroic.


What Expand And Contract Looks Like In Practice

Step 1: Expand the schema safely

Introduce the new structure without breaking the old path.

Examples include adding a nullable column, a new table, a new index, or a new enum value.

At this stage, the old application should still run unchanged.

Step 2: Deploy compatibility code

Make the application tolerate both shapes.

That may mean dual writing to old and new columns, reading from the new field when present and otherwise falling back, tolerating old and new shapes in internal mapping code, or gating new behavior until data is ready.

This temporary compatibility layer is what makes the rollout safe.

Step 3: Backfill gradually

Historical rows need to move too.

For larger tables, process in batches, throttle deliberately, make the job resumable, and watch lock time and replication lag.

Backfills should behave like production jobs, not one-off scripts that assume perfect conditions.
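The shape of such a job can be sketched in a few lines. `fetchBatch` and `writeBatch` are hypothetical stand-ins for your data access layer; what matters is the structure: bounded batches, a cursor you could persist to resume after a stop, and a deliberate pause between batches to throttle load.

```javascript
// A minimal batched, resumable backfill loop (sketch, not a complete job).
async function runBackfill({ fetchBatch, writeBatch, batchSize = 500, pauseMs = 100 }) {
  let cursor = 0; // persist this checkpoint externally so the job is resumable
  for (;;) {
    const rows = await fetchBatch(cursor, batchSize); // rows with id > cursor
    if (rows.length === 0) break;
    await writeBatch(rows);
    cursor = rows[rows.length - 1].id;
    await new Promise((resolve) => setTimeout(resolve, pauseMs)); // throttle
  }
  return cursor; // last processed id
}
```

Because the cursor is the only state, the job can crash at any batch boundary and restart from the last checkpoint instead of from zero.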

Step 4: Switch reads

After new writes and enough backfill are verified, move reads to the new shape.

Do this only after confirming that new writes populate the new path correctly, old rows are sufficiently backfilled, and fallback behavior is understood.

Step 5: Stop old writes

Once reads are stable, stop writing the old field. From this point the new column is authoritative, and the old one exists only as a safety margin for the remaining rollout.

Step 6: Contract the schema

Only now should you remove old columns, old indexes, old fallback logic, and obsolete compatibility paths.

Removal is the last step, not the first real one.


Another Practical Example

Suppose you want to move from status TEXT to status_code INT.

Start with schema expansion:

ALTER TABLE orders ADD COLUMN status_code INT;

Deploy compatibility code that writes both:

await db.order.update({
  where: { id: orderId },
  data: {
    status: 'paid',
    statusCode: 2,
  },
});

Backfill existing rows:

-- for large tables, run this in bounded batches rather than as one statement
UPDATE orders
SET status_code = CASE status
  WHEN 'pending' THEN 1
  WHEN 'paid' THEN 2
  WHEN 'failed' THEN 3
END
WHERE status_code IS NULL;

Then switch reads:

function mapOrderStatus(order) {
  if (order.statusCode !== null) {
    // decodeStatus maps the integer code back to its status label
    return decodeStatus(order.statusCode);
  }

  // fallback for rows the backfill has not reached yet
  return order.status;
}

Only after all readers and writers are stable should status be removed.

That looks verbose compared to a direct replacement. It is also much more tolerant of staggered deploys, retries, and rollback pressure.


Backfills Need Their Own Runbook

Backfills are often treated like the simple middle step between schema expansion and read switching. In practice, they are usually the part most likely to create hidden production pressure.

A large backfill can increase replication lag, lengthen lock times, compete with request traffic, and trigger query plans that looked harmless in development. That is why backfill design usually deserves its own operational plan: batch boundaries, pause and resume behavior, checkpointing, throttling, and clear observability during the run.

For important tables, it is often worth deciding ahead of time how the backfill will prove progress and how the system will know when it is safe to continue the rollout. If the answer is "we will watch it manually," the plan is still incomplete.


The Migration Cases That Usually Need Extra Care

Adding a non-null column to a large table

The dangerous part is not adding the column. It is making historical rows valid without rewriting the whole table under load.

The safer sequence is to add it as nullable, backfill it, and enforce non-null later.

Renaming live columns

Database rename support does not remove application compatibility risk. The risk lives in everything still reading the old field.

Creating indexes on hot tables

Indexes are often required, but they can add lock risk, write cost, and rollout pressure on high-traffic tables. On PostgreSQL, for example, CREATE INDEX CONCURRENTLY builds the index without holding a long write lock, at the cost of a slower, non-transactional build.

This matters especially when the migration exists to support a performance fix, as in How to Find and Fix Slow SQL Queries in Production and Why Database Indexes Didn’t Fix Your Slow Query.

Splitting tables

This usually requires dual writes, read fallback logic, and validation that old and new structures stay consistent.

That is often an application migration disguised as a schema migration.

Changes that affect public API contracts

If the schema change also alters request or response semantics, the rollout has to preserve client compatibility too. That is where migrations overlap with API Versioning Without Breaking Clients.


Why Mixed-State Data Matters So Much

One of the most common migration mistakes is testing only the final desired state.

Production spends time in intermediate states where rows are not backfilled yet, old and new columns are both present, readers are split between old and new assumptions, and background jobs still use older application code.

That mixed state is where many incidents happen.

It is not just an application concern. Concurrency and transaction visibility can also matter while reads, writes, and backfills overlap. For that side of the model, see SQL Isolation Levels Explained.


What To Test Before Production

Migration safety is not just "did the SQL run?"

You want confidence that:

1. Old code still works after schema expansion

The currently deployed version should survive the expansion step.

2. New code works with mixed-state data

Test rows that are not backfilled, partially backfilled, and fully backfilled.
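Mixed-state reads are cheap to unit test because the fallback is pure logic. A sketch, restating the fallback reader from the earlier example with an illustrative code-to-label table (the numeric values are assumptions):

```javascript
// Illustrative code-to-label mapping; the specific values are assumptions.
const STATUS_BY_CODE = { 1: 'pending', 2: 'paid', 3: 'failed' };

// Fallback reader: prefer the new column when the row has been backfilled,
// otherwise fall back to the old text column.
function mapOrderStatus(order) {
  if (order.statusCode !== null && order.statusCode !== undefined) {
    return STATUS_BY_CODE[order.statusCode];
  }
  return order.status; // not-yet-backfilled row
}
```

The tests should cover every row state the rollout will actually produce: not backfilled, backfilled, and backfilled rows where the new column should win over a stale old value.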

3. Backfill can pause and resume safely

If the process stops halfway, you should understand how it resumes.

4. Contracts and persistence still hold

If the migration affects the application boundary, integration tests that verify real writes and reads become especially valuable. See How to Write API Integration Tests.


Monitoring During The Rollout

Treat a production migration like any other high-risk change.

Watch query latency, lock wait time, endpoint error rate, database CPU and IO, replication lag, fallback-read frequency, and backfill progress and retry behavior.

If the migration affects critical traffic, define pause conditions before you start.

Good pause conditions are concrete: replication lag crosses a threshold, write latency rises beyond an agreed limit, error rate increases on the affected endpoint, or backfill progress stalls for longer than expected.
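Encoding those conditions up front can be as simple as one predicate evaluated against rollout metrics. A sketch, with threshold values that are purely illustrative and should be tuned per system:

```javascript
// Pre-agreed pause conditions for the rollout, checked against live metrics.
// All thresholds here are illustrative assumptions, not recommendations.
function shouldPauseRollout(metrics) {
  return (
    metrics.replicationLagSeconds > 10 ||      // replicas falling behind
    metrics.writeLatencyP99Ms > 250 ||         // writes slowing under backfill load
    metrics.endpointErrorRate > 0.01 ||        // affected endpoint degrading
    metrics.minutesSinceBackfillProgress > 15  // backfill stalled
  );
}
```

Whether this runs in an alerting rule or in the backfill job itself matters less than the fact that the thresholds were agreed before the rollout started.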

"We will see how it looks" is not a rollout strategy.


Common Mistakes

These are the mistakes that repeatedly make migrations riskier than they need to be:

  • removing old schema in the first deploy
  • assuming one deployment wave means one application version
  • enforcing strict constraints before history is ready
  • running backfills without throttling or resumability
  • forgetting workers, scripts, or secondary services
  • treating rollback as a guaranteed escape hatch
  • shipping the migration without production-oriented monitoring

Most migration failures are not caused by exotic database behavior. They come from sequencing mistakes and hidden dependencies.


A Practical Rollout Checklist

Before shipping:

  1. confirm whether the change is backward-compatible
  2. identify all readers and writers, including jobs and secondary services
  3. plan expansion, compatibility, backfill, read switch, and contraction separately
  4. decide how to pause safely if metrics worsen
  5. test mixed-state data, not just final state

During rollout:

  1. expand first
  2. deploy compatibility code
  3. backfill gradually
  4. switch reads with monitoring
  5. stop old writes
  6. contract only after stability is proven

This is usually slower than teams want. It is still much faster than debugging a production outage caused by a "simple" migration.


Final Thoughts

Safe database migrations are compatibility rollouts, not just schema edits.

They work when the system can tolerate overlap between old and new code, old and new schema, and old and new data assumptions.

The expand-and-contract pattern works because it accepts that production reality instead of pretending the whole system will switch at once.