Safe Database Migrations in Production

Database migrations become risky when schema changes are treated like isolated code changes instead of staged compatibility work. Production safety usually depends less on the DDL itself and more on the rollout sequence around it.

Why Straightforward Schema Changes Turn Risky

The schema needs to change.

A column name no longer fits the domain. A nullable field should become required. One table needs to be split. A new index must be added to support a growing query path. The application change itself may be straightforward, but once the database is involved, the rollout becomes risky.

That risk is easy to underestimate.

In many teams, the change looks harmless:

  • update the schema
  • deploy the migration
  • deploy the code
  • move on

Sometimes that works. Often it works only because production timing was favorable.

When it fails, the failure modes are expensive:

  • old application instances still expect the previous schema
  • new code reads fields that are not backfilled yet
  • large table changes lock traffic-critical queries
  • a rollback restores code but not data shape
  • background jobs and consumers continue using the old contract

The difficult part is not writing the migration. The difficult part is changing schema and application behavior without requiring the entire system to switch at once.


The Core Principle

Safe database migrations are usually not single-step changes. They are compatibility rollouts.

The goal is not merely to get the new schema into production. The goal is to keep the system working while multiple realities temporarily coexist:

  • old code and new code
  • old schema and new schema
  • backfilled and not-yet-backfilled data
  • online traffic and long-running background jobs

That is why the safest pattern is usually expand and contract.

Instead of replacing one state with another immediately:

  1. expand the schema in a backward-compatible way
  2. deploy code that can work with both versions
  3. migrate data gradually
  4. switch reads and writes
  5. contract only after old assumptions are gone

This is slower than a one-step migration. It is also how you avoid turning schema evolution into an outage.


What Expand and Contract Looks Like

Suppose users.full_name needs to replace users.name.

A risky rollout looks like this:

  1. rename the column
  2. deploy the app

That fails if:

  • old app instances are still reading name
  • background workers deploy later
  • one service was missed entirely
  • rollback is needed after the schema has already changed

A safer rollout looks like this:

  1. add full_name as a new nullable column
  2. deploy code that writes both name and full_name
  3. backfill existing rows
  4. switch reads from name to full_name
  5. stop writing name
  6. remove name only after the old path is fully gone

This is the core idea: add before switching, switch before removing.
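
Step 2 of the safer rollout can be concentrated in a single write helper. This is only a sketch; buildUserWrite is a hypothetical function, not any library's API, and the column names follow the example above.

```javascript
// Dual-write phase: while both columns exist, every write populates
// name and full_name together, so old and new readers both stay correct.
// buildUserWrite is a hypothetical helper, not a library API.
function buildUserWrite(fullName) {
  return {
    name: fullName,      // old column: still read by not-yet-deployed code
    full_name: fullName, // new column: being populated for the new path
  };
}
```

Keeping the dual write in one helper also makes step 5 (stop writing name) a single-site change later.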


Why Production Migrations Break More Often Than Expected

Schema changes fail in production for reasons that are usually outside the migration file itself.

Multiple application versions coexist

Even if deployment is fast, there is usually a period when:

  • old instances are still serving traffic
  • new instances are starting
  • workers are draining old jobs
  • scheduled jobs are using older code

If the migration is not backward-compatible across that window, deployment order becomes a hidden source of failure.

Data does not change atomically

Adding a new column is immediate. Making all existing rows conform to new assumptions is not.

Backfills take time. During that time, the application must tolerate mixed-state data.

Database changes can block live traffic

Some operations are logically simple but operationally expensive:

  • adding indexes on large tables
  • changing column types
  • setting non-null constraints
  • rewriting defaults
  • dropping columns still used by old queries

A schema migration can succeed technically while still causing lock contention, latency spikes, or connection pileups.

Rollback is asymmetric

Code rollback is often easy. Schema rollback often is not.

Once a column is dropped, a table is rewritten, or a backfill mutates live data, going backward becomes much harder than reverting application code.

That is why safe migrations are designed to be forward-compatible first, not dependent on easy reversal.


A Practical Expand-and-Contract Sequence

The exact steps vary by database and schema change, but the pattern is broadly stable.

Step 1: Expand the schema safely

Introduce new structures without removing old ones.

Examples:

  • add a new nullable column
  • add a new table
  • add a new index
  • add a new enum value before code depends on it

At this stage, old code should continue working unchanged.

Step 2: Deploy compatibility code

Update the application so it can operate across both schema versions.

Depending on the change, that may mean:

  • dual write to old and new columns
  • read from the new column when present, otherwise fall back
  • tolerate both old and new response shapes internally
  • gate new behavior until backfill completes

This is the phase where many teams underinvest. Compatibility code is temporary, but it is what makes the rollout safe.
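
A minimal read-side sketch of that compatibility layer, reusing the name/full_name example from earlier (the helper name is illustrative):

```javascript
// Compatibility read: prefer the new column, fall back for rows the
// backfill has not reached yet. This works unchanged before, during,
// and after the backfill.
function readDisplayName(row) {
  if (row.full_name !== null && row.full_name !== undefined) {
    return row.full_name;
  }
  return row.name; // not-yet-backfilled row
}
```

Instrumenting the fallback branch with a counter or log line is worthwhile: it is what later lets you prove the old path is dead before contracting.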

Feature flags are sometimes used to manage this transition, but they add their own long-term complexity if left unresolved. For that tradeoff, see When Feature Flags Increase System Complexity.

Step 3: Backfill existing data

Move historical rows gradually.

For large tables:

  • process in batches
  • throttle deliberately
  • monitor lock time and replication lag
  • make the backfill resumable

Backfills should behave like production jobs, not one-off scripts that assume perfect conditions.
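
The list above can be sketched as a loop. Everything about the db object here is an assumption (loadCheckpoint, saveCheckpoint, fetchBatch, and writeBatch are stand-ins, not a specific client's API), since the real client and checkpoint storage depend on your stack.

```javascript
// Batched, resumable backfill sketch. Progress is committed after each
// batch so an interrupted run restarts from its checkpoint, not from zero.
async function runBackfill(db, { batchSize = 1000, throttleMs = 100 } = {}) {
  let cursor = await db.loadCheckpoint(); // resume where the last run stopped
  for (;;) {
    // fetchBatch returns rows with id > cursor, ordered by id
    const rows = await db.fetchBatch(cursor, batchSize);
    if (rows.length === 0) break;
    await db.writeBatch(rows); // populate the new column for this batch
    cursor = rows[rows.length - 1].id;
    await db.saveCheckpoint(cursor); // commit progress: restart is cheap
    // deliberate throttle: leave headroom for live traffic and replication
    await new Promise((resolve) => setTimeout(resolve, throttleMs));
  }
  return cursor;
}
```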

Step 4: Switch reads

Once data is present and validated, move reads to the new schema path.

Do this only after confirming:

  • new writes are populating the new shape correctly
  • backfill coverage is complete enough
  • fallback behavior is understood

This transition is often safer than removing the old path, but it still needs monitoring.

Step 5: Stop old writes

After reads are stable, remove the dependency on old schema fields for new traffic.

At this stage, old fields may still exist, but they are no longer authoritative.
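
If dual writes live in one helper, stopping them can be a single-site, reversible change behind a switch. The flag and field names here are illustrative, not a real configuration system.

```javascript
// Step 5 sketch: the legacy write becomes opt-in, then is deleted entirely.
// writeLegacyName is a hypothetical rollout switch, defaulting to off.
function buildUserWrite(fullName, { writeLegacyName = false } = {}) {
  const data = { full_name: fullName }; // new column is now authoritative
  if (writeLegacyName) {
    data.name = fullName; // kept only while old readers might still exist
  }
  return data;
}
```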

Step 6: Contract the schema

Only now remove the old column, table, index, or constraint path.

This should happen after:

  • all application instances use the new path
  • old workers and cron jobs are gone
  • monitoring shows no fallback reads
  • rollback no longer depends on the old structure

Schema removal is the last step of the rollout, not the first.


Illustrative Example

Suppose you want to move from status TEXT to a stricter status_code INT.

A safe sequence might look like this:

ALTER TABLE orders ADD COLUMN status_code INT;

Deploy code that writes both fields:

await db.order.update({
  where: { id: orderId },
  data: {
    status: 'paid',
    statusCode: 2,
  },
});

Backfill historical rows:

UPDATE orders
SET status_code = CASE status
  WHEN 'pending' THEN 1
  WHEN 'paid' THEN 2
  WHEN 'failed' THEN 3
END
WHERE status_code IS NULL;

Then switch reads:

function decodeStatus(code) {
  // Inverse of the backfill mapping: status_code back to a readable status.
  const names = { 1: 'pending', 2: 'paid', 3: 'failed' };
  return names[code];
}

function mapOrderStatus(order) {
  // Prefer the new column; fall back for rows the backfill has not reached.
  if (order.statusCode !== null && order.statusCode !== undefined) {
    return decodeStatus(order.statusCode);
  }

  return order.status;
}

Only after all readers and writers are stable should status be removed.

This approach looks verbose compared to a direct type replacement. It is also far more tolerant of staggered deploys, retries, and rollback pressure.


The Migration Cases That Usually Need Extra Care

Some schema changes are especially likely to cause production problems.

Adding a non-null column to a large table

If the table already has data, the real problem is not adding the column. It is making all existing rows valid without rewriting the table under load.

Usually safer:

  • add as nullable
  • backfill
  • enforce non-null later
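
On PostgreSQL, for example, that sequence might look like the following. Table and column names are illustrative, and exact locking behavior varies by engine and version.

```sql
-- 1. Expand: adding a nullable column is a metadata-only change
ALTER TABLE orders ADD COLUMN currency TEXT;

-- 2. Backfill in throttled batches, outside the migration itself
--    (an application job or script, not one giant UPDATE)

-- 3. Contract: enforce the constraint once every row is populated.
--    Note: on PostgreSQL this still scans the table to verify.
ALTER TABLE orders ALTER COLUMN currency SET NOT NULL;
```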

Renaming columns used by live code

Database rename support does not remove application compatibility risk. If multiple services or background jobs use the old field, logical breakage can still happen even if the database operation is instant.

Creating indexes on hot tables

Indexes are often necessary, but on large or high-write tables they can introduce write amplification, locking, or rollout slowness.
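
PostgreSQL, as one example, offers a non-blocking build mode; other engines have analogous options. Index and table names here are illustrative.

```sql
-- Builds without holding a long write lock on the table.
-- Slower than a plain CREATE INDEX, cannot run inside a transaction,
-- and leaves an INVALID index behind if it fails (drop and retry).
CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders (created_at);
```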

That tradeoff matters especially when the migration is meant to support a performance fix, as in How to Find and Fix Slow SQL Queries in Production and Why Database Indexes Didn’t Fix Your Slow Query.

Splitting one table into multiple tables

This usually requires:

  • dual writes
  • read fallback logic
  • validation of consistency between the old and new structures

It is an application migration disguised as a schema migration.

Changes that affect public API contracts

If the schema change also changes request or response shape, the rollout has to preserve client compatibility too. That is where migration planning overlaps with API Versioning Without Breaking Clients.


What to Test Before Production

Migration safety is not just “did the SQL run successfully?”

You want confidence in at least four areas.

1. Old code still works after schema expansion

If you add a column or table, confirm the currently deployed version still behaves normally.

2. New code works with mixed-state data

This is where teams often miss bugs. Test rows that are:

  • not backfilled
  • partially backfilled
  • fully backfilled
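
A sketch of what those cases look like as assertions, reusing the status/status_code example from earlier. The fixtures and the readStatus helper are illustrative.

```javascript
// Mixed-state fixtures: the read path must give the right answer whether
// or not a given row has been backfilled yet.
function readStatus(order) {
  const names = { 1: 'pending', 2: 'paid', 3: 'failed' }; // backfill mapping
  return order.statusCode != null ? names[order.statusCode] : order.status;
}

const cases = [
  { order: { statusCode: null, status: 'paid' }, expected: 'paid' }, // not backfilled
  { order: { statusCode: 2, status: 'paid' }, expected: 'paid' },    // backfilled, consistent
  { order: { statusCode: 3, status: 'paid' }, expected: 'failed' },  // new column is authoritative
];

for (const { order, expected } of cases) {
  if (readStatus(order) !== expected) {
    throw new Error(`mixed-state read failed: ${JSON.stringify(order)}`);
  }
}
```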

3. Backfill jobs behave safely under interruption

Pause and resume behavior matters. If the backfill dies halfway, you need to know whether it can restart cleanly.

4. Contract and persistence behavior still hold

If the application boundary is affected, this is a good place for API integration tests that verify real writes, reads, and compatibility behavior. That is the same practical testing boundary discussed in How to Write API Integration Tests.


Monitoring During Rollout

Production migrations should be observed like any other high-risk change.

Watch for:

  • query latency changes
  • lock wait time
  • error rate by endpoint
  • database CPU and I/O
  • replication lag
  • fallback-read frequency
  • backfill progress and retry behavior

If the migration affects critical traffic, define rollback or pause conditions before you start.
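
Writing those conditions down can be as simple as a table of thresholds checked against live metrics. The metric names and numbers below are illustrative; pick them from your own baselines.

```javascript
// Pre-agreed pause conditions, decided before the rollout starts.
const pauseConditions = [
  { metric: 'p99_query_latency_ms', threshold: 250, action: 'pause backfill' },
  { metric: 'replication_lag_s', threshold: 10, action: 'pause backfill' },
  { metric: 'error_rate_pct', threshold: 1, action: 'halt and reassess' },
];

// samples is a map of metric name to current value
function shouldPause(samples) {
  return pauseConditions.some((c) => samples[c.metric] > c.threshold);
}
```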

“We will see how it looks” is not a rollout strategy.


Common Mistakes

These are the mistakes that repeatedly make schema changes riskier than they need to be:

  • coupling schema removal to the first deploy
  • assuming one deployment wave means one application version
  • backfilling without throttling or resumability
  • enforcing strict constraints before historical data is ready
  • treating rollback as a guaranteed escape hatch
  • forgetting background jobs, admin scripts, and consumers outside the main service
  • shipping the migration without production-oriented monitoring

Most migration failures are not caused by exotic database behavior. They come from sequencing mistakes and hidden dependencies.


Practical Rollout Checklist

Before shipping:

  1. Confirm whether the change is backward-compatible.
  2. Identify all readers and writers, including jobs and secondary services.
  3. Plan expansion, compatibility period, backfill, read switch, and contraction separately.
  4. Decide how to pause safely if metrics worsen.
  5. Test mixed-state data, not just final state.

During rollout:

  1. Expand first.
  2. Deploy compatibility code.
  3. Backfill gradually.
  4. Switch reads with monitoring.
  5. Remove old writes.
  6. Contract only after stability is proven.

This sequence is usually slower than teams want. It is also much faster than debugging a production outage caused by a “simple” migration.


Closing Reflection

Database migrations become dangerous when they are treated as one-time schema edits instead of staged compatibility changes.

Production systems rarely switch state all at once. Code versions overlap, data moves gradually, and rollback pressure appears at the worst possible time.

The expand-and-contract pattern works because it accepts that reality instead of fighting it.

That is what makes a migration safe: not the absence of change, but the ability to change while the system keeps running.