Architecture13 min read

How to Run Zero-Downtime Database Migrations on Postgres

Most outages I see at growth-stage SaaS are not feature bugs — they are migrations gone wrong. Here is the exact playbook I use to ship schema changes on Postgres without taking the database down.

Krishan K Agarwal

Senior System Architect & Fractional CTO

Updated May 2026Published Apr 2026

On this page

The fastest way to take down a SaaS in production is not a bad deploy. It is a migration that took an ACCESS EXCLUSIVE lock on a 50M-row table at 2pm on a Tuesday because someone wrote ALTER TABLE users ADD COLUMN ... DEFAULT 'foo'. Postgres dutifully started rewriting the entire table, every write blocked, the connection pool filled, the API timed out, and now there is an incident channel.

I have helped clean up exactly this situation more times than I can count. The good news: zero-downtime migrations on Postgres are a solved problem, but the solution is procedural — a sequence of small, boring steps — not a clever one-liner. This is the playbook I use, end to end, with the exact commands and a worked example at the end.

Why migrations cause downtime in the first place

Postgres migrations cause downtime for one of three reasons: the migration takes an exclusive lock on a hot table, it scans a huge table in a single transaction and blows out replication lag, or the deploy ships application code that assumes the new schema before the migration is complete. Every zero-downtime technique below addresses one of these three failure modes.

The lock issue is the most common. Postgres uses lock levels (ACCESS SHARE up to ACCESS EXCLUSIVE) and many DDL statements take stronger locks than developers expect. ALTER TABLE ADD COLUMN with a non-null default rewrites the whole table on Postgres versions before 11. CREATE INDEX (without CONCURRENTLY) blocks all writes. ADD CONSTRAINT NOT NULL scans the entire table while holding ACCESS EXCLUSIVE. These are landmines.

The expand-contract pattern

Expand-contract is the only migration pattern I trust on a hot production database. It splits a single 'rename this column' or 'change this type' migration into three or four small, independently deployable steps. Each step is reversible. No step requires the application and the database to ship at the same instant.

Expand. Add the new schema (column, index, table) alongside the old. Application still reads and writes the old shape. Ship and verify.
Dual-write. Update the application to write to both old and new columns on every insert and update. Old reads still work. Ship and verify.
Backfill. Run a batched job that copies historical rows from old to new, 1,000 to 10,000 rows per batch, with sleeps between batches to avoid replication lag.
Swap reads. Update the application to read from the new column. Old column is still being written but no longer read. Ship and verify for at least 24 hours.
Contract. Stop dual-writing, then drop the old column. This is the only irreversible step.

The discipline is to never compress two of these steps into one deploy. The whole point is that if step 4 reveals a bug, you can roll back without touching the database. If you ship dual-write and read-swap together, you have lost that property.

The Postgres-specific techniques you must know

Three techniques cover 90 percent of real migrations. Get these right and most schema changes become routine.

ADD COLUMN nullable, then backfill

On Postgres 11 and later, ADD COLUMN with a constant default is metadata-only and instant. ADD COLUMN with a volatile default (gen_random_uuid(), now()) still rewrites the table. Always add the column nullable, backfill in batches, then add the NOT NULL constraint via NOT VALID + VALIDATE CONSTRAINT to avoid the table scan under lock.

CREATE INDEX CONCURRENTLY (always)

Never run CREATE INDEX in production. Always CREATE INDEX CONCURRENTLY. It cannot run inside a transaction, it is roughly 2-3x slower than the locking version, and if it fails you are left with an INVALID index that you must drop and rebuild. Worth every bit of complexity.

Use pg_repack for table rewrites

VACUUM FULL takes an exclusive lock for the duration. On a 100 GB table that is hours of downtime. pg_repack achieves the same physical reorganization (reclaim bloat, change CLUSTER order) while reads and writes continue. Same for changing a column type when you cannot use the expand-contract path. This is one of the most underrated Postgres tools.

Tools that automate the pattern

You can hand-roll expand-contract on every migration, or you can use a tool that automates it. For small teams shipping a handful of migrations a month, hand-rolling is fine and educational. For teams shipping 50+ migrations a month or working with large hot tables, a tool pays for itself fast.

Tool / Approach	Locks writes?	Postgres-native?	Best for
Hand-rolled expand-contract	No (if done right)	Yes	Small teams, simple migrations, full control
pgroll (Xata)	No	Yes (uses views)	Schema changes with automatic rollback, complex column type changes
Reshape	No	Yes	Similar to pgroll, Rust-based, declarative migrations
gh-ost (MySQL)	No	MySQL only	Reference for MySQL teams; uses binlog instead of triggers
pg_repack	No	Yes	Table rewrites, bloat removal, column type changes via copy
Naive ALTER TABLE	Yes (often catastrophically)	Yes	Tables under 100K rows, off-peak only, never trust this in production

Comparison of zero-downtime migration tools and approaches for Postgres in 2026

pgroll is the tool I most often recommend in 2026. It uses Postgres views to expose the new and old schema simultaneously and applies changes via triggers, so old and new application code can run against the same database during a multi-day rollout. The trade-off is added complexity in your migration tooling — worth it for hot tables, overkill for an admin schema.

Worked example: migrating users.email to citext, 30M rows

A real migration from a recent client engagement. The team needed to make email lookups case-insensitive on a 30M-row users table. The naive path — ALTER COLUMN email TYPE citext — would have rewritten the table under ACCESS EXCLUSIVE for hours. Here is the actual playbook we shipped, with zero downtime, over six days.

Day 1 — Expand. ALTER TABLE users ADD COLUMN email_ci citext (nullable, no default). Instant metadata change. CREATE UNIQUE INDEX CONCURRENTLY users_email_ci_idx ON users (email_ci) WHERE email_ci IS NOT NULL. Took 47 minutes in the background, no impact on writes.
Day 2 — Dual-write. Deploy application code that writes both email and email_ci on every insert and update. New users now have both columns populated. Old rows still have email_ci NULL.
Day 2-4 — Backfill. Batched job: UPDATE users SET email_ci = email::citext WHERE id BETWEEN $1 AND $1 + 5000 AND email_ci IS NULL. 5,000 rows per batch, 200ms sleep between batches, ran for ~36 hours total. Replication lag never exceeded 800ms.
Day 4 — Validate. Confirm zero rows where email_ci IS NULL. Confirm zero rows where email_ci != email::citext. Spot-check application reads.
Day 5 — Swap reads. Deploy application code that reads from email_ci for lookups. Old email column still being written but no longer queried by the lookup path. Monitor for 24 hours.
Day 6 — Contract. Drop the old unique index on email. Stop writing to email in application code. ALTER TABLE users DROP COLUMN email. Done.

Total wall-clock time: six days. Total downtime: zero seconds. Total developer time: about two days of focused work spread over the week. Rollback was possible at every step until day 6. The same migration with a naive ALTER TABLE would have required a maintenance window of multiple hours.

Common migration anti-patterns

Mistakes I see repeatedly when auditing growth-stage SaaS architectures.

Running ALTER TABLE during peak hours. Even fast migrations can starve under high lock contention. Always migrate during low-traffic windows.
Backfilling in a single UPDATE statement. UPDATE users SET email_ci = email::citext touches 30M rows in one transaction, blows out WAL, and lags replicas by hours. Always batch.
Adding NOT NULL with a fresh default in one step. Use NOT VALID then VALIDATE CONSTRAINT to avoid the table scan under exclusive lock.
Forgetting that CREATE INDEX CONCURRENTLY can leave INVALID indexes behind. Always check pg_indexes for invalid entries after a CONCURRENTLY build that may have failed.
No lock_timeout. A migration that waits 20 minutes behind a long-running query is now blocking every other write that arrives during that wait.
Skipping the verification step between expand-contract phases. Each phase must be observed in production for at least an hour before moving to the next.
Treating the application deploy and the database migration as a single atomic event. They are not, and pretending otherwise is how you ship deploys that fail in the gap between schema change and code change.

When to bring in help

Most teams can run zero-downtime migrations themselves once they have the playbook. The cases where I get called in are usually the high-stakes ones: a migration on a 100M+ row table, a column type change that breaks dual-write semantics, or a database that has never had a migration framework and needs one introduced without breaking anything.

If you are sitting on a migration you have been afraid to run for three weeks, that is a signal. The longer you wait, the more code you ship that depends on the old schema, and the more painful the eventual change becomes. Pair this playbook with my multi-tenant SaaS architecture guide if your migration also crosses tenant boundaries — those add their own complications.

Frequently asked questions

Can I really run schema migrations on a 30M-row table without downtime?

Yes, on Postgres, with the right pattern. The key rules: never block writes (use ADD COLUMN nullable, CREATE INDEX CONCURRENTLY, never ALTER TABLE on a hot table), backfill in batches of 1,000 to 10,000 rows, and use the expand-contract pattern so old and new code can run side by side during the transition. With 30M rows, the full process typically takes 2 to 6 hours of background work and zero seconds of downtime.

What is the expand-contract migration pattern?

Expand-contract has three phases. Expand: add the new schema (column, table, index) without removing anything. Migrate: dual-write to both old and new shape, backfill historical data, validate. Contract: switch reads to the new shape, then drop the old one. Each phase ships independently and is reversible, which is what makes it safe.

When should I use pgroll or Reshape instead of writing it myself?

Use a tool when the migration affects a hot, large, or write-heavy table and you do not want to hand-roll the dual-write logic. pgroll (Xata) automates the expand-contract pattern via Postgres views and is genuinely useful at scale. Reshape does similar work. For a small table (under 1M rows) on a low-traffic schema, the manual pattern is fine and more transparent.

What is the difference between CREATE INDEX and CREATE INDEX CONCURRENTLY?

CREATE INDEX takes an ACCESS EXCLUSIVE lock on the table for the duration of the build, which means no writes for minutes or hours on a large table. CREATE INDEX CONCURRENTLY runs in the background, takes a much weaker lock, and lets writes continue. It is roughly 2 to 3 times slower in wall-clock time but does not cause an outage. Always use CONCURRENTLY in production.

What does pg_repack do that VACUUM FULL does not?

VACUUM FULL rewrites a table to reclaim bloat but takes an ACCESS EXCLUSIVE lock the entire time — outage. pg_repack does the same physical reorganization while letting reads and writes continue, by maintaining a shadow table and applying changes via triggers, then atomically swapping. For tables over a few million rows on a production database, pg_repack is the only sane option.

DatabasePostgresMigrations

Architecture

Postgres vs MongoDB for Startups: Don't Get This Wrong

Most startups pick MongoDB because it sounds modern, then spend 18 months regretting it. Here is the honest case for Postgres in 2026, where MongoDB still wins, and the migration cost when you switch.

10 min readRead

Architecture

Monolith vs Microservices for Early-Stage Startups (2026 Honest Take)

Microservices kill more startups than they save. Ninety-five percent of seed and Series A companies should ship a modular monolith. Here is the honest breakdown of when each architecture wins.

12 min readRead

Architecture

7 Architecture Mistakes That Kill Startups (and How to Avoid Them)

After auditing more than thirty startup codebases, the same seven mistakes show up over and over. Each is fixable cheap on day one and brutal once you have customers.

12 min readRead

Want a senior eye on your stack?

If you are scoping an MVP, scaling a SaaS, or staring at an inherited codebase, book a 30-minute call. No pitch deck required.

Book a strategy call See architecture audit

Why migrations cause downtime in the first place

The expand-contract pattern

The Postgres-specific techniques you must know

ADD COLUMN nullable, then backfill

CREATE INDEX CONCURRENTLY (always)

Use pg_repack for table rewrites

Tools that automate the pattern

Worked example: migrating users.email to citext, 30M rows

Common migration anti-patterns

When to bring in help

Frequently asked questions

Related articles

Postgres vs MongoDB for Startups: Don't Get This Wrong

Monolith vs Microservices for Early-Stage Startups (2026 Honest Take)

7 Architecture Mistakes That Kill Startups (and How to Avoid Them)

Want a senior eye on your stack?