How to seed a Postgres database: the 5 ways teams actually do it
An honest tour of the five common ways engineering teams seed a Postgres database for local dev and tests — what each costs, when each breaks, and how to pick one without regretting it three sprints later.
Every team eventually has the same conversation. Someone joins, runs the app locally, opens it in a browser, and stares at an empty screen. No users. No orders. No anything. They Slack the channel: "what do you do for test data?"
And someone — usually whoever's been there longest — sighs.
There are basically five answers to that question. I've used all of them. They all work. They all also have a specific way of going wrong that you only learn once you're three months in. So if you're picking a strategy for the first time, or you've got one that's quietly bleeding hours every week, here's the honest tour.
We'll keep this scoped to Postgres for local development and test environments — not staging fixtures, not production backfills, not load testing. Just the everyday "I need a database with stuff in it so I can build the feature" problem.
The five ways
- Hand-rolled seed.sql
- Factories in your test framework
- pg_dump from production, then sanitize
- A Faker script
- A schema-aware generator
I'll go through each, then end with a small decision matrix for picking between them.
1. Hand-rolled seed.sql
The classic. A SQL file in the repo, full of INSERT statements, run after migrations.
INSERT INTO users (id, email, name) VALUES
(1, 'alice@example.com', 'Alice'),
(2, 'bob@example.com', 'Bob');
INSERT INTO orders (id, user_id, total) VALUES
(1, 1, 4200);
Why teams reach for it: zero dependencies. It's just SQL. New devs can read it and immediately understand what data exists. Reproducible. Diff-able in PRs.
Where it breaks: the day someone adds a NOT NULL column on a Friday afternoon.
The file is tightly coupled to the schema. Every migration that touches a column or constraint is also a migration that has to update seed.sql — and humans forget. So you end up in this state where the seed file works on main but breaks on three open branches, or it works on your branch but conflicts with the one that just merged. The pain compounds with team size.
It also does nothing for breadth. Five users and three orders is enough to log in. It's not enough to find the bug where the orders table renders weirdly when a user has more than 50.
Use it if: you're early, the schema is small, you've got fewer than four contributors, and you don't need realistic volume.
Skip it if: the schema changes more than once a week or you've already had two "Friday seed-script" incidents.
2. Factories in your test framework
The pattern most test suites end up with. You define a factory per table — userFactory(), orderFactory(user) — and tests build the data they need at the start of each case.
const user = await userFactory({ email: "alice@example.com" });
const order = await orderFactory({ userId: user.id, total: 4200 });
Why teams reach for it: test isolation. Each test owns its data, so tests don't leak into each other. Refactor-friendly because the factory is code, not SQL — when the schema changes, you fix one function instead of grepping INSERT statements.
Where it breaks: local development.
Factories are designed for the test lifecycle: build a tiny world, assert against it, tear it down. That's the wrong shape for "I want to open the app and click around." For local dev, you want a standing database with a coherent body of data. Bolting a "seed mode" onto factories is doable but always feels like you're using the wrong tool — you end up writing a script that calls factories in a particular order, and now you've reinvented seed.sql with extra steps.
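To make that concrete, here's a minimal sketch of what the "seed mode" script tends to look like, assuming the userFactory and orderFactory helpers from above insert rows and return the created records (the import path is hypothetical):

// seed-dev.ts: a factory-driven dev seed, hand-ordered by dependency.
import { userFactory, orderFactory } from "./factories"; // hypothetical path

async function seedDev() {
  for (let i = 0; i < 20; i++) {
    const user = await userFactory({ email: `user${i}@example.com` });
    for (let j = 0; j < 3; j++) {
      // Every new table means another hand-ordered block here, which is
      // the seed.sql maintenance problem restated in JavaScript.
      await orderFactory({ userId: user.id, total: 1000 * (j + 1) });
    }
  }
}

seedDev();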
The other quiet cost: factories tend to produce data that's just realistic enough to pass a test. email = "test1@example.com". name = "Test User". Which is fine for assertions and miserable for catching the visual bugs that only show up when names are 47 characters with an accent.
Use it if: you already have it for tests. (You probably do, and you should.)
Skip it as your primary dev seeding strategy — it'll always feel like a side effect.
3. pg_dump from production, then sanitize
This one is so tempting. Real data. All the edge cases your factories never thought of. The exact distributions of the actual users you have.
The recipe is roughly: pg_dump from prod, run a sanitization script that masks emails / scrambles names / nukes payment info, share the resulting .sql (or .dump) somewhere your team can pull it.
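As a rough sketch of the sanitize step, here's what the masking pass might look like with the node-postgres client, run against the restored copy rather than production itself. The table and column names are assumptions; a real script has to cover every PII column, which is exactly where this strategy gets risky:

// sanitize.ts: mask PII in the restored copy of the dump.
import { Client } from "pg";

async function sanitize() {
  const db = new Client({ connectionString: "postgres://localhost/prod_copy" });
  await db.connect();

  // Mask emails and names while keeping rows and foreign keys intact.
  await db.query(`
    UPDATE users
    SET email = 'user_' || id || '@example.com',
        name  = 'User ' || id
  `);

  // Nuke anything payment-shaped outright (columns assumed for illustration).
  await db.query(`UPDATE payment_methods SET card_last4 = NULL, token = NULL`);

  await db.end();
}

sanitize();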
Why teams reach for it: it works on day one. Bugs reproduce because the data is the data. Onboarding looks magical.
Where it breaks: several places, in increasing order of severity.
- Schema drift. The dump is a snapshot. Two weeks later the schema has moved on, the dump has a column that no longer exists, and the dev who runs psql < dump.sql gets a 200-line error.
- Size. Your prod DB is 80 GB. Nobody wants to download 80 GB to run a feature flag locally. So you start writing scripts to shrink it: keep 1% of users, then their orders, then their line items, plus a referentially-consistent slice of the rest. Congratulations, you are now maintaining a homegrown synthetic-data tool.
- Compliance. This is the one that should make you stop. The moment you copy production data onto laptops, you have created a new attack surface. One missed PII column in your sanitize script (an address_line_2, a notes field, a metadata JSONB blob) and the most sensitive thing in your company is sitting in ~/Downloads on three different machines. GDPR fines can reach 4% of global annual revenue. SOC 2 auditors will not like this conversation.
If your data is meaningfully sensitive (and most B2B SaaS data is), this strategy has a ceiling that gets lower every year as compliance tightens.
Use it if: you're pre-PMF, the data isn't sensitive, and you don't have EU users.
Skip it if: you handle PII, payment data, health data, or you're planning to be in business in two years.
4. A Faker script
The middle path. Write a Node/Python/Ruby script that loops through your tables and inserts rows using a library like Faker.js or Faker (Python).
for (let i = 0; i < 100; i++) {
await db.users.insert({
email: faker.internet.email(),
name: faker.person.fullName(),
});
}
Why teams reach for it: more realistic than seed.sql, more volume than factories, no compliance risk. You can produce 10,000 users with believable names and emails in a few seconds.
Where it breaks: foreign keys.
A Faker script doesn't know your schema. You know your schema, and you've encoded that knowledge by hand in the order you call the inserts: users first, then orders, then line items. That's fine on day one. It is not fine on day 200, after you've added thirty more tables and eight join tables and one self-referential manager_id on the employees table. By then your script is a 600-line dependency-ordering puzzle that one person on the team understands and everyone else is afraid to touch.
You also rebuild it every time the schema changes. New table? Add a section. Renamed column? Find the three places it's referenced. New NOT NULL constraint on a column you weren't populating? Production-shaped fire.
And like factories, Faker scripts tend to produce data that's generic-realistic. Every email matches firstname.lastname@example.com. Every postal code is a US ZIP. Every phone number is (555) 123-4567. Good enough until the day you ship to Germany and discover none of the dev data has a non-ASCII character anywhere.
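If you stay with a Faker script, the locale problem at least has a cheap partial fix: Faker.js ships locale-specific instances, so a sketch like the one below mixes non-ASCII names and non-US postal codes into the dataset without restructuring anything.

// Locale-specific instances generate names, addresses, and phone numbers
// in that locale's shape, umlauts and all.
import { faker, fakerDE, fakerJA } from "@faker-js/faker";

const locales = [faker, fakerDE, fakerJA];

for (let i = 0; i < 100; i++) {
  // Rotate locales per row so the data isn't uniformly US-shaped.
  const f = locales[i % locales.length];
  console.log(f.person.fullName(), f.location.zipCode(), f.phone.number());
}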
Use it if: your schema is mostly stable, FK relationships are simple, and one person is willing to own the script.
Skip it if: you've already added the same column to the script three times this quarter.
5. A schema-aware generator
The newest category. Tools that read your live schema, infer what each column should contain (from name, type, and constraints), and generate a referentially-consistent dataset for you. No hand-maintained scripts. No dump files.
This is the category seedkit lives in, alongside Seedfast (Postgres-only, AI-based), Tonic, Gretel, and Mostly AI (the enterprise end of the market — pricing starts in the five figures).
The shape of the workflow:
npx seedkit --url "postgres://localhost/myapp_dev" --rows 5000
The tool introspects your schema, sorts tables by foreign-key dependencies (handling cycles, composite keys, deferred constraints — the gnarly stuff), picks a generator per column based on name and type, and inserts. You get a database that looks like your schema as it exists today, not your schema as it existed when someone last updated seed.sql.
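To give a feel for what "sorts tables by foreign-key dependencies" means in practice, here's a simplified sketch of the introspection step using the node-postgres client. It deliberately skips the gnarly cases (composite keys, deferred constraints, cross-schema references) that a real tool has to handle:

// fk-order.ts: read foreign-key edges from information_schema and order
// tables so that every parent is inserted before its children.
import { Client } from "pg";

async function insertionOrder(url: string): Promise<string[]> {
  const db = new Client({ connectionString: url });
  await db.connect();

  const tables = await db.query(`
    SELECT table_name FROM information_schema.tables
    WHERE table_schema = 'public' AND table_type = 'BASE TABLE'
  `);

  // One row per FK: the child table and the parent table it references.
  const fks = await db.query(`
    SELECT tc.table_name AS child, ccu.table_name AS parent
    FROM information_schema.table_constraints tc
    JOIN information_schema.constraint_column_usage ccu
      ON ccu.constraint_name = tc.constraint_name
    WHERE tc.constraint_type = 'FOREIGN KEY'
  `);
  await db.end();

  const names: string[] = tables.rows.map((r) => r.table_name);
  const order: string[] = [];
  const done = new Set<string>();

  // Kahn-style pass: repeatedly emit tables whose parents are all emitted.
  // Self-references (e.g. employees.manager_id) are allowed through.
  while (order.length < names.length) {
    const ready = names.filter(
      (t) =>
        !done.has(t) &&
        fks.rows.every(
          (e) => e.child !== t || e.parent === t || done.has(e.parent)
        )
    );
    if (ready.length === 0) {
      throw new Error("FK cycle: a real tool needs deferred constraints here");
    }
    for (const t of ready) {
      done.add(t);
      order.push(t);
    }
  }
  return order;
}

insertionOrder("postgres://localhost/myapp_dev").then(console.log);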
Why teams reach for it: zero maintenance after the schema changes. The next migration just works, because the tool re-reads the schema on every run. No PII risk because no production data is involved. Realistic enough for visual / volume bugs because the data is generated against actual constraints — NOT NULL is honored, CHECK constraints are honored, foreign keys point to rows that exist.
Where it breaks: wherever the tool's heuristics don't match your domain.
A column called email is easy. A column called metadata is hard — it's a JSONB blob and the tool has no idea what shape to put in it. Most schema-aware generators give you an escape hatch (a config file, a per-column override, a callback) for the columns where the heuristics misfire. The good ones make this rare; the bad ones make it constant.
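The escape hatch usually looks something like a per-column override map. This is a hypothetical shape, not any specific tool's API, but it captures the idea: you describe only the columns the heuristics get wrong and let the tool handle the rest.

// Hypothetical override config: custom generators only for the columns
// where name/type inference can't guess the right shape.
const overrides = {
  "users.metadata": () => JSON.stringify({ plan: "trial", seats: 3 }),
  "orders.status": () => ["pending", "paid", "refunded"][Math.floor(Math.random() * 3)],
};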
The other consideration is determinism. If you re-run the tool, do you get the same data? You probably want yes — same data on every machine means bugs reproduce — but not every tool in this category is deterministic. Worth checking before you commit.
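The determinism check itself is cheap, whether you're evaluating a tool or your own script: fix the seed, run twice, diff the output. With Faker.js it's a single call:

import { faker } from "@faker-js/faker";

// Same seed, same sequence: two runs print identical values, which is the
// property that lets a bug found on one machine reproduce on another.
faker.seed(42);
console.log(faker.internet.email(), faker.person.fullName());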
Use it if: your schema changes regularly, you have foreign keys, you have compliance constraints, or you've already tried two of the strategies above and lost faith.
Skip it if: your schema is five tables and isn't going to grow.
How to pick
Quick gut-check matrix:
| If your team... | Probably use |
|---|---|
| Is 1–3 people, schema rarely changes | seed.sql |
| Has a strong test suite already | Factories for tests, plus something else for dev |
| Handles no sensitive data, pre-PMF | pg_dump + sanitize (with an exit plan) |
| Has a stable schema and one motivated owner | Faker script |
| Has a moving schema, FKs, or compliance | Schema-aware generator |
The honest pattern I see most often: teams start with seed.sql, graduate to a Faker script around month six, hit a wall around month eighteen when the FK graph gets gnarly, and then go shopping for a schema-aware tool. You can save yourself the middle two stops if you know the shape of where you're heading.
A note on what we actually did
If you've read this far you probably already guessed that I built seedkit because I'd been through the same arc twice and didn't want to do it a third time. It's the schema-aware option for the indie/startup tier — the gap between Faker (free, brittle) and Tonic (fantastic, $50k/year). One CLI command, FK-correct, deterministic, no production data ever.
But the goal of this post wasn't to sell you that. The goal was to give you the map. Whichever of the five you pick, pick it on purpose — not because it was already there when you joined.
Good luck with the empty database.