Most columns never see an LLM
How seedkit's four-tier strategy stack handles most columns with rule-based fakers — and why the LLM is a last resort, not a first instinct. Plus the deterministic shim, the cache key recipe, and what the approach doesn't solve.
Here's a thing people assume about seedkit: that every cell in your seeded database came out of an LLM. Schema's got 8 tables, 50 columns, you ran one command — surely the model wrote all of it.
Most of it never went near a model. A typical 14,000-row dataset on my own laptop hits the LLM for maybe fifteen thousand cells out of nearly three hundred thousand. The other ~95% comes out of code so boring I'd be embarrassed to show you.
That's the whole secret to making this fast, cheap, and deterministic. The boring layer carries almost everything.
The strategy stack
For every column in your schema, seedkit picks a generator from a four-tier ladder. The first tier that fits wins.
Tier 1: typed-name match
The column is named email and typed text. There's a faker for that. Pick faker.internet.email(), lowercase it, done.
This sounds dumb. It's the most powerful tier. The space of column names devs actually write is small — email, phone, first_name, street_address, city, country, created_at, slug, ip, password_hash, avatar_url, a few dozen more. Every one of those gets a one-line rule. About half of every schema I've seen hits Tier 1. Free. Deterministic. Done in microseconds.
The trick is being conservative. name alone doesn't match — could be a person's name, could be a category name, could be a tag. We require user-table context (email or username also present in the same table) before treating name as a person. Better to skip a column you could have faked than to fake one you shouldn't have.
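A minimal sketch of what a Tier 1 rule table can look like, in TypeScript. The `ColumnInfo` shape, the `FIRST_NAMES` list, and the `tier1` helper are illustrative stand-ins, not seedkit's actual internals; the point is the shape — match predicates that read sibling-column context, and a null fall-through to the next tier:

```typescript
// Hypothetical Tier 1 sketch. rng is a uniform draw in [0, 1),
// fed by the seeded PRNG so picks are reproducible.
type ColumnInfo = { name: string; sqlType: string; siblingColumns: string[] };
type Generator = (rng: () => number) => string;

const FIRST_NAMES = ["ada", "grace", "alan", "edsger"];

function pick<T>(xs: T[], rng: () => number): T {
  return xs[Math.floor(rng() * xs.length)];
}

const TIER1_RULES: Array<{ match: (c: ColumnInfo) => boolean; gen: Generator }> = [
  {
    match: (c) => c.name === "email" && c.sqlType === "text",
    gen: (rng) =>
      `${pick(FIRST_NAMES, rng)}${Math.floor(rng() * 1000)}@example.net`.toLowerCase(),
  },
  {
    // Conservative: bare `name` only counts as a person's name when the
    // table also looks like a user table (email or username present).
    match: (c) =>
      c.name === "name" &&
      c.siblingColumns.some((s) => s === "email" || s === "username"),
    gen: (rng) => pick(FIRST_NAMES, rng),
  },
];

function tier1(c: ColumnInfo): Generator | null {
  for (const r of TIER1_RULES) if (r.match(c)) return r.gen;
  return null; // no confident rule: fall through to Tier 2
}
```

Returning null instead of guessing is the conservatism in code form: a `name` column in a table with only `id` and `title` siblings falls straight through.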
Tier 2: type-shape inference
For columns Tier 1 doesn't catch but where the type still tells us something: numeric(10, 2) NOT NULL CHECK (amount > 0) is clearly money; int CHECK (rating BETWEEN 1 AND 5) is a rating; jsonb with a known shape gets a structurally valid object; an ENUM gets one of its values uniformly at random.
This tier is mostly the type system doing the work. We just have to read it carefully — including the constraints, including the check expressions, including the comment if you wrote one.
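A toy version of that careful reading, in TypeScript. The regexes and the `Inferred` shapes are assumptions for illustration — a real implementation would parse the DDL properly rather than pattern-match strings — but the flow is the same: type plus constraint in, classified generator out, null when nothing fits:

```typescript
// Hypothetical Tier 2 sketch: classify a column from its SQL type
// and (optionally) its CHECK expression.
type Inferred =
  | { kind: "money"; min: number }
  | { kind: "rating"; min: number; max: number }
  | { kind: "enum"; values: string[] }
  | null;

function tier2(sqlType: string, check?: string): Inferred {
  // int + BETWEEN bounds reads as a rating-style bounded integer.
  const between = check?.match(/BETWEEN\s+(\d+)\s+AND\s+(\d+)/i);
  if (/^int/i.test(sqlType) && between) {
    return { kind: "rating", min: Number(between[1]), max: Number(between[2]) };
  }
  // numeric with two decimal places and a positivity check reads as money.
  if (/^numeric\(\d+,\s*2\)/i.test(sqlType) && check && />\s*0/.test(check)) {
    return { kind: "money", min: 0.01 };
  }
  // An enum type yields its own value list (uniform pick downstream).
  const enumMatch = sqlType.match(/^enum\((.+)\)$/i);
  if (enumMatch) {
    return { kind: "enum", values: enumMatch[1].split(",").map((v) => v.trim()) };
  }
  return null; // nothing confident: Tier 3 or 4 takes over
}
```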
Tier 3: domain LLM prompt
The column is bio, the type is text, and there's no rule that knows what a bio looks like. Now we ask the model — but not for one bio. We batch the column for, say, 50 rows, send the LLM the table context (so it knows the bio belongs to a "user" with name and job_title already filled), and ask for 50 bios in a JSON array.
This is the only tier that costs money, and it's the smallest one in practice. On a typical CRM schema, Tier 3 columns are usually under 10% of total cells.
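To make the batching concrete, here is a sketch of what a Tier 3 request can look like — one prompt covering 50 rows, with the already-filled columns as context. The prompt wording and `buildColumnPrompt` helper are illustrative, not seedkit's actual prompt:

```typescript
// Hypothetical Tier 3 prompt builder: one LLM call per column per batch,
// not one per cell. The already-generated columns ride along as context.
type Row = Record<string, string>;

function buildColumnPrompt(table: string, column: string, rows: Row[]): string {
  const context = rows
    .map((r, i) => `${i + 1}. ${JSON.stringify(r)}`)
    .join("\n");
  return [
    `You are generating seed data for the "${table}" table.`,
    `For each of the ${rows.length} rows below, write a realistic value for "${column}".`,
    `Rows (already-filled columns, for context):`,
    context,
    `Reply with a JSON array of exactly ${rows.length} strings, in row order.`,
  ].join("\n");
}
```

One call per 50 rows instead of 50 calls is most of why the Tier 3 bill stays in cents: the shared table context is sent once, and the response is a single parseable array.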
Tier 4: foreign-key lookup
The column is a UUID and references another table. Tier 4 reads the already-inserted parent rows and picks one. Topo-sorted insert order means the parent always exists by the time the child needs it.
The picker isn't uniform — a few parents end up with lots of children and most get a few. Uniform random would be its own kind of fake; real datasets cluster.
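One cheap way to get that clustering is to bias the random draw before indexing into the parent list. The squaring trick below is an assumption — the text doesn't say which skew seedkit actually uses — but it shows the shape: low indices get picked far more often than a uniform draw would give them:

```typescript
// Hypothetical skewed FK picker. Squaring a uniform draw in [0, 1)
// piles density near 0, so a few early parents collect many children
// while most parents get only a handful.
function pickParent<T>(parents: T[], rng: () => number): T {
  const u = rng();      // uniform in [0, 1)
  const biased = u * u; // P(biased < x) = sqrt(x): heavy near zero
  return parents[Math.floor(biased * parents.length)];
}
```

With this skew, half of all draws land in the first quarter of the parent list — a crude stand-in for the long-tailed child counts real datasets show.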
Why the LLM is the last resort
The naive version of seedkit calls the LLM for every cell. Three reasons that's a bad idea.
Cost. Run a 14k-row generation for a real schema where every cell goes through the model and you're looking at dollars per run, easily double digits at scale. Tier 3 cuts the LLM-touched cell count to maybe 10% of the total, batched 50 at a time, which lands the actual LLM bill in cents. Run seedkit on every PR on every team and the naive version's per-run cost compounds into real money; the strategy stack makes it a rounding error.
Determinism. LLMs at temperature > 0 are non-deterministic by design. At temperature 0 they're almost deterministic, but providers reserve the right to change the underlying model behind any given identifier — and they do, regularly. Build your reproducibility on "the LLM said X" and your fixtures will silently shift the next time Anthropic or OpenAI deploys a quietly-different revision. The strategy stack pins outputs in our own cache, so model drift doesn't propagate to your CI.
Quality. This one's counterintuitive. LLMs make worse fake emails than faker.internet.email(). They over-pattern-match: half the addresses come back john.doe@example.com or some flavor of that, because that's what they saw in training data. A proper faker library produces output that's slightly more random and slightly more realistic-feeling, because it's drawing from name lists and TLD lists that humans built for exactly this purpose. The model should fill the gaps the fakers can't — bios, descriptions, free-text — not the gaps they already cover.
Determinism, the boring half
People notice "deterministic" in the seedkit pitch and assume it means the LLM is being clever. It isn't. The LLM is the most non-deterministic thing in the system. Determinism comes from the layer underneath.
Every random choice — which faker entry, which LLM completion to keep when we asked for 50 bios but only need 47, which parent to FK to, which JSON shape to fill — runs through a single PRNG seeded from the seed name. We use xoroshiro128+ specifically because it's fast, has a long period, and the state is a couple of u64s we can serialize cheaply.
The seed name (my-fixture) becomes a PRNG state via SHA-256: take the first 16 bytes of the hash, split into two u64s, that's your initial state. Two devs running --seed my-fixture against the same schema get the same PRNG sequence. Same sequence means same faker picks, same FK choices, same row order, same bytes.
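The seeding path described above can be sketched in a few lines of TypeScript. The state-update constants (55, 14, 36) follow the published xoroshiro128+ reference; the class and function names here are illustrative, not seedkit's internals:

```typescript
import { createHash } from "node:crypto";

// seed name -> SHA-256 -> first 16 bytes -> two u64 state words -> xoroshiro128+
const M = (1n << 64n) - 1n; // 64-bit mask
const rotl = (x: bigint, k: bigint): bigint => ((x << k) | (x >> (64n - k))) & M;

class Xoroshiro128Plus {
  constructor(private s0: bigint, private s1: bigint) {}
  next(): bigint {
    const out = (this.s0 + this.s1) & M; // the "+" in xoroshiro128+
    const s1 = this.s1 ^ this.s0;
    this.s0 = (rotl(this.s0, 55n) ^ s1 ^ ((s1 << 14n) & M)) & M;
    this.s1 = rotl(s1, 36n);
    return out;
  }
}

function fromSeedName(seed: string): Xoroshiro128Plus {
  const digest = createHash("sha256").update(seed).digest();
  // First 16 bytes of the hash split into the two 64-bit state words.
  return new Xoroshiro128Plus(digest.readBigUInt64BE(0), digest.readBigUInt64BE(8));
}
```

Because SHA-256 of any seed name is effectively never all-zero, the "don't seed xoroshiro with zero state" footgun takes care of itself, and the serializable two-u64 state is what makes checkpointing cheap.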
The LLM still runs once per unique cache miss, but its output gets pinned the first time and replayed from cache after that. So the practical guarantee is: the first run with a given seed has some non-determinism (which LLM completion landed in the cache), and every run after is byte-identical. In CI, whoever ran --seed my-fixture first establishes the canonical bytes; everyone else with --from-cache matches them.
That's the part I built this for, the reason it isn't a faker wrapper. Same data on every laptop and every CI runner, without paying the LLM bill twice.
The cache key
The cache stores generated SQL keyed by sha256(schema_normalized + seed_name + generator_versions).
schema_normalized is the parsed schema rewritten in a stable canonical form — column order normalized, type aliases expanded, constraint clauses sorted. So varchar and text differ but two equivalent CREATE TABLE statements with cosmetic reordering hash the same.
seed_name is what you passed via --seed. Empty string is also valid; that's the "I don't care, just give me data" mode that doesn't get cached.
generator_versions is a per-release pin — bump the strategy-stack code or the prompt for Tier 3 and the version bumps with it. Old cached entries stay valid until something real changes about how rows get generated, at which point the key changes and the cache transparently regenerates.
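The recipe is short enough to sketch end to end. The `normalizeSchema` below is a toy stand-in — the real normalization reorders columns, expands type aliases, and sorts constraint clauses, while this version just collapses whitespace and case — but it's enough to show why cosmetic DDL differences hash identically:

```typescript
import { createHash } from "node:crypto";

// Toy normalization: real canonicalization parses the DDL; this just
// makes whitespace and case cosmetic.
function normalizeSchema(ddl: string): string {
  return ddl.toLowerCase().replace(/\s+/g, " ").trim();
}

// sha256(schema_normalized + seed_name + generator_versions), hex-encoded.
function cacheKey(schemaDdl: string, seedName: string, generatorVersions: string): string {
  return createHash("sha256")
    .update(normalizeSchema(schemaDdl))
    .update(seedName)
    .update(generatorVersions)
    .digest("hex");
}
```

Any of the three inputs changing changes the key, which is exactly the invalidation story: edit the schema, pick a new seed name, or ship a new generator version and the cache regenerates; touch nothing and it replays.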
The thing that makes this work for teams isn't the hash function. It's that the cache is org-scoped. Whoever pays the LLM bill once primes it for everyone in their org, and --from-cache turns into a free, deterministic, byte-stable seed for CI.
This is also the answer to "but the LLM is non-deterministic, so how can --from-cache reproduce?" By the time --from-cache is doing anything, the LLM has already been called, the bytes have already been frozen, and we're just replaying SQL.
What this doesn't solve
A few failure modes I should name.
The strategy stack doesn't help when your domain is genuinely esoteric. If your schema is for a logistics company shipping tropical fish, and the model has seen ten total tweets about tropical-fish logistics, the bios it generates will be uncanny. There's no clean fix other than "use the prompt" — seedkit seed --scope "tropical fish logistics" adds context that makes Tier 3 less random.
It also doesn't reproduce the statistical shape of real data. Your prod table might have a 90/10 split between two enum values; seedkit gives you something closer to 50/50 unless you tell it otherwise. For some kinds of testing (analytics, A/B tooling, anomaly detection) that matters. We don't expose distribution hints yet — that's on the list.
And it doesn't generate realistic geographic clustering, time-series patterns, or anything where rows depend on each other in ways that aren't foreign keys. Cells are mostly independent samples; reality usually isn't.
Try it
npx @seedkit-dev/cli new --prompt "describe what you're building"
If you want to dig in, the concepts docs cover each piece in more depth, and the CLI is on GitHub. Bug reports, ideas, war stories: ben@seedkit.dev.