Synthetic data
How seedkit decides what to put in each column — the layered generator stack, deterministic PRNG, and what makes the data "realistic."
Seedkit's data generator uses a layered strategy: cheap, deterministic heuristics first; LLM prompts only where they add value. For each column, we pick the highest-fidelity generator that applies.
The generator stack
For each column, generators are tried in this order. The first one that applies wins:
- Typed-name match. If the column name is an exact match for a known semantic type (email, phone, iso_country, currency, iban, bic, slug), use the matching generator directly. Cheap, deterministic, no model call.
- Type-shape inference. For jsonb, array, enum, and other structured types, infer a shape from the surrounding schema and generate structurally-valid values.
- Domain LLM prompt. For free-text and ambiguous columns, prompt the model with the table name, column name, sibling-column context, and (if set) the --scope hint. The model returns a small batch of values; we shuffle them with the per-seed PRNG and assign.
- FK lookup. Foreign-key columns sample from already-inserted parent rows, in topo-sorted order.
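The "first one that applies wins" rule can be pictured as an ordered list of checks. The Python sketch below is illustrative only and is not seedkit's actual code: the `Column` class, the `pick_generator` function, the type sets, the decision to keep foreign-key columns away from the LLM step, and the `"fallback"` label are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Column:
    """Hypothetical column description, not seedkit's real schema model."""
    name: str
    sql_type: str
    references: Optional[str] = None  # parent table name if this is a foreign key


# Semantic types taken from the list above; the other two sets are assumptions.
TYPED_NAMES = {"email", "phone", "iso_country", "currency", "iban", "bic", "slug"}
STRUCTURED_TYPES = {"jsonb", "array", "enum"}
FREE_TEXT_TYPES = {"text", "varchar"}


def pick_generator(col: Column) -> str:
    """Return the label of the first generator whose check matches the column."""
    checks = [
        ("typed-name", lambda c: c.name in TYPED_NAMES),
        ("type-shape", lambda c: c.sql_type in STRUCTURED_TYPES),
        # Assumption: foreign keys are never sent to the model.
        ("domain-llm", lambda c: c.references is None and c.sql_type in FREE_TEXT_TYPES),
        ("fk-lookup", lambda c: c.references is not None),
    ]
    for label, applies in checks:
        if applies(col):
            return label
    return "fallback"  # Assumption: the source does not say what happens here.


for col in [
    Column("email", "text"),
    Column("settings", "jsonb"),
    Column("notes", "text"),
    Column("account_id", "uuid", references="accounts"),
]:
    print(col.name, "->", pick_generator(col))
```

Keeping the checks in a single ordered list makes the priority explicit: reordering the list is the only way to change which generator wins.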
Determinism
Sampling uses a per-seed PRNG (xoroshiro128+, seeded with a value derived from the seed name). Two runs with the same seed name and schema produce byte-identical inserts. There's no model-temperature roulette and no "regenerate to get something usable."
seedkit new --prompt "saas crm" --seed crm-demo
seedkit new --prompt "saas crm" --seed crm-demo
# Same rows. Same order. Same bytes.
See Determinism for the full guarantee.
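To illustrate why the same seed name yields the same rows, here is a minimal Python sketch of deriving a PRNG from a name and using it to shuffle a batch of values. It is a stand-in, not seedkit's implementation: it uses Python's built-in Mersenne Twister (`random.Random`) rather than xoroshiro128+, and the SHA-256 derivation and the `rng_for_seed` and `assign` helpers are assumptions made for the example.

```python
import hashlib
import random


def rng_for_seed(seed_name: str) -> random.Random:
    """Build a PRNG whose state depends only on the seed name.

    hashlib is used instead of the built-in hash(), which is salted
    per process for strings and would break reproducibility.
    """
    digest = hashlib.sha256(seed_name.encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def assign(values: list[str], seed_name: str) -> list[str]:
    """Return the values in a shuffled order determined by the seed name."""
    shuffled = list(values)  # copy so the caller's batch is left untouched
    rng_for_seed(seed_name).shuffle(shuffled)
    return shuffled


batch = ["Renewal call", "Sent pricing deck", "Waiting on legal", "Demo booked"]
first = assign(batch, "crm-demo")
second = assign(batch, "crm-demo")
print(first == second)  # True: same seed name, same order
```

The point of the sketch is that all randomness flows from one value computed from the seed name, so nothing depends on wall-clock time or process state.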
What "realistic" means
- Names look like names, and are locale-aware when --scope indicates locale (German addresses look German, Japanese names look Japanese).
- Emails end in plausible domains and match the org/company context (a user at "Aventar" gets priya@aventar.co, not priya@example.com).
- Money fields follow the relevant currency, and distributions look log-normal-ish (lots of small transactions, a long tail).
- Free-text fields don't say lorem ipsum. They say whatever is contextually plausible: for a CRM they're notes about deals; for fintech they're transaction memos.
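The "log-normal-ish" shape mentioned for money fields can be sketched with Python's standard library. This is an illustration of the distribution only; the `sample_amounts` helper, the median of 25.0, and the sigma of 1.0 are assumed values, not seedkit defaults.

```python
import math
import random
import statistics


def sample_amounts(rng: random.Random, n: int,
                   median: float = 25.0, sigma: float = 1.0) -> list[float]:
    """Draw n positive amounts from a log-normal distribution.

    For a log-normal distribution the median equals exp(mu), so mu is
    set to log(median). A larger sigma gives a heavier right tail.
    """
    mu = math.log(median)
    return [round(rng.lognormvariate(mu, sigma), 2) for _ in range(n)]


rng = random.Random(42)  # fixed seed so the sample is reproducible
amounts = sample_amounts(rng, 10_000)

# Many small values and a long tail pull the mean above the median.
print("median:", statistics.median(amounts))
print("mean:  ", round(statistics.mean(amounts), 2))
print("max:   ", max(amounts))
```

A uniform or normal distribution would put the mean and median close together; the gap between them here is what gives the data its many small transactions and occasional large one.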
What it isn't
- It's not LLM-only. Most columns never see a model call. That's why it's fast and cheap.
- It's not pure heuristics. The model fills in where rule-based generators would produce obvious junk.
- It's not differentially-private synthetic data — we don't take a real dataset as input. If you need that, see Snaplet's open-source code for a different approach.