Synthetic data

How seedkit decides what to put in each column — the layered generator stack, deterministic PRNG, and what makes the data "realistic."


Seedkit's data generator is a layered strategy: cheap, deterministic heuristics first; LLM prompts only where they add value. Per column, we pick the highest-fidelity generator that applies.

The generator stack

For each column, generators are tried in this order. The first one that applies wins:

  1. Typed-name match. If the column name is an exact match for a known semantic type (email, phone, iso_country, currency, iban, bic, slug), use the matching generator directly. Cheap, deterministic, no model call.
  2. Type-shape inference. For jsonb, array, enum, and other structured types, infer a shape from the surrounding schema and generate structurally-valid values.
  3. Domain LLM prompt. For free-text and ambiguous columns, prompt the model with the table name, column name, sibling-column context, and (if set) the --scope hint. The model returns a small batch of values; we shuffle them with the per-seed PRNG and assign.
  4. FK lookup. Foreign-key columns sample from already-inserted parent rows, in topo-sorted order.
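
The first-match-wins ordering above can be sketched in a few lines of Python. This is an illustration only, not seedkit's actual code: the `Column` class, the layer functions, the `pick_generator` helper, and the rule used to decide what counts as "free-text" are all hypothetical names and assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Semantic types named in the text. The layer bodies below are placeholders
# that only report which layer would handle a column.
TYPED_NAMES = {"email", "phone", "iso_country", "currency", "iban", "bic", "slug"}
STRUCTURED_TYPES = {"jsonb", "array", "enum"}
FREE_TEXT_TYPES = {"text", "varchar"}  # assumption: what "free-text" means here


@dataclass
class Column:
    name: str
    sql_type: str
    references: Optional[str] = None  # parent table for FK columns, if any


def typed_name(col: Column) -> Optional[str]:
    # Layer 1: exact column-name match against a known semantic type.
    return f"typed:{col.name}" if col.name in TYPED_NAMES else None


def type_shape(col: Column) -> Optional[str]:
    # Layer 2: structured types get a shape-driven generator.
    return f"shape:{col.sql_type}" if col.sql_type in STRUCTURED_TYPES else None


def domain_llm(col: Column) -> Optional[str]:
    # Layer 3: free-text columns that are not foreign keys go to the model.
    if col.references is None and col.sql_type in FREE_TEXT_TYPES:
        return "llm"
    return None


def fk_lookup(col: Column) -> Optional[str]:
    # Layer 4: foreign keys sample from already-inserted parent rows.
    return f"fk:{col.references}" if col.references else None


STACK: List[Callable[[Column], Optional[str]]] = [
    typed_name, type_shape, domain_llm, fk_lookup,
]


def pick_generator(col: Column) -> str:
    # Try each layer in order; the first one that applies wins.
    for layer in STACK:
        choice = layer(col)
        if choice is not None:
            return choice
    raise LookupError(f"no generator applies to column {col.name!r}")
```

For example, `pick_generator(Column("notes", "text"))` falls through the first two layers and lands on the model, while a column named `email` never reaches it.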

Determinism

Sampling uses a per-seed PRNG (xoroshiro128+ with a seed derived from the seed name). Two runs with the same seed name and schema produce byte-identical inserts. There's no model temperature roulette, no "regenerate to get something usable."

seedkit new --prompt "saas crm" --seed crm-demo
seedkit new --prompt "saas crm" --seed crm-demo
# Same rows. Same order. Same bytes.

See Determinism for the full guarantee.
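
To make the idea concrete, here is a minimal Python sketch of a name-seeded xoroshiro128+ generator driving a shuffle. The step function follows the public 2018 reference version of xoroshiro128+ (rotation constants 24, 16, 37). How seedkit actually derives its state from the seed name is not documented here, so the SHA-256 derivation, the class name, and the `shuffled` helper are assumptions made for illustration.

```python
import hashlib

MASK64 = (1 << 64) - 1


def rotl(x: int, k: int) -> int:
    # 64-bit left rotation.
    return ((x << k) | (x >> (64 - k))) & MASK64


class Xoroshiro128Plus:
    def __init__(self, seed_name: str):
        # Assumption: the 128-bit state is the first 16 bytes of
        # SHA-256(seed name). seedkit's real derivation may differ.
        digest = hashlib.sha256(seed_name.encode("utf-8")).digest()
        self.s0 = int.from_bytes(digest[:8], "little")
        self.s1 = int.from_bytes(digest[8:16], "little")
        if self.s0 == 0 and self.s1 == 0:
            self.s1 = 1  # the all-zero state is invalid for this generator

    def next_u64(self) -> int:
        s0, s1 = self.s0, self.s1
        result = (s0 + s1) & MASK64
        s1 ^= s0
        self.s0 = rotl(s0, 24) ^ s1 ^ ((s1 << 16) & MASK64)
        self.s1 = rotl(s1, 37)
        return result


def shuffled(values, seed_name: str) -> list:
    # Fisher-Yates shuffle driven by the name-seeded generator.
    rng = Xoroshiro128Plus(seed_name)
    out = list(values)
    for i in range(len(out) - 1, 0, -1):
        j = rng.next_u64() % (i + 1)  # slight modulo bias, fine for a sketch
        out[i], out[j] = out[j], out[i]
    return out
```

Because the state depends only on the seed name, `shuffled(batch, "crm-demo")` returns the same ordering on every run and every machine, which is the property the byte-identical guarantee relies on.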

What "realistic" means

  • Names and addresses look like the real thing, and are locale-aware when --scope indicates a locale (German addresses look German, Japanese names look Japanese).
  • Emails end in plausible domains and match the org/company context (a user at "Aventar" gets priya@aventar.co, not priya@example.com).
  • Money fields follow the relevant currency, and distributions look log-normal-ish (lots of small transactions, a long tail).
  • Free-text fields don't say lorem ipsum. They say whatever is contextually plausible — for a CRM they're notes about deals; for fintech they're transaction memos.
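
Two of these properties are easy to illustrate. The Python sketch below derives an email domain from a company name and draws amounts from a log-normal distribution. It is a rough approximation under stated assumptions: the `company_email` and `amounts` helpers, the `.co` default, and the `mu`/`sigma` values are hypothetical, and Python's built-in `random.Random` (Mersenne Twister) stands in for seedkit's xoroshiro128+.

```python
import random
import re


def company_email(first_name: str, company: str, tld: str = "co") -> str:
    # Assumption: the domain is the company name lowercased with
    # non-alphanumeric characters removed.
    domain = re.sub(r"[^a-z0-9]+", "", company.lower())
    return f"{first_name.lower()}@{domain}.{tld}"


def amounts(n: int, seed: int, mu: float = 3.0, sigma: float = 1.0) -> list:
    # Log-normal draws: many small values and a long right tail.
    # random.Random is a stand-in PRNG here, not xoroshiro128+.
    rng = random.Random(seed)
    return [round(rng.lognormvariate(mu, sigma), 2) for _ in range(n)]


print(company_email("Priya", "Aventar"))  # priya@aventar.co
```

With these parameters the mean of a large sample sits well above its median, which is the "lots of small transactions, a long tail" shape described above.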

What it isn't

  • It's not LLM-only. Most columns never see a model call. That's why it's fast and cheap.
  • It's not pure heuristics. The model fills in where rule-based generators would produce obvious junk.
  • It's not differentially private synthetic data: we don't take a real dataset as input. If you need that, see Snaplet's open-source code for a different approach.