DataSynth 5.29: Corpus-Grounded Realism — Recurring Templates, Pareto Activity, Reversals, Multi-Currency
The biggest structural-fidelity round since 4.x. Five corpus-grounded levers, plus opt-in SAP-style multi-currency, move every measured realism metric toward production audit data: recurring posting templates (top-50 archetypes cover ~65% of JEs), Pareto account activity (top-10% carry ~95%), a reversal/correction process (~10%), allocation batches (~52 lines/JE), and a business_unit dimension. All are flag-gated and default-on (multi-currency is opt-in), with JE balance and same-seed determinism preserved.
DataSynth 5.29 is the biggest structural-fidelity round since 4.x. As of this week the VynFi production engine, the typed translation layer, and the Generate wizard surface the five corpus-grounded levers it ships. This is the round where the engine stops modeling JEs as independent draws from calibrated distributions and starts modeling them the way a real audit corpus actually looks.
**TL;DR** — A side-by-side fingerprint comparison (real audit corpus vs SOTA-off baseline vs 5.29-on current, healthcare sector) shows the new levers move every measured realism metric toward the corpus: account-Pareto 0.16 → 0.95 (corpus 0.99), recurring-archetype share 0.13 → 0.89 (corpus 0.97), allocation lines-per-JE absent → 56 (corpus 52). The reversal-proxy detection rate is also now ~10% (matching the corpus) versus the engine's previous ~0.2%. Each lever is flag-gated, default-on (multi-currency opt-in), and preserves JE balance plus same-seed determinism.
Why this round — the corpus comparison gap
Up to 5.28, DataSynth modeled financial postings using calibrated distributions, inter-column correlations, and Benford's Law compliance. That's been enough for most VynFi customers — the data clears every statistical sniff test auditors run.
But the engine's fingerprint compared to real audit corpora showed a structural gap: the corpus is heavily templated. A small set of standard postings recurs (top-50 archetypes covered ~65% of JEs, ~97% recurring-archetype share overall), a hot subset of accounts carries most lines (top-10% of accounts handled ~95% of activity), and a handful of process types — reversals, allocation batches — drove the lines-per-JE tail. The 5.28 engine drew fresh, near-uniform accounts per line: 758 unique accounts out of 1000 JEs, against a corpus that recycled the same ~50 archetypes.
Visually, the distributions looked right. Forensically, the structural shape was wrong. A fraud-detection model trained on the engine generalised poorly to the corpus because the engine's coverage breadth made every line look slightly novel — exactly the property a real ledger doesn't have. 5.29 closes that gap.
The five levers
1. Recurring / standard-journal templates
`transactions.recurring_templates` (default-on). A per-(company, doc-type) library of reusable account archetypes, each a set of debit-account / credit-account pairs, drawn from a separate `template_rng`. On the no-priors path, a standard posting is reused with ~90% probability instead of drawing fresh accounts. The archetype cap is 24 per (company, doc-type), so the top-50 archetypes now cover ~65% of JEs (was ~48%; corpus ~65%).
Set `false` for the legacy uniform-per-line account-selection behavior. The override only affects the gl_account chosen for the line; amount, line-count, and date draws are untouched and balance is preserved.
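A minimal Python sketch of the reuse decision, for intuition only. The `TemplateLibrary` class and its method names are hypothetical, each archetype is simplified to a single debit/credit pair, and only the 0.90 reuse probability, the cap of 24, and the separate `template_rng` come from the description above.

```python
import random

REUSE_PROBABILITY = 0.90  # reuse rate quoted above for the no-priors path
ARCHETYPE_CAP = 24        # cap per (company, doc-type) quoted above


class TemplateLibrary:
    """Hypothetical per-(company, doc-type) store of account archetypes."""

    def __init__(self, seed):
        # Dedicated stream: template decisions never consume main-RNG draws.
        self.template_rng = random.Random(seed)
        self.archetypes = {}

    def pick(self, company, doc_type, fresh_pair):
        """Return the (debit_account, credit_account) pair to post with.

        fresh_pair is the pair the main RNG already drew; it is either
        overridden by a stored archetype or kept (and possibly stored).
        """
        library = self.archetypes.setdefault((company, doc_type), [])
        if library and self.template_rng.random() < REUSE_PROBABILITY:
            return self.template_rng.choice(library)
        if len(library) < ARCHETYPE_CAP and fresh_pair not in library:
            library.append(fresh_pair)
        return fresh_pair


if __name__ == "__main__":
    main_rng = random.Random(7)
    accounts = [str(100000 + i) for i in range(1000)]
    lib = TemplateLibrary(seed=11)
    pairs = []
    for _ in range(2000):
        fresh = (main_rng.choice(accounts), main_rng.choice(accounts))
        pairs.append(lib.pick("C001", "SA", fresh))
    print("distinct account pairs:", len(set(pairs)))
```

Because the fresh pair is passed in rather than drawn inside `pick`, the main stream advances identically whether or not a template is reused, which is the property the flag description relies on.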
2. Account-activity Pareto
`transactions.account_concentration` (default-on). A Zipf (s=2.0) power-law override of the per-line account pick concentrates posting activity onto a hot subset of accounts. Result: the top-10% of accounts carry ~95% of lines, in line with the corpus. The uniform pool draw is still consumed on the main RNG (no direct amount/date/count change); only the selected account moves toward the hot set, via a precomputed harmonic table and a dedicated `account_rng`.
Pinned by `test_account_concentration_creates_pareto`. The change is reproducible for a given seed but not byte-identical to 5.28 — because the selected account feeds line-text generation (whose RNG draw count is account-dependent), downstream amount values shift even though their marginal distributions are unchanged.
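The Zipf override can be sketched in a few lines of Python. This is an illustration under assumptions, not the engine's code: the function names, the 100-account pool, and the rank ordering are hypothetical; the exponent s=2.0, the precomputed cumulative (harmonic) table, and the still-consumed uniform draw follow the description above.

```python
import bisect
import random


def build_harmonic_table(n_accounts, s=2.0):
    """Cumulative Zipf weights: rank k (1-based) has weight 1 / k**s."""
    table, total = [], 0.0
    for rank in range(1, n_accounts + 1):
        total += 1.0 / rank ** s
        table.append(total)
    return table


def pick_account(accounts, table, main_rng, account_rng):
    """Pick an account rank by inverting the cumulative table."""
    # The uniform pool draw is still consumed so the main stream stays aligned.
    _uniform_pick = main_rng.randrange(len(accounts))
    # The override itself comes from the dedicated stream.
    u = account_rng.random() * table[-1]
    return accounts[bisect.bisect_left(table, u)]


def top_decile_share(picks):
    """Share of lines carried by the busiest 10% of distinct accounts."""
    counts = {}
    for account in picks:
        counts[account] = counts.get(account, 0) + 1
    ordered = sorted(counts.values(), reverse=True)
    hot = max(1, len(ordered) // 10)
    return sum(ordered[:hot]) / len(picks)


if __name__ == "__main__":
    accounts = [f"A{i:03d}" for i in range(100)]
    table = build_harmonic_table(len(accounts))
    main_rng, account_rng = random.Random(1), random.Random(2)
    picks = [pick_account(accounts, table, main_rng, account_rng) for _ in range(10000)]
    print("top-decile share:", round(top_decile_share(picks), 3))
```

For this assumed pool of 100 accounts, the first ten ranks hold roughly 95% of the total Zipf weight at s=2.0, so the sketch reproduces the order of magnitude quoted above; the real account pool and ranking may differ.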
3. Reversal / correction process
`transactions.reversal_rate` (default 0.10). A fraction of JEs are balanced reversals of a recent entry: dr/cr swapped, header_text reads `Reversal of {id}`, reference reads `REV-{id}`, derived id is deterministically computed as `orig ^ salt` so the same source JE is never reversed twice. Built from a buffered JE so the reversal inherits the source's code, line text, and audit flags — and is interspersed via a separate `reversal_rng`.
Why this matters: auditors actively look for reversals — they're a strong indicator of bookkeeping corrections, error remediation, or potential management override. The 5.28 engine had ~0.2% reversal-proxy detection (basically none). The corpus shows ~10%. 5.29 lands the reversal rate at the corpus level. Set `0.0` to disable.
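As a rough illustration of what a generated reversal looks like, here is a Python sketch. The `Line` and `JournalEntry` types and the salt constant are hypothetical stand-ins; the dr/cr swap, the `Reversal of {id}` header text, the `REV-{id}` reference, and the `orig ^ salt` id derivation are taken from the description above.

```python
from dataclasses import dataclass, replace

REVERSAL_SALT = 0x5DEECE66D  # hypothetical value; the real salt is not published


@dataclass(frozen=True)
class Line:
    account: str
    debit: int   # amounts in minor units (cents)
    credit: int


@dataclass(frozen=True)
class JournalEntry:
    id: int
    header_text: str
    reference: str
    lines: tuple


def make_reversal(source):
    """Build a balanced reversal of a buffered source entry."""
    # Swap debit and credit on every line; accounts are inherited unchanged.
    reversed_lines = tuple(
        replace(line, debit=line.credit, credit=line.debit) for line in source.lines
    )
    return JournalEntry(
        id=source.id ^ REVERSAL_SALT,  # deterministic, so a repeat would collide
        header_text=f"Reversal of {source.id}",
        reference=f"REV-{source.id}",
        lines=reversed_lines,
    )


if __name__ == "__main__":
    src = JournalEntry(
        id=1001,
        header_text="Accrual",
        reference="DOC-1001",
        lines=(Line("600000", 2500, 0), Line("210000", 0, 2500)),
    )
    print(make_reversal(src))
```

Since XOR with a fixed salt is its own inverse, the source id can be recovered from the reversal id, and a second reversal of the same source would produce the same derived id, which makes duplicates easy to detect.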
4. Allocation / assessment-batch process
`transactions.allocation_batch_rate` (default ~0.008). A small fraction of JEs are large 1-to-many allocation batches — the corpus's lines-per-JE tail is driven by allocation/assessment-batch (AB-source) documents that average ~52 lines per JE, against the engine's previous ~4.6 mean with no large-batch process. Reuses a buffered JE for a valid header, then explodes its largest debit line into ~30-80 cost-center-spread sub-lines summing to the same amount. Balance and the main RNG are preserved; the cost-center dimension breadth rises.
Tagged source `AB`, which is now reserved for this process — removed from the default source-mix so synthetic `AB` lines-per-JE matches the corpus rather than blending with small manual postings.
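The explode step can be sketched as follows. This is a simplified Python illustration: the dict-based line shape, the weighting scheme, and the function name are assumptions; the 30 to 80 sub-line range, the cost-center spread, and the requirement that sub-lines sum to the original amount come from the description above.

```python
import random


def explode_largest_debit(lines, cost_centers, allocation_rng,
                          min_parts=30, max_parts=80):
    """Split the largest debit line into many cost-center sub-lines.

    lines: list of dicts with account, cost_center, debit, credit (in cents).
    """
    idx = max(range(len(lines)), key=lambda i: lines[i]["debit"])
    source = lines[idx]
    total = source["debit"]
    n_parts = allocation_rng.randint(min_parts, max_parts)
    # Random positive weights; the +0.1 floor avoids near-zero shares.
    weights = [allocation_rng.random() + 0.1 for _ in range(n_parts)]
    scale = sum(weights)
    amounts = [int(total * w / scale) for w in weights]
    # Integer truncation leaves a small remainder; put it on the last sub-line
    # so the sub-lines sum exactly to the original debit.
    amounts[-1] += total - sum(amounts)
    sub_lines = [
        {**source, "debit": amount, "cost_center": allocation_rng.choice(cost_centers)}
        for amount in amounts
    ]
    return lines[:idx] + sub_lines + lines[idx + 1:]


if __name__ == "__main__":
    je = [
        {"account": "650000", "cost_center": "CC100", "debit": 1_000_000, "credit": 0},
        {"account": "200000", "cost_center": None, "debit": 0, "credit": 1_000_000},
    ]
    centers = [f"CC{i}" for i in range(100, 140)]
    batch = explode_largest_debit(je, centers, random.Random(3))
    print(len(batch), "lines, debits sum to", sum(l["debit"] for l in batch))
```

Working in integer minor units keeps the balance check exact; a floating-point split would need a tolerance instead.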
5. Business-unit dimension
`transactions.business_unit_dimension` (default-on). A line-level `business_unit` field — an organisational segment the corpus carries (~11 codes) but the engine lacked entirely. It is a deterministic roll-up of the cost center or profit center (FNV-bucketed into BU01..BU11), so the same CC/PC always maps to the same BU and BU-level analytics stay coherent. Populated wherever a CC or PC is present — including allocation-batch sub-lines, which re-derive BU from their overridden CC.
Emitted in JSON (automatic), the full journal_entries.csv (now 47 columns), and the Parquet sink (now 16 columns). Set `false` to leave it empty (legacy). Header-parity self-checks and Parquet schema assertions are updated accordingly; integration tests resolve columns by name, so the added column is non-breaking for column-name-resolving consumers (which is what VynFi's portal visualizers use — see the regression-pin test that landed in PR #74).
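The deterministic roll-up can be pictured with a short Python sketch. The post only says the mapping is "FNV-bucketed", so the choice of the 64-bit FNV-1a variant, the modulo bucketing, and the function names here are assumptions; the CC-then-PC fallback and the BU01..BU11 range follow the description above.

```python
FNV_OFFSET_BASIS = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3              # standard 64-bit FNV prime


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a hash (variant assumed; the post does not specify one)."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


def business_unit(cost_center, profit_center, n_units=11):
    """Map a cost center (or, failing that, a profit center) to BU01..BU11."""
    key = cost_center or profit_center
    if not key:
        return None  # no CC or PC present: leave the field empty
    bucket = fnv1a_64(key.encode("utf-8")) % n_units
    return f"BU{bucket + 1:02d}"


if __name__ == "__main__":
    for cc, pc in [("CC1000", None), ("CC1000", "PC20"), (None, "PC20"), (None, None)]:
        print(cc, pc, "->", business_unit(cc, pc))
```

Because the mapping depends only on the CC or PC string, an allocation sub-line whose cost center was overridden simply gets the BU of its new cost center.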
Bonus: SAP-style multi-currency postings
`transactions.foreign_currency_rate` (default-off). SAP-style: a fraction of JEs post in a foreign document currency. `debit_amount` / `credit_amount` / `local_amount` stay the company-ledger amount (SAP DMBTR — the trial balance and every ledger aggregation are untouched), while the new per-line `transaction_amount` (WRBTR) plus `header.currency` (WAERS) and `header.exchange_rate` carry the foreign value. The JE balances in both currencies.
Default-off (opt-in) — enable for FX realism / corpus-matching (the corpus shows ~3.5% functional-currency-≠-reporting-currency JEs). Additive at the schema level: new `transaction_amount` column → journal_entries.csv now 48 columns when enabled; JSON automatic. Drawn from a separate `fx_rng` (main RNG untouched); applies on the normal posting path — reversals and allocation batches stay company-currency. VynFi surfaces this as a Scale+ wizard toggle (see /dashboard/generate Enrich step).
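To make the two-amount model concrete, here is a Python sketch of attaching a document-currency amount to each line while leaving the ledger amount alone. The function names, the rate convention (local units per one foreign unit), the example rate, and the residual-on-last-line rounding rule are all assumptions for illustration; only the split between `local_amount` and `transaction_amount` and the balance-in-both-currencies requirement come from the description above.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def signed_total(posted, field):
    """Debits minus credits for the given amount field."""
    return sum((p[field] if p["side"] == "D" else -p[field]) for p in posted)


def add_transaction_amounts(lines, exchange_rate):
    """Attach a foreign-currency amount to each (side, local_amount) line.

    exchange_rate is assumed to be local-currency units per foreign unit.
    The local amounts are never modified.
    """
    posted = []
    for side, local_amount in lines:
        foreign = (local_amount / exchange_rate).quantize(CENT, rounding=ROUND_HALF_UP)
        posted.append(
            {"side": side, "local_amount": local_amount, "transaction_amount": foreign}
        )
    # Per-line rounding can leave the foreign side off by a cent or two;
    # absorb the residual on the last line so both currencies balance.
    residual = signed_total(posted, "transaction_amount")
    last = posted[-1]
    last["transaction_amount"] -= residual if last["side"] == "D" else -residual
    return posted


if __name__ == "__main__":
    lines = [("D", Decimal("100.00")), ("D", Decimal("50.00")), ("C", Decimal("150.00"))]
    for row in add_transaction_amounts(lines, Decimal("1.0837")):  # hypothetical rate
        print(row)
```

Keeping the company-ledger amount as the source of truth and deriving the foreign amount from it is what leaves the trial balance untouched when the flag is enabled.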
What the round measured — and what didn't move
**Side-by-side fingerprint (healthcare sector, FINDINGS.md §10)**:

• Account-Pareto: corpus 0.99 / SOTA-off 0.16 / **5.29-on 0.95** ✓
• Recurring-archetype share: corpus 0.97 / SOTA-off 0.13 / **5.29-on 0.89** ✓
• Allocation lines-per-JE: corpus 52 / SOTA-off absent / **5.29-on 56** ✓
• Reversal-proxy detection: corpus ~0.10 / SOTA-off 0.002 / **5.29-on 0.10** ✓ (post-tuning)
• Top-50 archetype coverage: corpus 0.65 / SOTA-off 0.48 / **5.29-on ~0.65** ✓ (post-tuning)
• Source-mix entropy: matches the corpus; raw distinct-code count intentionally left broad (general-realism breadth, not over-fit to the health subset's 46 codes).
Three metrics needed tuning between the initial round and the GA release. The reversal rate was bumped from 0.04 → 0.10 (proxy detects ~85% of reversals, so 0.10 lands at the corpus's ~0.10). Templating depth was tuned: reuse probability 0.82 → 0.90 and per-(company, doc-type) archetype cap 48 → 24, so the top-50 archetypes cover more JEs. Business-unit fill was lifted by rolling up from CC or PC (fallback): ~24% → ~82%, matching the corpus.
What this changes for your ML / fraud-detection pipeline
If you train a fraud detector on VynFi data and validate it against your production ledger, 5.29 is the round you've been waiting for. The structural-shape gap was where models trained on synthetic data generalised poorly to real corpora — the model learned features keyed to the synthetic data's uniform coverage and missed the heavy-templating signal the real ledger carries.
Concretely: a fraud-detection model trained on 5.29 sees a hot account subset (which is what the production ledger has) and learns to flag deviations from it. The 5.28 model saw uniform coverage and had no hot-set signal to deviate from. The same story holds for anomaly detection on lines-per-JE (the corpus's bimodal small-manual-vs-large-allocation pattern is now present), reversal detection (now ~10% prevalence matches what an auditor expects), and business-unit-level analytics (now coherent across JEs sharing the same cost center).
Determinism and backwards compatibility
Every randomized lever uses a dedicated RNG stream (`template_rng`, `account_rng`, `reversal_rng`, `allocation_rng`, `fx_rng`), so the direct amount, line-count, and date draws on the main RNG are untouched. The business-unit dimension is a deterministic roll-up and needs no stream of its own. The output for a given seed is reproducible. It is NOT byte-for-byte identical to 5.28, because the selected account is fed to line-text generation (whose RNG draw count is account-dependent), so downstream amount values shift even though their distributions are unchanged.
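The stream-isolation idea is easy to demonstrate in miniature. The Python sketch below is a toy, not the engine: the generator function, the seed derivation for the side stream, and the value ranges are hypothetical. It also deliberately omits line-text generation, which is the step that makes real 5.29 output diverge from 5.28.

```python
import random


def generate_rows(seed, use_account_override, n_rows=5):
    """Toy generator: accounts may be overridden, amounts come from main_rng."""
    main_rng = random.Random(seed)
    account_rng = random.Random(seed ^ 0xACC)  # hypothetical stream derivation
    rows = []
    for _ in range(n_rows):
        account = main_rng.randrange(1000)       # uniform draw always consumed
        if use_account_override:
            account = account_rng.randrange(10)  # hot-set pick, dedicated stream
        amount = round(main_rng.uniform(10, 1000), 2)
        rows.append((account, amount))
    return rows


if __name__ == "__main__":
    legacy = generate_rows(42, use_account_override=False)
    current = generate_rows(42, use_account_override=True)
    # Amounts match row for row because the override never touches main_rng.
    print([a for _, a in legacy] == [a for _, a in current])
    # The same seed and flags always reproduce the same rows.
    print(current == generate_rows(42, use_account_override=True))
```

In this toy the amounts are identical with the override on or off. In the engine, the overridden account also changes how many draws line-text generation consumes, which is why amount values shift while their distributions do not.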
If you need 5.28 output (e.g., a regression test that pins specific row values), every lever can be disabled. Set `recurring_templates: false`, `account_concentration: 0.0`, `reversal_rate: 0.0`, `allocation_batch_rate: 0.0`, `business_unit_dimension: false`. The 5.28 behavior is preserved exactly. JE balance, Benford compliance, and amount/line-count marginal distributions are preserved across every configuration.
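For a regression suite, the legacy switches can be kept in one place and merged into each request. The Python sketch below only builds the payload; the helper name is hypothetical, the override values are the ones listed above, and the `preset` / `rows` field names are borrowed from the API example later in this post.

```python
import json

# Override values exactly as listed above for reproducing 5.28 behavior.
LEGACY_TRANSACTIONS = {
    "recurring_templates": False,
    "account_concentration": 0.0,
    "reversal_rate": 0.0,
    "allocation_batch_rate": 0.0,
    "business_unit_dimension": False,
}


def legacy_request(preset, journal_entries):
    """Build a request body with every 5.29 lever switched off."""
    return {
        "preset": preset,
        "rows": {"journal_entries": journal_entries},
        # Copy so callers cannot mutate the shared constant by accident.
        "transactions": dict(LEGACY_TRANSACTIONS),
    }


if __name__ == "__main__":
    print(json.dumps(legacy_request("manufacturing_mid", 50000), indent=2))
```

Multi-currency is left out of the overrides because it is already off unless `foreign_currency_rate` is set.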
How to try it
All five levers, plus the multi-currency option, are surfaced in the Generate wizard. Multi-currency is Scale+ tier-gated (the Enrich step has an expandable section); the five levers are on by default for every tier and need no configuration. Via the API:
    {
      "preset": "manufacturing_mid",
      "rows": { "journal_entries": 50000 },
      "transactions": {
        "recurring_templates": true,
        "account_concentration": 2.0,
        "reversal_rate": 0.10,
        "allocation_batch_rate": 0.008,
        "business_unit_dimension": true,
        "foreign_currency_rate": 0.035
      }
    }

Or just submit without any `transactions` block: every default in 5.29 lands you at corpus parity. The wizard's preset configurations match.
What's next
5.29 closes the corpus-grounded structural-fidelity loop. The next round (5.30) is about validating that closure rigorously: a Sajja-style exact detection-rate eval (research-grade fraud-detection benchmarking), per-source IET burst clustering for temporal realism, and per-process fraud rate overrides for ML training datasets that want differential difficulty across business processes. We're also tightening the memory ceiling on enterprise-scale group consolidations (streaming aggregate walks, IC-only JE retention, mimalloc as global allocator).
The plan: stop adding realism features and start proving the ones we have. We'll have more to share when 5.30 ships.