DataSynth 5.29: Corpus-Grounded Realism — Recurring Templates, Pareto Activity, Reversals, Multi-Currency

The biggest structural-fidelity round since 4.x. Five corpus-grounded levers, plus opt-in SAP-style multi-currency, move every measured realism metric toward production audit data: recurring posting templates (top-50 archetypes cover ~65% of JEs), Pareto account activity (top-10% of accounts carry ~95% of lines), a reversal/correction process (~10% of JEs), allocation batches (~52 lines/JE), and a business_unit dimension. All are flag-gated and default-on (multi-currency is opt-in); JE balance and same-seed determinism are preserved.

VynFi Team · Engineering · May 27, 2026 · 11 min read

DataSynth 5.29 is the biggest structural-fidelity round since 4.x. As of this week the VynFi production engine, the typed translation layer, and the Generate wizard surface the five corpus-grounded levers it ships. This is the round where the engine stops modeling JEs as independent draws from calibrated distributions and starts modeling them the way a real audit corpus actually looks.

**TL;DR** — A side-by-side fingerprint comparison (real audit corpus vs SOTA-off baseline vs 5.29-on current, healthcare sector) shows the new levers move every measured realism metric toward the corpus: account-Pareto 0.16 → 0.95 (corpus 0.99), recurring-archetype share 0.13 → 0.89 (corpus 0.97), allocation lines-per-JE absent → 56 (corpus 52). The reversal-proxy detection rate is also now ~10% (matching the corpus) versus the engine's previous ~0.2%. Each lever is flag-gated, default-on (multi-currency opt-in), and preserves JE balance plus same-seed determinism.

Why this round — the corpus comparison gap

Up to 5.28, DataSynth modeled financial postings using calibrated distributions, inter-column correlations, and Benford's Law compliance. That's been enough for most VynFi customers — the data clears every statistical sniff test auditors run.

But the engine's fingerprint compared to real audit corpora showed a structural gap: the corpus is heavily templated. A small set of standard postings recurs (top-50 archetypes covered ~65% of JEs, ~97% recurring-archetype share overall), a hot subset of accounts carries most lines (top-10% of accounts handled ~95% of activity), and a handful of process types — reversals, allocation batches — drove the lines-per-JE tail. The 5.28 engine drew fresh, near-uniform accounts per line: 758 unique accounts out of 1000 JEs, against a corpus that recycled the same ~50 archetypes.

Visually, the distributions looked right. Forensically, the structural shape was wrong. A fraud-detection model trained on the engine generalised poorly to the corpus because the engine's coverage breadth made every line look slightly novel — exactly the property a real ledger doesn't have. 5.29 closes that gap.

The five levers

1. Recurring / standard-journal templates

`transactions.recurring_templates` (default-on). Each (company, doc-type) pair gets a library of reusable account archetypes (sets of debit-account / credit-account pairs), drawn from a separate `template_rng`. On the no-priors path, a standard posting is reused with ~90% probability instead of drawing fresh accounts. The archetype cap is 24 per (company, doc-type), so the top-50 archetypes now cover ~65% of JEs (was ~48%; corpus ~65%).

Set `false` for the legacy uniform-per-line account-selection behavior. The override only affects the gl_account chosen for the line; amount, line-count, and date draws are untouched and balance is preserved.
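The reuse-or-draw-fresh decision can be sketched roughly as follows. This is a minimal illustration, not the engine's code: the function name `pick_account_pair`, the library layout, and the sample account pool are hypothetical; only the ~0.90 reuse probability, the cap of 24, and the dedicated `template_rng` come from the description above.

```python
import random

REUSE_PROBABILITY = 0.90  # reuse probability quoted in the post
ARCHETYPE_CAP = 24        # archetype cap per (company, doc-type) quoted in the post


def pick_account_pair(library, key, template_rng, draw_fresh_pair):
    """Return a (debit, credit) account pair, reusing a stored archetype most of the time."""
    archetypes = library.setdefault(key, [])
    if archetypes and template_rng.random() < REUSE_PROBABILITY:
        return template_rng.choice(archetypes)
    pair = draw_fresh_pair()
    # Only remember the fresh pair while the library is below its cap.
    if len(archetypes) < ARCHETYPE_CAP:
        archetypes.append(pair)
    return pair


# Hypothetical usage: one company / doc-type, 1000 postings.
template_rng = random.Random(42)
pool_rng = random.Random(7)
accounts = [str(n) for n in range(100000, 100800)]


def draw_fresh_pair():
    return (pool_rng.choice(accounts), pool_rng.choice(accounts))


library = {}
pairs = [
    pick_account_pair(library, ("C001", "SA"), template_rng, draw_fresh_pair)
    for _ in range(1000)
]
distinct_pairs = len(set(pairs))
print(f"{distinct_pairs} distinct account pairs across {len(pairs)} postings")
```

With most postings recycled from a small library, the number of distinct account pairs stays far below the number of postings, which is the templated shape the corpus shows.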

2. Account-activity Pareto

`transactions.account_concentration` (default-on). A Zipf (s=2.0) power-law override of the per-line account pick concentrates posting activity onto a hot subset of accounts. Result: the top-10% of accounts carry ~95% of lines, in line with the corpus. The uniform pool draw is still consumed on the main RNG (no direct amount/date/count change); only the selected account moves toward the hot set, via a precomputed harmonic table and a dedicated `account_rng`.

Pinned by `test_account_concentration_creates_pareto`. The change is reproducible for a given seed but not byte-identical to 5.28 — because the selected account feeds line-text generation (whose RNG draw count is account-dependent), downstream amount values shift even though their marginal distributions are unchanged.
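A Zipf-weighted pick over a precomputed table might look like the sketch below. It assumes the "harmonic table" is a cumulative distribution over rank weights 1/rank^s, which the post does not spell out; the helper names and the 500-account pool are illustrative only.

```python
import bisect
import random
from collections import Counter


def build_zipf_cdf(n_accounts, s=2.0):
    """Precompute the cumulative Zipf distribution over account ranks 1..n."""
    weights = [1.0 / (rank ** s) for rank in range(1, n_accounts + 1)]
    total = sum(weights)
    cdf, running = [], 0.0
    for weight in weights:
        running += weight / total
        cdf.append(running)
    return cdf


def pick_account(accounts, cdf, account_rng):
    """Map one uniform draw from the dedicated stream onto a rank via binary search."""
    u = account_rng.random()
    idx = bisect.bisect_left(cdf, u)
    return accounts[min(idx, len(accounts) - 1)]  # guard against float round-off


# Hypothetical usage: 500 accounts, 20000 posted lines.
accounts = [f"ACC{n:04d}" for n in range(500)]
cdf = build_zipf_cdf(len(accounts), s=2.0)
account_rng = random.Random(2024)
n_lines = 20000
counts = Counter(pick_account(accounts, cdf, account_rng) for _ in range(n_lines))
top_decile = sum(c for _, c in counts.most_common(len(accounts) // 10))
top_share = top_decile / n_lines
print(f"top-10% of accounts carry {top_share:.1%} of lines")
```

Because the pick uses its own `account_rng`, the number of draws taken from any other stream does not depend on which account ends up selected.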

3. Reversal / correction process

`transactions.reversal_rate` (default 0.10). A fraction of JEs are balanced reversals of a recent entry: dr/cr swapped, header_text reads `Reversal of {id}`, reference reads `REV-{id}`, derived id is deterministically computed as `orig ^ salt` so the same source JE is never reversed twice. Built from a buffered JE so the reversal inherits the source's code, line text, and audit flags — and is interspersed via a separate `reversal_rng`.

Why this matters: auditors actively look for reversals — they're a strong indicator of bookkeeping corrections, error remediation, or potential management override. The 5.28 engine had ~0.2% reversal-proxy detection (basically none). The corpus shows ~10%. 5.29 lands the reversal rate at the corpus level. Set `0.0` to disable.
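As a rough sketch of the reversal construction described above: the dataclasses, the integer minor-unit amounts, and the salt value are all assumptions made for illustration; the swapped debit/credit sides, the `Reversal of {id}` header, the `REV-{id}` reference, and the `orig ^ salt` id come from the post.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Line:
    gl_account: str
    debit_amount: int   # minor units (cents); an assumption to keep sums exact
    credit_amount: int


@dataclass(frozen=True)
class JournalEntry:
    id: int
    header_text: str
    reference: str
    lines: tuple


REVERSAL_SALT = 0x5EED5EED  # hypothetical constant; the real salt is not published


def build_reversal(source):
    """Build a balanced reversal: same accounts, debit and credit sides swapped."""
    swapped = tuple(
        replace(line, debit_amount=line.credit_amount, credit_amount=line.debit_amount)
        for line in source.lines
    )
    return JournalEntry(
        id=source.id ^ REVERSAL_SALT,  # deterministic: same source always yields the same id
        header_text=f"Reversal of {source.id}",
        reference=f"REV-{source.id}",
        lines=swapped,
    )


source = JournalEntry(
    id=1001,
    header_text="Monthly accrual",
    reference="DOC-1001",
    lines=(Line("6000", 50_000, 0), Line("2000", 0, 50_000)),
)
reversal = build_reversal(source)
print(reversal.header_text, reversal.reference)
```

Since XOR with a fixed salt is its own inverse, the source id can be recovered from the derived id, which makes it straightforward to check whether a given entry has already been reversed.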

4. Allocation / assessment-batch process

`transactions.allocation_batch_rate` (default ~0.008). A small fraction of JEs are large 1-to-many allocation batches — the corpus's lines-per-JE tail is driven by allocation/assessment-batch (AB-source) documents that average ~52 lines per JE, against the engine's previous ~4.6 mean with no large-batch process. Reuses a buffered JE for a valid header, then explodes its largest debit line into ~30-80 cost-center-spread sub-lines summing to the same amount. Balance and the main RNG are preserved; the cost-center dimension breadth rises.

Tagged source `AB`, which is now reserved for this process — removed from the default source-mix so synthetic `AB` lines-per-JE matches the corpus rather than blending with small manual postings.
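The explode step can be illustrated with a small sketch. It assumes dictionary-shaped lines with integer minor-unit amounts and random positive weights for the split; the function name and the weighting scheme are hypothetical, while the 30-80 sub-line range and the balance-preserving sum come from the description above.

```python
import random


def explode_largest_debit(lines, allocation_rng, cost_centers, min_lines=30, max_lines=80):
    """Split the largest debit line into many cost-center sub-lines with the same total."""
    idx = max(range(len(lines)), key=lambda i: lines[i]["debit_amount"])
    target = lines[idx]
    total = target["debit_amount"]
    n = allocation_rng.randint(min_lines, max_lines)
    weights = [allocation_rng.random() + 0.1 for _ in range(n)]
    scale = sum(weights)
    shares = [int(total * w / scale) for w in weights]
    shares[-1] += total - sum(shares)  # push the rounding remainder onto the last sub-line
    sub_lines = [
        dict(target, debit_amount=share, cost_center=allocation_rng.choice(cost_centers))
        for share in shares
    ]
    return lines[:idx] + sub_lines + lines[idx + 1:]


# Hypothetical usage: a two-line buffered JE becomes an allocation batch.
buffered = [
    {"gl_account": "6000", "debit_amount": 1_000_000, "credit_amount": 0, "cost_center": "CC0001"},
    {"gl_account": "2000", "debit_amount": 0, "credit_amount": 1_000_000, "cost_center": None},
]
cost_centers = [f"CC{n:04d}" for n in range(1, 41)]
batch = explode_largest_debit(buffered, random.Random(11), cost_centers)
print(len(batch), "lines after the split")
```

Working in integer minor units and assigning the remainder to one sub-line keeps the debit total exactly equal to the original amount, so the JE stays balanced.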

5. Business-unit dimension

`transactions.business_unit_dimension` (default-on). A line-level `business_unit` field — an organisational segment the corpus carries (~11 codes) but the engine lacked entirely. It is a deterministic roll-up of the cost center or profit center (FNV-bucketed into BU01..BU11), so the same CC/PC always maps to the same BU and BU-level analytics stay coherent. Populated wherever a CC or PC is present — including allocation-batch sub-lines, which re-derive BU from their overridden CC.

Emitted in JSON (automatic), the full journal_entries.csv (now 47 columns), and the Parquet sink (now 16 columns). Set `false` to leave it empty (legacy). Header-parity self-checks and Parquet schema assertions are updated accordingly; integration tests resolve columns by name, so the added column is non-breaking for column-name-resolving consumers (which is what VynFi's portal visualizers use — see the regression-pin test that landed in PR #74).
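A deterministic roll-up of this kind can be sketched as below. The post only says "FNV-bucketed", so the choice of 32-bit FNV-1a, the preference for CC over PC, and the modulo bucketing are assumptions; the function names are hypothetical.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


def business_unit(cost_center, profit_center, n_units=11):
    """Roll a cost center (or, as a fallback, a profit center) up into BU01..BU11."""
    key = cost_center or profit_center
    if not key:
        return ""  # no CC or PC on the line: leave the field empty
    bucket = fnv1a_32(key.encode("utf-8")) % n_units + 1
    return f"BU{bucket:02d}"


print(business_unit("CC1000", None), business_unit(None, "PC2000"))
```

Because the mapping depends only on the CC/PC string, every line carrying the same cost center lands in the same business unit regardless of seed or generation order.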

Bonus: SAP-style multi-currency postings

`transactions.foreign_currency_rate` (default-off). SAP-style: a fraction of JEs post in a foreign document currency. `debit_amount` / `credit_amount` / `local_amount` stay the company-ledger amount (SAP DMBTR — the trial balance and every ledger aggregation are untouched), while the new per-line `transaction_amount` (WRBTR) plus `header.currency` (WAERS) and `header.exchange_rate` carry the foreign value. The JE balances in both currencies.

Default-off (opt-in) — enable for FX realism / corpus-matching (the corpus shows ~3.5% functional-currency-≠-reporting-currency JEs). Additive at the schema level: new `transaction_amount` column → journal_entries.csv now 48 columns when enabled; JSON automatic. Drawn from a separate `fx_rng` (main RNG untouched); applies on the normal posting path — reversals and allocation batches stay company-currency. VynFi surfaces this as a Scale+ wizard toggle (see /dashboard/generate Enrich step).
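The split between ledger and document amounts can be sketched as follows. This is an illustration under stated assumptions: the exchange rate is taken as company-currency units per one document-currency unit, amounts round to two decimals, and any rounding residual is pushed onto the last credit line so the document currency also balances. The post does not describe how the engine handles rounding, and the function name is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")
ZERO = Decimal("0")


def to_document_currency(lines, exchange_rate):
    """Add transaction_amount (document currency) while leaving ledger amounts untouched."""
    out = []
    for line in lines:
        local = line["debit_amount"] or line["credit_amount"]
        txn = (local / exchange_rate).quantize(CENT, rounding=ROUND_HALF_UP)
        out.append(dict(line, local_amount=local, transaction_amount=txn))
    debit_total = sum((l["transaction_amount"] for l in out if l["debit_amount"]), ZERO)
    credit_total = sum((l["transaction_amount"] for l in out if l["credit_amount"]), ZERO)
    residual = debit_total - credit_total
    if residual:
        # Assumed policy: absorb the rounding difference on the last credit line.
        last_credit = max(i for i, l in enumerate(out) if l["credit_amount"])
        out[last_credit]["transaction_amount"] += residual
    return out


# Hypothetical usage: a USD-ledger JE posted in EUR at 1.0837 USD per EUR.
lines = [
    {"gl_account": "6000", "debit_amount": Decimal("100.00"), "credit_amount": ZERO},
    {"gl_account": "2000", "debit_amount": ZERO, "credit_amount": Decimal("33.33")},
    {"gl_account": "2100", "debit_amount": ZERO, "credit_amount": Decimal("33.33")},
    {"gl_account": "2200", "debit_amount": ZERO, "credit_amount": Decimal("33.34")},
]
converted = to_document_currency(lines, Decimal("1.0837"))
for row in converted:
    print(row["gl_account"], row["local_amount"], row["transaction_amount"])
```

The ledger-side `debit_amount` / `credit_amount` values are never modified, which mirrors the post's point that trial balance and ledger aggregations are unaffected.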

What the round measured — and what didn't move

**Side-by-side fingerprint (healthcare sector, FINDINGS.md §10)**:

• Account-Pareto: corpus 0.99 / SOTA-off 0.16 / **5.29-on 0.95** ✓
• Recurring-archetype share: corpus 0.97 / SOTA-off 0.13 / **5.29-on 0.89** ✓
• Allocation lines-per-JE: corpus 52 / SOTA-off absent / **5.29-on 56** ✓
• Reversal-proxy detection: corpus ~0.10 / SOTA-off 0.002 / **5.29-on 0.10** ✓ (post-tuning)
• Top-50 archetype coverage: corpus 0.65 / SOTA-off 0.48 / **5.29-on ~0.65** ✓ (post-tuning)
• Source-mix entropy: matches the corpus; raw distinct-code count intentionally left broad (general-realism breadth, not over-fit to the health subset's 46 codes).
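If you want to run comparable checks on your own ledger extract, the two concentration-style metrics could be approximated as below. The exact definitions used in FINDINGS.md are not given in this post, so both functions are assumptions: one measures the share of lines on the busiest 10% of observed accounts, the other the share of JEs whose debit/credit account signature is among the k most frequent.

```python
import math
from collections import Counter


def top_decile_line_share(line_accounts):
    """Share of lines posted to the busiest 10% of distinct accounts."""
    counts = Counter(line_accounts)
    k = max(1, math.ceil(len(counts) / 10))
    return sum(c for _, c in counts.most_common(k)) / len(line_accounts)


def top_k_archetype_coverage(journal_entries, k=50):
    """Share of JEs whose (debit accounts, credit accounts) signature is in the top k."""
    signatures = Counter(
        (frozenset(je["debit_accounts"]), frozenset(je["credit_accounts"]))
        for je in journal_entries
    )
    return sum(c for _, c in signatures.most_common(k)) / len(journal_entries)


# Tiny hypothetical sample: one hot account and nine rarely used ones.
sample_lines = ["1000"] * 91 + [str(2000 + n) for n in range(9)]
sample_jes = [
    {"debit_accounts": ["6000"], "credit_accounts": ["2000"]},
    {"debit_accounts": ["6000"], "credit_accounts": ["2000"]},
    {"debit_accounts": ["7000"], "credit_accounts": ["1000"]},
]
pareto = top_decile_line_share(sample_lines)
coverage = top_k_archetype_coverage(sample_jes, k=1)
print(pareto, coverage)
```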

Three metrics needed tuning between the initial round and the GA release. The reversal rate was bumped from 0.04 → 0.10 (proxy detects ~85% of reversals, so 0.10 lands at the corpus's ~0.10). Templating depth was tuned: reuse probability 0.82 → 0.90 and per-(company, doc-type) archetype cap 48 → 24, so the top-50 archetypes cover more JEs. Business-unit fill was lifted by rolling up from CC or PC (fallback): ~24% → ~82%, matching the corpus.

What this changes for your ML / fraud-detection pipeline

If you train a fraud detector on VynFi data and validate it against your production ledger, 5.29 is the round you've been waiting for. The structural-shape gap was where models trained on synthetic data generalised poorly to real corpora — the model learned features keyed to the synthetic data's uniform coverage and missed the heavy-templating signal the real ledger carries.

Concretely: a fraud-detection model trained on 5.29 sees a hot account subset (which is what the production ledger has) and learns to flag deviations from it. The 5.28 model saw uniform coverage and had no hot-set signal to deviate from. The same story holds for anomaly detection on lines-per-JE (the corpus's bimodal small-manual-vs-large-allocation pattern is now present), reversal detection (now ~10% prevalence matches what an auditor expects), and business-unit-level analytics (now coherent across JEs sharing the same cost center).

Determinism and backwards compatibility

Every randomized lever uses a dedicated RNG stream (`template_rng`, `account_rng`, `reversal_rng`, `allocation_rng`, `fx_rng`), and the business-unit roll-up is deterministic, so the direct amount, line-count, and date draws on the main RNG are untouched. The output for a given seed is reproducible. It is NOT byte-for-byte identical to 5.28, because the selected account is fed to line-text generation (whose RNG draw count is account-dependent), so downstream amount values shift even though their distributions are unchanged.
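The isolation property of dedicated streams can be demonstrated with a short sketch. The way streams are derived here (hashing the seed together with a stream name) is an assumption for illustration, as are the function names and the amount distribution; the post only states that each lever has its own stream.

```python
import hashlib
import random


def derive_stream(seed: int, name: str) -> random.Random:
    """Derive an independent, reproducible RNG stream from a seed and a stream name."""
    digest = hashlib.sha256(f"{seed}:{name}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def generate(seed, reversal_rate, n=1000):
    """Draw amounts on the main stream; reversal decisions use their own stream."""
    main_rng = derive_stream(seed, "main")
    reversal_rng = derive_stream(seed, "reversal_rng")
    amounts, reversals = [], 0
    for _ in range(n):
        amounts.append(round(main_rng.lognormvariate(6.0, 1.5), 2))
        if reversal_rate > 0 and reversal_rng.random() < reversal_rate:
            reversals += 1
    return amounts, reversals


amounts_off, reversals_off = generate(seed=7, reversal_rate=0.0)
amounts_on, reversals_on = generate(seed=7, reversal_rate=0.10)
print(amounts_off == amounts_on, reversals_off, reversals_on)
```

Turning the lever on or off changes how many draws its own stream consumes but leaves the main stream's sequence alone, so the directly drawn amounts are identical in both runs.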

If you need 5.28 output (e.g., a regression test that pins specific row values), every lever can be disabled. Set `recurring_templates: false`, `account_concentration: 0.0`, `reversal_rate: 0.0`, `allocation_batch_rate: 0.0`, `business_unit_dimension: false`, and leave `foreign_currency_rate` unset (it is off by default). The 5.28 behavior is preserved exactly. JE balance, Benford compliance, and amount/line-count marginal distributions are preserved across every configuration.

How to try it

All five levers, plus multi-currency, are surfaced in the Generate wizard. Multi-currency is Scale+ tier-gated (the Enrich step has an expandable section); the five levers are on by default for every tier and need no configuration. Via the API:

```json
{
  "preset": "manufacturing_mid",
  "rows": { "journal_entries": 50000 },
  "transactions": {
    "recurring_templates": true,
    "account_concentration": 2.0,
    "reversal_rate": 0.10,
    "allocation_batch_rate": 0.008,
    "business_unit_dimension": true,
    "foreign_currency_rate": 0.035
  }
}
```

Or just submit without any `transactions` block: the 5.29 defaults enable the five levers at their corpus-tuned values, with multi-currency off unless you set `foreign_currency_rate`. The wizard's preset configurations match.

What's next

5.29 closes the corpus-grounded structural-fidelity loop. The next round (5.30) is about validating that closure rigorously: a Sajja-style exact detection-rate eval (research-grade fraud-detection benchmarking), per-source IET burst clustering for temporal realism, and per-process fraud rate overrides for ML training datasets that want differential difficulty across business processes. We're also tightening the memory ceiling on enterprise-scale group consolidations (streaming aggregate walks, IC-only JE retention, mimalloc as global allocator).

The plan: stop adding realism features and start proving the ones we have. We'll have more to share when 5.30 ships.
