DataSynth: Reference Knowledge Graphs for Enterprise Audit Analytics through Synthetic Data Generation with Provable Statistical Properties
VynFi Research Team | April 2026
Abstract
Enterprise audit analytics faces a fundamental epistemological challenge: the ground truth of financial data is not directly observable. Recovering truth from enterprise data is a combinatorial inverse problem whose configuration space grows super-exponentially with journal entry complexity, rendering exhaustive or even approximate recovery computationally infeasible. Meanwhile, systematic errors in multi-stage business processes survive downstream controls undetected with a probability of 77 to 95 percent.
This paper introduces a forward generation paradigm implemented in DataSynth, a Rust-based engine that constructs synthetic financial data with provable statistical properties through a three-layer knowledge model: structural (topology of financial relationships), statistical (empirically calibrated distributions), and normative (accounting rules and business constraints). The engine is calibrated against 155 real-world datasets comprising 364 million journal entries and 2.4 billion line items.
DataSynth achieves Benford Mean Absolute Deviation scores below 0.006, maintains 100 percent balance assertion compliance, generates over 130 anomaly subtypes with ground-truth labels across five knowledge dimensions, and produces over 200,000 journal entries per second. An epsilon-differential privacy fingerprinting module enables statistical calibration against real datasets without exposing individual records.
Note: This paper is currently under peer review. The findings and methodology described here are subject to revision based on the review process. We will update this page when the paper is published.
Key Findings
Quantitative results from the DataSynth research
Journal Entries
Calibrated against 155 real-world datasets comprising 364 million journal entries and 2.4 billion line items across multiple sectors and geographies.
Anomaly Subtypes
Multi-stage fraud schemes, isolated anomalies, and structural errors with ground-truth labels across five knowledge dimensions.
Entries per Second
Rust-based forward generation engine produces over 200,000 journal entries per second with all knowledge layers active.
Benford MAD Score
Mean Absolute Deviation well below the 0.012 threshold for close conformity. 100% of journal entries pass balance assertions.
Knowledge Model
Structural topology, statistical distributions, and normative accounting rules working together to produce provably correct synthetic data.
Error Propagation
Systematic errors in multi-stage processes survive downstream controls with 77-95% probability, demonstrating why forward generation is necessary.
Core Contributions
1. The Ground Truth Problem
The paper formalizes the infeasibility of recovering ground truth from enterprise financial data. Using Stirling numbers of the second kind, it demonstrates that the configuration space of possible data states grows to 10^155,630 for realistic enterprise complexity, making the inverse recovery problem physically impossible to solve. This establishes the theoretical necessity of forward generation approaches.
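To give a feel for this growth rate, the sketch below computes Stirling numbers of the second kind, S(n, k), which count the ways to partition n labelled items into k non-empty groups, and sums them into the total number of partitions. It is an illustration only: the function names and sample sizes are assumptions, and it does not reproduce the paper's derivation of the 10^155,630 figure.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Ways to partition n labelled items into k non-empty groups."""
    if n == k:
        return 1  # covers S(0, 0) = 1
    if k == 0 or k > n:
        return 0
    # Item n either joins one of the k existing groups or opens a new one.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def partition_count(n: int) -> int:
    """Total partitions of n items into any number of groups (Bell number)."""
    return sum(stirling2(n, k) for k in range(n + 1))


# Illustrative sizes only; real journal-entry populations are far larger.
for n in (5, 10, 20, 40):
    digits = len(str(partition_count(n)))
    print(f"n={n}: partition count has {digits} digits")
```

The count grows faster than any fixed exponential in n, which is why enumerating candidate groupings stops being an option well before realistic enterprise sizes.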
2. Three-Layer Knowledge Model
DataSynth introduces a layered architecture separating structural knowledge (entity topology and relationships), statistical knowledge (empirically calibrated distributions with copula-based dependencies), and normative knowledge (accounting rules and business constraints). This separation enables independent validation of each layer and composable generation of complex financial datasets.
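The separation of layers can be pictured with a toy generator. This is a minimal sketch under stated assumptions, not DataSynth's implementation: the chart of accounts, the log-normal parameters, and every function name are hypothetical, and the copula-based dependencies mentioned above are left out.

```python
import random
from decimal import Decimal

# Structural layer (hypothetical): account pairs allowed to post together.
POSTING_PATTERNS = [
    ("1200 Accounts Receivable", "4000 Revenue"),
    ("5000 Cost of Goods Sold", "1400 Inventory"),
    ("2000 Accounts Payable", "1000 Cash"),
]


def sample_amount(rng: random.Random) -> Decimal:
    # Statistical layer: log-normal amounts; mu and sigma are made-up values.
    raw = Decimal(rng.lognormvariate(7.0, 1.5)).quantize(Decimal("0.01"))
    return max(raw, Decimal("0.01"))


def is_balanced(lines) -> bool:
    # Normative layer: total debits must equal total credits.
    debits = sum((l["amount"] for l in lines if l["side"] == "D"), Decimal(0))
    credits = sum((l["amount"] for l in lines if l["side"] == "C"), Decimal(0))
    return debits == credits


def generate_entry(rng: random.Random):
    debit_account, credit_account = rng.choice(POSTING_PATTERNS)
    amount = sample_amount(rng)
    lines = [
        {"account": debit_account, "side": "D", "amount": amount},
        {"account": credit_account, "side": "C", "amount": amount},
    ]
    # Each layer can be checked on its own; here only the normative rule is.
    assert is_balanced(lines)
    return lines


rng = random.Random(42)
for line in generate_entry(rng):
    print(line["side"], line["account"], line["amount"])
```

Because each layer is a separate function or table, one can be swapped or validated without touching the others, which is the point of the layered design described above.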
3. Provable Statistical Properties
Generated data achieves Benford MAD scores below 0.006 (well within the forensic accounting threshold of 0.012 for close conformity), maintains 100% double-entry balance compliance, and preserves cross-column correlations within 0.05 of calibration targets. These properties are not aspirational; they are verifiable on every generated dataset.
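A first-digit Benford check of this kind is straightforward to run on any dataset. The sketch below is one way to compute it, assuming the MAD is taken over the nine first-digit frequencies; the function names are hypothetical, and the powers-of-two sample is used only because it is a well-known Benford-conforming sequence, not because it comes from DataSynth.

```python
import math
from collections import Counter

# Expected first-digit frequencies under Benford's law.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value) -> int:
    # Strip sign, leading zeros and the decimal point, then take the first character.
    return int(str(abs(value)).lstrip("0.")[0])


def benford_mad(values) -> float:
    """Mean absolute deviation between observed and Benford first-digit shares."""
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    n = len(digits)
    return sum(abs(counts[d] / n - BENFORD[d]) for d in range(1, 10)) / 9


conforming = [2 ** n for n in range(1, 1001)]   # powers of two follow Benford closely
uniform = list(range(1, 10)) * 100              # every first digit equally likely

print("powers of two:", benford_mad(conforming))
print("uniform digits:", benford_mad(uniform))
```

The same function can be pointed at the amount column of a generated dataset and compared against the 0.006 and 0.012 figures quoted above.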
4. Ground-Truth Anomaly Framework
Over 130 anomaly subtypes are organized across five knowledge dimensions (temporal, amount, relationship, pattern, structural). Multi-stage fraud schemes are modeled as state machines that evolve across transactions. Every anomalous record carries ground-truth labels for type, severity, difficulty, confidence, and scheme membership.
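The state-machine idea can be sketched as follows. The stage names, the linear transition table, and the label values are all hypothetical choices for illustration; only the label fields (type, severity, difficulty, confidence, scheme membership) and the five dimensions come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    # Hypothetical stages of a multi-stage scheme.
    SETUP = "setup"
    TEST = "test"
    EXTRACT = "extract"
    CONCEAL = "conceal"


# Hypothetical linear progression; the last stage has no successor.
NEXT_STAGE = {
    Stage.SETUP: Stage.TEST,
    Stage.TEST: Stage.EXTRACT,
    Stage.EXTRACT: Stage.CONCEAL,
}


@dataclass(frozen=True)
class AnomalyLabel:
    anomaly_type: str
    dimension: str      # temporal, amount, relationship, pattern or structural
    severity: str
    difficulty: str
    confidence: float
    scheme_id: str      # scheme membership
    stage: Stage


def run_scheme(scheme_id: str, n_transactions: int) -> list:
    """Emit one ground-truth label per transaction as the scheme advances."""
    stage = Stage.SETUP
    labels = []
    for _ in range(n_transactions):
        labels.append(AnomalyLabel(
            anomaly_type="fictitious_vendor",   # placeholder subtype name
            dimension="relationship",
            severity="high",
            difficulty="hard",
            confidence=1.0,                     # injected, so known with certainty
            scheme_id=scheme_id,
            stage=stage,
        ))
        stage = NEXT_STAGE.get(stage, stage)    # stay in the final stage
    return labels


for label in run_scheme("SCHEME-001", 5):
    print(label.scheme_id, label.stage.value)
```

Keeping the stage on the label lets a detection model be scored per stage as well as per scheme.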
5. Privacy-Preserving Fingerprints
An epsilon-differential privacy module extracts statistical fingerprints from real datasets without retaining individual records. The clean separation of the privacy boundary from the generation process enables cross-organization analytics without data exposure. Epsilon is configurable from 0.01 to 1.0.
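One standard way to obtain an epsilon-differentially private fingerprint is the Laplace mechanism applied to histogram counts. The sketch below shows that textbook mechanism; whether DataSynth uses it is an assumption, and the function names are hypothetical. Only the 0.01 to 1.0 epsilon range is taken from the text.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_histogram(counts, epsilon: float, rng: random.Random):
    """Release histogram bin counts with Laplace noise calibrated to epsilon.

    Assumes adding or removing one record changes exactly one bin by 1,
    so the L1 sensitivity of the histogram is 1.
    """
    if not 0.01 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0.01, 1.0]")
    scale = 1.0 / epsilon  # sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]


rng = random.Random(0)
true_counts = [120, 80, 45, 30]  # made-up amount-bucket counts
for eps in (0.01, 0.1, 1.0):
    noisy = private_histogram(true_counts, eps, rng)
    print(eps, [round(x, 1) for x in noisy])
```

Smaller epsilon values add more noise and so give stronger privacy at the cost of a coarser fingerprint; only the noisy counts would cross the privacy boundary.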
Explore the Research
Deep dives into individual topics from the paper
The Ground Truth Problem in Enterprise Audit Analytics
Why the inverse problem is infeasible and systematic errors propagate through multi-stage processes with 77-95% probability.
How VynFi Generates Statistically Rigorous Financial Data
The three-layer knowledge model, Benford compliance, and calibration against 155 real-world datasets.
130+ Fraud Scenarios: Building Better Fraud Detection Models
Labeled fraud training data with multi-stage schemes and ground-truth labels across five dimensions.
Privacy-Preserving Data Sharing with Differential Privacy Fingerprints
Cross-firm analytics without data exposure using epsilon-differential privacy fingerprints.