Under Peer Review

DataSynth: Reference Knowledge Graphs for Enterprise Audit Analytics through Synthetic Data Generation with Provable Statistical Properties

VynFi Research Team | April 2026

Abstract

Enterprise audit analytics faces a fundamental epistemological challenge: the ground truth of financial data is not directly observable. Recovering truth from enterprise data is a combinatorial inverse problem whose configuration space grows super-exponentially with journal entry complexity, rendering exhaustive or even approximate recovery computationally infeasible. Meanwhile, systematic errors in multi-stage business processes have a 77 to 95 percent probability of passing through downstream controls undetected.

This paper introduces a forward generation paradigm implemented in DataSynth, a Rust-based engine that constructs synthetic financial data with provable statistical properties through a three-layer knowledge model: structural (topology of financial relationships), statistical (empirically calibrated distributions), and normative (accounting rules and business constraints). The engine is calibrated against 155 real-world datasets comprising 364 million journal entries and 2.4 billion line items.

DataSynth achieves Benford Mean Absolute Deviation scores below 0.006, maintains 100 percent balance assertion compliance, generates over 130 anomaly subtypes with ground-truth labels across five knowledge dimensions, and produces over 200,000 journal entries per second. An epsilon-differential privacy fingerprinting module enables statistical calibration against real datasets without exposing individual records.

Note: This paper is currently under peer review. The findings and methodology described here are subject to revision based on the review process. We will update this page when the paper is published.

Key Findings

Quantitative results from the DataSynth research

364M

Journal Entries

Calibrated against 155 real-world datasets comprising 364 million journal entries and 2.4 billion line items across multiple sectors and geographies.

130+

Anomaly Subtypes

Multi-stage fraud schemes, isolated anomalies, and structural errors with ground-truth labels across five knowledge dimensions.

200K+

Entries per Second

Rust-based forward generation engine produces over 200,000 journal entries per second with all knowledge layers active.

< 0.006

Benford MAD Score

Mean Absolute Deviation well below the 0.012 threshold for close conformity. 100% of journal entries pass balance assertions.

3-Layer

Knowledge Model

Structural topology, statistical distributions, and normative accounting rules working together to produce synthetic data with provable statistical properties.

77-95%

Error Propagation

Systematic errors in multi-stage processes survive downstream controls with 77-95% probability, demonstrating why forward generation is necessary.

Core Contributions

1. The Ground Truth Problem

The paper formalizes the infeasibility of recovering ground truth from enterprise financial data. Using Stirling numbers of the second kind, it demonstrates that the configuration space of possible data states reaches 10^155,630 for realistic enterprise complexity, which makes the inverse recovery problem physically impossible to solve by enumeration. This establishes the theoretical necessity of forward generation approaches.
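
To give a feel for how quickly a count built on Stirling numbers of the second kind grows, here is a minimal Python sketch. It assumes the quantity of interest is the number of ways to partition n items into non-empty groups; the paper's exact counting construction is not described on this page, and the function names are illustrative only.

```python
def stirling2_row(n: int) -> list[int]:
    """Return [S(n, 0), ..., S(n, n)], the Stirling numbers of the second kind.

    Uses the recurrence S(n, k) = k * S(n - 1, k) + S(n - 1, k - 1).
    """
    row = [1]  # S(0, 0) = 1
    for i in range(1, n + 1):
        new = [0] * (i + 1)
        for k in range(1, i + 1):
            above = row[k] if k < len(row) else 0  # S(i - 1, k), zero when k = i
            new[k] = k * above + row[k - 1]
        row = new
    return row


def partition_count_digits(n: int) -> int:
    """Decimal digits in the total number of partitions of n items (sum over k)."""
    return len(str(sum(stirling2_row(n))))


if __name__ == "__main__":
    # Illustrative sizes only; these are not the paper's parameters.
    for n in (10, 50, 100):
        print(n, partition_count_digits(n))
```

Ten items already admit 115,975 distinct partitions, and the digit count keeps climbing faster than any fixed exponential as n increases, which is the behavior the infeasibility argument relies on.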

2. Three-Layer Knowledge Model

DataSynth introduces a layered architecture separating structural knowledge (entity topology and relationships), statistical knowledge (empirically calibrated distributions with copula-based dependencies), and normative knowledge (accounting rules and business constraints). This separation enables independent validation of each layer and composable generation of complex financial datasets.
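
The separation of layers can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the account names, the log-normal amount distribution and its parameters, and the function names do not come from DataSynth, whose engine is written in Rust.

```python
import random
from decimal import Decimal

# Structural layer (hypothetical): which accounts may appear together in an entry.
ACCOUNT_PAIRS = [
    ("1200 Receivables", "4000 Revenue"),
    ("5000 Expenses", "2000 Payables"),
]


def sample_amount(rng: random.Random) -> Decimal:
    """Statistical layer (hypothetical): log-normal amount, rounded to cents."""
    amount = Decimal(rng.lognormvariate(6.0, 1.5)).quantize(Decimal("0.01"))
    return max(amount, Decimal("0.01"))


def is_balanced(entry: list[dict]) -> bool:
    """Normative layer: total debits must equal total credits."""
    debits = sum(line["debit"] for line in entry)
    credits = sum(line["credit"] for line in entry)
    return debits == credits


def generate_entry(rng: random.Random) -> list[dict]:
    """Compose the three layers into one two-line journal entry."""
    debit_account, credit_account = rng.choice(ACCOUNT_PAIRS)
    amount = sample_amount(rng)
    entry = [
        {"account": debit_account, "debit": amount, "credit": Decimal("0")},
        {"account": credit_account, "debit": Decimal("0"), "credit": amount},
    ]
    assert is_balanced(entry)  # the normative rule holds by construction
    return entry


if __name__ == "__main__":
    rng = random.Random(42)
    for line in generate_entry(rng):
        print(line)
```

Because each layer is a separate function or table, each can be checked on its own: the account pairs against a chart of accounts, the sampled amounts against a target distribution, and the balance rule against every entry.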

3. Provable Statistical Properties

Generated data achieves Benford MAD scores below 0.006 (well within the forensic accounting threshold of 0.012 for close conformity), maintains 100% double-entry balance compliance, and preserves cross-column correlations within 0.05 of calibration targets. These properties are not aspirational; they are verifiable on every generated dataset.
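
As an illustration of how the first of these properties can be checked on any dataset, the following Python sketch computes a first-digit Benford MAD. It assumes the MAD is taken over the nine leading-digit proportions; this page does not say whether the paper uses first digits or first-two digits, and the function names are illustrative.

```python
import math
from collections import Counter

# Expected first-digit proportions under Benford's law: log10(1 + 1/d).
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x: float) -> int:
    """Leading non-zero digit of x, read from its scientific notation."""
    return int(f"{abs(x):.15e}"[0])


def benford_mad(amounts) -> float:
    """Mean absolute deviation between observed and Benford first-digit shares."""
    digits = [first_digit(a) for a in amounts if a != 0]
    if not digits:
        raise ValueError("need at least one non-zero amount")
    counts = Counter(digits)
    n = len(digits)
    return sum(abs(counts.get(d, 0) / n - BENFORD[d]) for d in range(1, 10)) / 9


if __name__ == "__main__":
    # Amounts spread evenly on a log scale follow Benford's law closely.
    sample = [10 ** (i / 1000) for i in range(5000)]
    print(f"MAD = {benford_mad(sample):.5f}")
```

The same pattern applies to the other two properties: sum debits and credits per entry for the balance check, and compare an empirical correlation matrix against its calibration target for the correlation tolerance.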

4. Ground-Truth Anomaly Framework

Over 130 anomaly subtypes are organized across five knowledge dimensions (temporal, amount, relationship, pattern, structural). Multi-stage fraud schemes are modeled as state machines that evolve across transactions. Every anomalous record carries ground-truth labels for type, severity, difficulty, confidence, and scheme membership.
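
A multi-stage scheme modeled as a state machine, with a ground-truth label attached to each transaction, might look like the Python sketch below. The stage names, field names, and label values are hypothetical; only the five label attributes (type, severity, difficulty, confidence, scheme membership) are taken from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Hypothetical stages of a multi-transaction fraud scheme."""
    SETUP = "setup"
    EXECUTION = "execution"
    CONCEALMENT = "concealment"


# Transition table: each stage leads to the next, and the last one ends the scheme.
NEXT_STAGE = {
    Stage.SETUP: Stage.EXECUTION,
    Stage.EXECUTION: Stage.CONCEALMENT,
    Stage.CONCEALMENT: None,
}


@dataclass(frozen=True)
class AnomalyLabel:
    """Ground-truth label carried by one anomalous record."""
    anomaly_type: str
    severity: str
    difficulty: str
    confidence: float
    scheme_id: str


class FraudScheme:
    """State machine that emits one label per transaction, then advances."""

    def __init__(self, scheme_id: str) -> None:
        self.scheme_id = scheme_id
        self.stage: Optional[Stage] = Stage.SETUP

    def emit(self) -> Optional[AnomalyLabel]:
        if self.stage is None:  # scheme has finished
            return None
        label = AnomalyLabel(
            anomaly_type=f"example_scheme/{self.stage.value}",
            severity="high",
            difficulty="hard",
            confidence=1.0,  # injected anomalies are known with certainty
            scheme_id=self.scheme_id,
        )
        self.stage = NEXT_STAGE[self.stage]
        return label


if __name__ == "__main__":
    scheme = FraudScheme("S-001")
    while (label := scheme.emit()) is not None:
        print(label)
```

Keeping the scheme identifier on every label is what lets a downstream evaluation group individually unremarkable transactions back into the scheme they belong to.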

5. Privacy-Preserving Fingerprints

An epsilon-differential privacy module extracts statistical fingerprints from real datasets without retaining individual records. The clean separation of the privacy boundary from the generation process enables cross-organization analytics without data exposure. Epsilon is configurable from 0.01 to 1.0.
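
One standard way to release a statistical fingerprint under epsilon-differential privacy is the Laplace mechanism applied to a histogram. The Python sketch below shows that technique under stated assumptions: this page does not say which mechanism DataSynth uses, and the bin edges, function names, and the choice of a histogram as the fingerprint are illustrative.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample: the difference of two i.i.d. exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_histogram(values, bin_edges, epsilon: float, rng: random.Random):
    """Noisy bin counts for half-open bins [edge_i, edge_{i+1}).

    Adding or removing one record changes exactly one count by 1, so the
    L1 sensitivity is 1 and the Laplace scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                counts[i] += 1
                break
    return [c + laplace_noise(1 / epsilon, rng) for c in counts]


if __name__ == "__main__":
    rng = random.Random(0)
    amounts = [5, 15, 15, 250, 999]
    edges = [0, 10, 100, 1000]
    for eps in (0.01, 0.1, 1.0):  # the configurable range quoted above
        print(eps, private_histogram(amounts, edges, eps, rng))
```

Smaller epsilon values mean a larger noise scale and stronger privacy, at the cost of a less accurate fingerprint; only the noisy counts would cross the privacy boundary, never the individual amounts.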

More About the Science

Visit the full research page for details on VynFi's statistical methodology, audit methodology benchmarks, anomaly injection framework, ML evaluation results, and privacy-preserving fingerprints.

Try VynFi: the commercial implementation

VynFi brings the DataSynth research engine to production. Generate synthetic financial data with provable statistical properties via a simple API.
