The Science Behind Synthetic Financial Data
An exploration of the methodology, statistical foundations, and practical applications behind VynFi's synthetic data engine, designed for enterprises, fintechs, and researchers.
DataSynth: Reference Knowledge Graphs for Enterprise Audit Analytics through Synthetic Data Generation with Provable Statistical Properties
VynFi Research Team
This paper introduces the forward generation paradigm for enterprise audit analytics. It demonstrates that recovering ground truth from enterprise data is computationally infeasible, and presents a three-layer knowledge model calibrated against 155 real-world datasets (364M journal entries, 2.4B line items) that produces synthetic data with provable statistical properties.
Journal entries calibrated: 364M
Benford MAD score: < 0.006
Entries per second: 200K+
Labeled anomaly subtypes: 130+
Executive Summary
Financial institutions, auditors, and technology companies face an increasingly difficult paradox: they need large volumes of realistic financial data for development, testing, and training, yet regulatory frameworks like GDPR, CCPA, and SOX make it prohibitively risky and expensive to use production data outside of controlled environments. The cost of data breaches in financial services reached a global average of $5.9 million per incident in 2023, reinforcing the urgency of finding alternatives.
Synthetic financial data resolves this tension by generating records that are statistically faithful to real-world patterns while containing zero personally identifiable information. VynFi's DataSynth engine targets over 100,000 rows per second across 8 industry sectors, with built-in validation including Benford's Law compliance, inter-column correlation preservation, and distribution fidelity scoring.
This whitepaper examines the problem landscape, VynFi's technical approach, comparative advantages over existing solutions, and practical use cases across audit training, fintech development, academic research, and compliance validation.
The Problem
Why real financial data falls short for development and testing
Data Privacy
Production financial data contains PII and sensitive account information. Using it in non-production environments creates regulatory exposure and breach risk.
Compliance Burden
Regulations like GDPR, CCPA, and SOX restrict how financial data can be copied, stored, and accessed across teams and geographies.
Cost & Access
Acquiring representative financial datasets is expensive and time-consuming. Licensing fees, legal review, and anonymization overhead slow development cycles.
Limited Scale
Real-world datasets are finite. Stress-testing systems at 10x or 100x production volume is impossible without synthetic generation capabilities.
Our Approach
Four pillars powering the DataSynth engine
Statistical Modeling
DataSynth uses calibrated statistical distributions, Benford's Law compliance, and inter-column correlation matrices to produce data that is structurally indistinguishable from real financial records.
Sector Calibration
Each of VynFi's 8 sector models is tuned against empirical benchmarks from real-world financial data. Retail transactions, banking ledgers, and healthcare billing each have distinct statistical signatures that our engine reproduces faithfully.
Quality Validation
Every generated dataset undergoes automated quality checks: distribution fidelity scoring, correlation preservation tests, anomaly frequency validation, and Benford's Law compliance. Datasets that fail thresholds are rejected and regenerated.
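One of the checks named above, Benford's Law compliance, can be illustrated with a first-digit mean absolute deviation (MAD). The following is a minimal sketch, not DataSynth's actual validator: the function names `first_digit` and `benford_mad` are hypothetical, and it assumes the score is computed on leading digits only.

```python
import math
from collections import Counter
from typing import Iterable, Optional

# Expected first-digit probabilities under Benford's Law: P(d) = log10(1 + 1/d)
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(amount: float) -> Optional[int]:
    """Return the leading non-zero digit of |amount|, or None if there is none."""
    a = abs(amount)
    if a == 0 or not math.isfinite(a):
        return None
    # Scientific notation puts the leading digit first, e.g. '3.000000000000000e-01'.
    return int(f"{a:.15e}"[0])


def benford_mad(amounts: Iterable[float]) -> float:
    """Mean absolute deviation between observed and expected first-digit shares."""
    digits = [d for d in map(first_digit, amounts) if d is not None]
    if not digits:
        raise ValueError("no non-zero amounts to score")
    counts = Counter(digits)
    n = len(digits)
    return sum(abs(counts.get(d, 0) / n - p) for d, p in BENFORD.items()) / 9
```

A lower score means the observed digit shares sit closer to the Benford expectation; a dataset would be compared against whatever threshold the validation step uses.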
Financial Coherence
Every generated dataset passes 32+ internal consistency checks — trial balance proof, FG rollforward, cash flow reconciliation, equity rollforward, segment-to-consolidated reconciliation, and intercompany elimination. Data that fails your audit tests is a bug, not a feature.
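As an illustration of the simplest of these checks, the sketch below verifies that debits equal credits within each journal entry, which is the building block of a trial balance proof. It assumes a hypothetical line layout of `(entry_id, debit, credit)`; the real DataSynth schema is not described here.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, Iterable, Tuple


def unbalanced_entries(lines: Iterable[Tuple[str, str, str]]) -> Dict[str, Decimal]:
    """Return {entry_id: debit minus credit} for every entry that does not net to zero."""
    net: Dict[str, Decimal] = defaultdict(Decimal)  # Decimal() starts at 0
    for entry_id, debit, credit in lines:
        # Decimal avoids the rounding drift that binary floats introduce on currency.
        net[entry_id] += Decimal(debit) - Decimal(credit)
    return {entry_id: diff for entry_id, diff in net.items() if diff != 0}


sample = [
    ("JE1", "100.00", "0"),
    ("JE1", "0", "100.00"),
    ("JE2", "50.00", "0"),
    ("JE2", "0", "49.99"),
]
print(unbalanced_entries(sample))  # {'JE2': Decimal('0.01')}
```

An empty result means every entry balances, so total debits equal total credits across the ledger as well.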
Audit Methodology Benchmarks
Cross-firm comparison and validation across industry-standard audit methodologies.
Big 4 Methodology Coverage
KPMG Clara, PwC Aura, Deloitte Omnia, and EY GAM blueprints with procedure-level comparison across 518 standards.
Blueprint Testing Framework
Automated validation of blueprint completeness, coverage metrics, and step consistency across methodologies.
Progressive Difficulty Benchmarks
Curriculum generation with graduated complexity for auditor training and AI model evaluation.
Process Mining Integration
Synthetic event logs for algorithm benchmarking and tooling evaluation.
Industry-Standard Exports
Disco, Celonis IBC, XES 2.0, and OCEL 2.0 format support for direct compatibility with leading process mining platforms.
Curriculum Generation
Progressive difficulty benchmark suites for process mining education, from simple linear flows to complex parallel and looping processes.
Algorithm Benchmarking
Synthetic event logs with known ground truth for conformance checking evaluation, enabling rigorous measurement of discovery algorithm accuracy.
Anomaly Injection Framework
33 anomaly types across 5 categories with configurable difficulty, severity, and confidence scoring
Timing
7 types: weekend posting, off-hours transactions, holiday entries, backdating, future-dating, period-end clustering, unusual frequency
Amount
8 types: round numbers, just-below-threshold, duplicate amounts, outlier values, Benford violations, split transactions, structuring, unusual ratios
Relationship
6 types: ghost vendors, missing approvals, circular references, orphan entries, mismatched counterparties, self-dealing patterns
Pattern
7 types: duplicate payments, sequential invoices, round-trip flows, gradual increases, clustering behavior, layering sequences, channel switching
Structural
5 types: missing fields, schema violations, referential integrity breaks, encoding anomalies, metadata inconsistencies
Each anomaly is tagged with difficulty (how hard it is to detect), severity (financial impact level), and confidence (certainty the record is truly anomalous) scores for supervised ML training.
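To make the three-score tagging concrete, here is a small sketch of what a label record could look like, with two rule checks drawn from the Timing and Amount categories above. Everything in it is an assumption for illustration: the `AnomalyLabel` fields, the 0-to-1 score scale, the score values, and the `approval_limit` parameter are hypothetical and not taken from VynFi's schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class AnomalyLabel:
    category: str       # one of the five categories, e.g. "Timing"
    anomaly_type: str   # e.g. "weekend_posting"
    difficulty: float   # how hard the anomaly is to detect (assumed 0-1 scale)
    severity: float     # financial impact level (assumed 0-1 scale)
    confidence: float   # certainty the record is truly anomalous (assumed 0-1 scale)


def rule_labels(posting_date: date, amount: float,
                approval_limit: float = 10_000.0) -> List[AnomalyLabel]:
    """Attach illustrative labels for weekend posting and just-below-threshold amounts."""
    labels: List[AnomalyLabel] = []
    if posting_date.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        labels.append(AnomalyLabel("Timing", "weekend_posting", 0.2, 0.3, 0.9))
    if 0.95 * approval_limit <= amount < approval_limit:
        labels.append(AnomalyLabel("Amount", "just_below_threshold", 0.5, 0.6, 0.7))
    return labels
```

Keeping the scores on the label rather than on the transaction lets a supervised model filter or weight training examples, for instance by training only on high-confidence labels.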
Multi-Stage Fraud Schemes
Generate complex, multi-transaction fraud scenarios for advanced detection model training
Vendor Kickback Scheme
Simulates collusion between employees and vendors through inflated invoices, fictitious line items, and split payment patterns that evade single-transaction thresholds.
Gradual Embezzlement
Models slow-burn fraud with progressively increasing misappropriation over months, using account reclassifications and timing manipulation to avoid detection.
Revenue Manipulation
Generates channel-stuffing patterns, premature revenue recognition, and round-tripping transactions designed to inflate reported revenue across periods.
ML Evaluation Results
VynFi synthetic data achieves within 3% of real-data F1 scores across three detection model families
| Model | Type | F1 (Real Data) | F1 (VynFi Data) | Delta |
|---|---|---|---|---|
| Isolation Forest | Unsupervised | 0.82 | 0.80 | -2.4% |
| XGBoost | Supervised | 0.91 | 0.89 | -2.2% |
| GCN (Graph) | Graph Neural Net | 0.88 | 0.86 | -2.3% |
Models were trained on VynFi synthetic data and evaluated on held-out real-world test sets. The results indicate that synthetic-trained models generalize effectively to production data.
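The Delta column is consistent with a relative difference, (synthetic F1 - real F1) / real F1. The sketch below recomputes it from the F1 values in the table; the function name is hypothetical.

```python
def relative_delta_pct(real_f1: float, synthetic_f1: float) -> float:
    """Relative change in F1 when training on synthetic instead of real data, in percent."""
    return (synthetic_f1 - real_f1) / real_f1 * 100


# (F1 on real data, F1 on VynFi data), copied from the table above
results = {
    "Isolation Forest": (0.82, 0.80),
    "XGBoost": (0.91, 0.89),
    "GCN (Graph)": (0.88, 0.86),
}

for model, (real, synthetic) in results.items():
    print(f"{model}: {relative_delta_pct(real, synthetic):+.1f}%")
# Isolation Forest: -2.4%
# XGBoost: -2.2%
# GCN (Graph): -2.3%
```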
Interactive demos on Hugging Face
Four interactive Spaces and a trained model — built on VynFi synthetic data and published under permissive licenses. No API key, no login, just click and explore.
Accounting Network Explorer
Interactive ISO 21378 Level-2 account-class graph from je_network.parquet. Pan, zoom, and click any node to see the underlying journal-entry flows by class.
Data Explorer
Browse VynFi reference datasets in your browser — column profiles, schema view, sample rows, and side-by-side comparison across the published parquet artifacts.
Fraud-GNN Demo
Three tabs in a single demo: an edge fraud predictor, a node anomaly explorer, and a live ROC curve that re-renders as you change the decision threshold.
Process Mining Demo
pm4py directly-follows graph (DFG) over a supply-chain OCEL event log. Filter by activity, see variant frequencies, export the discovered process model.
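For readers unfamiliar with the term, a directly-follows graph counts how often one activity is immediately followed by another within the same case. The sketch below computes it in plain Python rather than with pm4py, and it assumes the OCEL log has already been flattened to `(case_id, timestamp, activity)` tuples for a single object type; that flattening step and the sample activities are assumptions.

```python
from collections import Counter, defaultdict
from typing import Iterable, Tuple

Event = Tuple[str, int, str]  # (case_id, timestamp, activity)


def directly_follows(events: Iterable[Event]) -> Counter:
    """Count (activity, next_activity) pairs over each case's time-ordered trace."""
    traces = defaultdict(list)
    for case_id, _, activity in sorted(events, key=lambda e: (e[0], e[1])):
        traces[case_id].append(activity)
    dfg: Counter = Counter()
    for activities in traces.values():
        # Pair every activity with its immediate successor in the same case.
        dfg.update(zip(activities, activities[1:]))
    return dfg


log = [
    ("order-1", 1, "create order"), ("order-1", 2, "ship"), ("order-1", 3, "invoice"),
    ("order-2", 1, "create order"), ("order-2", 3, "invoice"), ("order-2", 2, "ship"),
]
print(directly_follows(log)[("create order", "ship")])  # 2
```

The edge counts are what a DFG viewer renders as arc weights, and a variant is simply one distinct activity sequence among the traces.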
Trained Models
Pre-trained checkpoints, weights, and inference recipes — drop them straight into a fraud or anomaly pipeline.
VynFi/je-fraud-gnn
GraphSAGE 2-layer journal-entry fraud classifier trained on the vynfi-journal-entries-1m dataset. Test AUC 0.914, F1 0.78 on a held-out manufacturing JE corpus with propagated fraud labels. Loadable directly via Hugging Face transformers or PyTorch Geometric.
GAE node anomaly scorer
Companion graph auto-encoder for unsupervised node-level anomaly scoring on the same accounting-network graph. Useful as a pre-filter before applying the supervised GraphSAGE classifier, or as a standalone anomaly score when ground-truth labels aren't available.
Pre-baked datasets on Hugging Face
Skip the generation step. Seven curated reference datasets ship the latest DataSynth outputs with ISO 21378 fields, fraud propagation labels, and microsecond-precision OCEL timestamps. Load directly via datasets.load_dataset("VynFi/<slug>").
VynFi/vynfi-journal-entries-1m
2.1M JE line items, manufacturing sector, ~7% fraud with propagation labels, 12 periods.
VynFi/vynfi-aml-100k
Banking + AML labels with 0.857 typology coverage, 38× denser network, mule_link / shell_link edges.
VynFi/vynfi-group-audit-enterprise-2000
Audit-ready 100-entity consolidated group dataset under IFRS 3 / 10 / 28 / 21 + ISA 600.
VynFi/vynfi-ocel-manufacturing
OCEL 2.0 manufacturing event log, microsecond timestamps, 162 variants, 55% happy-path concentration.
VynFi/vynfi-audit-p2p
P2P document chain (PO → GR → invoice → payment, 234 docs) with is_fraud_propagated and fraud_source_document_id for end-to-end audit-trail walks.
VynFi/vynfi-supply-chain-ocel
Cross-process mining OCEL event log spanning ordering, fulfilment, and returns with realistic imperfection rates (rework 15% / skip 10% / out-of-order 8%).
VynFi/vynfi-sar-narratives
Suspicious-activity-report narratives for AML model training, paired with banking-flow ground-truth labels.
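Loading any of the datasets listed above follows the `datasets.load_dataset("VynFi/<slug>")` pattern. The helper below is a minimal sketch: the function names are hypothetical, the available splits and configurations are not stated here, and the call needs the Hugging Face `datasets` package plus network access.

```python
def repo_id(slug: str) -> str:
    """Build the Hub repository id for a VynFi reference dataset slug."""
    return f"VynFi/{slug}"


def load_reference(slug: str, **kwargs):
    """Download a reference dataset; extra keyword arguments go to load_dataset."""
    # Imported lazily so the helper module works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id(slug), **kwargs)


# Example (requires `pip install datasets` and network access):
# journal_entries = load_reference("vynfi-journal-entries-1m")
# print(journal_entries)  # lists whichever splits the repository publishes
```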
Privacy-Preserving Fingerprints
4 configurable privacy levels with differential privacy guarantees
Standard (epsilon = 1.0)
Balanced fidelity and privacy. Suitable for most development and testing use cases.
Enhanced (epsilon = 0.5)
Stronger privacy with moderate utility loss. Recommended for sensitive financial domains.
Strict (epsilon = 0.1)
Strong differential privacy. For regulated environments requiring formal privacy guarantees.
Maximum (epsilon = 0.01)
Highest privacy protection. Near-zero re-identification risk with some statistical fidelity trade-off.
Lower epsilon values provide stronger privacy guarantees. VynFi's fingerprint system captures statistical distributions without storing any individual records, and differential privacy noise is applied before fingerprint export.
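The trade-off between epsilon and fidelity can be sketched with the Laplace mechanism, a standard way to make a numeric statistic differentially private. Which mechanism VynFi applies is not stated, so this is an assumption, as are the function names and the sensitivity of 1 (appropriate for a simple count).

```python
import random
from typing import Optional


def noise_scale(epsilon: float, sensitivity: float = 1.0) -> float:
    """Laplace scale b = sensitivity / epsilon; smaller epsilon means more noise."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count: float, epsilon: float,
                  rng: Optional[random.Random] = None) -> float:
    """Return the count with Laplace noise calibrated to epsilon (sensitivity 1)."""
    rng = rng or random.Random()
    return true_count + laplace_noise(noise_scale(epsilon), rng)


for level, eps in [("Standard", 1.0), ("Enhanced", 0.5), ("Strict", 0.1), ("Maximum", 0.01)]:
    print(level, noise_scale(eps))
# Standard 1.0
# Enhanced 2.0
# Strict 10.0
# Maximum 100.0
```

Under these assumptions, moving from the Standard to the Maximum level multiplies the expected noise magnitude on each statistic by 100, which is the fidelity cost of the stronger guarantee.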
How VynFi Compares
A side-by-side evaluation of data sourcing approaches
| Feature | Manual Test Data | Production Copies | Other Synth Tools | VynFi |
|---|---|---|---|---|
| Realism | | | | |
| Privacy | | | | |
| Cost | | | | |
| Scale | | | | |
| Compliance | | | | |
| Speed | | | | |
Use Cases
How organizations leverage VynFi's synthetic data