Methodology
How VynFi generates statistically faithful synthetic financial data
The DataSynth Engine
VynFi is powered by DataSynth, a Rust engine purpose-built for high-throughput synthetic financial data generation.
- **16** Rust crates
- **100K+ rows/sec** throughput
- **Rust** implementation language
- **Proprietary** license
Architecture Layers
API Layer
Axum HTTP server with rate limiting and auth middleware
Orchestration
Job queue management, credit estimation, and webhook dispatch
Generation Core
Schema resolution, distribution sampling, correlation injection
Output Layer
Format serialization (JSON, CSV, Parquet) and compression
Generation Pipeline
Every generation request flows through a 5-step pipeline that transforms schema definitions into statistically faithful datasets.
Schema Selection
Resolves the target sector and table definitions. Loads column schemas, data types, and constraint rules from the catalog registry.
Distribution Sampling
Generates base values using statistical distributions calibrated to real-world financial data. Supports normal, log-normal, Poisson, and custom empirical distributions.
Correlation Injection
Applies cross-column and cross-table correlations. Ensures debits balance credits, foreign keys resolve correctly, and temporal sequences are coherent.
Anomaly Insertion
Optionally injects realistic anomalies (duplicate entries, round-number bias, off-hours transactions) with configurable frequency and labels.
Validation
Runs Benford compliance checks, referential integrity validation, and statistical quality scoring before returning the final dataset.
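The five steps above can be sketched in simplified Python. Every name here (`sample_base_values`, `inject_correlation`, and so on) is illustrative, not the actual DataSynth API, and the real engine runs in Rust:

```python
import random

# Step 1 (schema selection) reduced to a static definition for the sketch.
SCHEMA = {"table": "journal_entries", "columns": ["entry", "leg", "amount"]}

def sample_base_values(n, mu=6.0, sigma=1.2):
    """Step 2: draw base monetary amounts from a log-normal distribution."""
    return [round(random.lognormvariate(mu, sigma), 2) for _ in range(n)]

def inject_correlation(amounts):
    """Step 3: give each entry a balancing debit leg and credit leg."""
    lines = []
    for i, amt in enumerate(amounts):
        lines.append({"entry": i, "leg": "debit", "amount": amt})
        lines.append({"entry": i, "leg": "credit", "amount": -amt})
    return lines

def inject_anomalies(lines, rate=0.02):
    """Step 4: duplicate whole entries (both legs) as labeled anomalies."""
    by_entry = {}
    for line in lines:
        by_entry.setdefault(line["entry"], []).append(line)
    extras = []
    for legs in by_entry.values():
        if random.random() < rate:
            extras += [dict(leg, anomaly="duplicate") for leg in legs]
    return lines + extras

def validate(lines):
    """Step 5: every entry's legs must net to zero (debit-credit balance)."""
    totals = {}
    for line in lines:
        totals[line["entry"]] = totals.get(line["entry"], 0.0) + line["amount"]
    return all(abs(t) < 1e-9 for t in totals.values())
```

Note that the anomaly step duplicates both legs of an entry, so the dataset stays balanced while still carrying labeled duplicates for detection tasks.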
Statistical Models
The generation core uses several statistical models to produce realistic financial data.
Real-World Calibration
VynFi's distributions and statistical parameters are derived from extensive analysis of real-world financial data.
- **155** real-world datasets analyzed across 10 industry sectors for distribution calibration and statistical benchmarking
- **364M** journal entries in the calibration corpus used to derive realistic financial patterns and temporal dynamics
- **2.4B** line items processed to build inter-table correlation models and cross-entity relationship graphs
Copula Families
VynFi uses 5 copula families to model complex dependencies between financial variables. Each family captures different tail dependency and correlation structures.
| Copula Family | Tail Dependency | Use Case |
|---|---|---|
| Gaussian | None (symmetric) | General-purpose modeling of smooth, symmetric correlations between financial variables |
| Clayton | Lower tail | Correlated loss events and downside risk scenarios where defaults tend to cluster |
| Gumbel | Upper tail | Extreme revenue spikes and co-movement in high-value transactions |
| Frank | None (symmetric) | Weak to moderate dependencies without tail concentration; balanced risk profiles |
| Student-t | Both tails | Fat-tailed financial distributions with joint extreme events in stress-testing scenarios |
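As an illustration of how a copula separates dependency structure from marginals, here is a minimal stdlib-Python sketch of the Gaussian family (the simplest of the five). The function name is illustrative, not a VynFi API:

```python
import math
import random

def gaussian_copula_pairs(rho, n, seed=7):
    """Draw n correlated uniform pairs (u, v) via a Gaussian copula.

    rho is the correlation of the underlying latent normal variables.
    """
    rng = random.Random(seed)

    def norm_cdf(x):
        # Standard normal CDF expressed through the error function.
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Condition the second latent normal on the first to induce rho.
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        pairs.append((norm_cdf(z1), norm_cdf(z2)))
    return pairs
```

Each uniform is then pushed through the inverse CDF of the desired marginal (for example a log-normal for transaction amounts) to produce correlated financial variables. The other four families replace only the latent-dependency step, which is how Clayton and Gumbel introduce lower- and upper-tail clustering respectively.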
Coherence Validators
15 coherence validators run on every generated dataset to ensure cross-table consistency and referential integrity.
| # | Validator | Description |
|---|---|---|
| 01 | Debit-Credit Balance | Every journal entry sums to zero across debit and credit legs |
| 02 | Trial Balance | Total debits equal total credits across the general ledger |
| 03 | Foreign Key Integrity | All references resolve to valid records in related tables |
| 04 | Temporal Ordering | Document dates follow logical sequence (PO before GR before Invoice) |
| 05 | Period Boundaries | Entries fall within valid fiscal periods and calendar constraints |
| 06 | Currency Consistency | FX amounts reconcile with exchange rates and base currency |
| 07 | Account Hierarchy | Posted accounts exist in the chart of accounts and follow hierarchy rules |
| 08 | Subledger Reconciliation | AR/AP/FA/INV subledger totals match corresponding GL control accounts |
| 09 | Document Numbering | Sequential document IDs with no gaps or duplicates within each series |
| 10 | Tax Calculation | Tax amounts match rate schedules for the jurisdiction and line items |
| 11 | Intercompany Elimination | IC transaction pairs balance and eliminate correctly in consolidation |
| 12 | Aging Consistency | Receivable/payable aging buckets sum to outstanding balances |
| 13 | Quantity-Value Match | Inventory quantities times unit costs equal total values |
| 14 | Approval Chain | Transactions above threshold have the required authorization records |
| 15 | Entity Cross-Reference | Multi-entity datasets maintain consistent entity identifiers across all tables |
Datasets that fail any validator are automatically rejected and regenerated. Validation results are included in the quality report for every job.
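Two of the validators above are simple enough to sketch directly. These are illustrative stdlib-Python reimplementations of the checks' logic, not the engine's Rust code:

```python
def trial_balance(lines, tol=0.005):
    """Validator 02: total debits must equal total credits across the ledger."""
    debits = sum(l["amount"] for l in lines if l["amount"] > 0)
    credits = -sum(l["amount"] for l in lines if l["amount"] < 0)
    return abs(debits - credits) <= tol

def foreign_key_integrity(lines, chart_of_accounts):
    """Validator 03: every posted account must resolve to a known account.

    Returns the offending lines, so a failed check can drive regeneration.
    """
    valid = set(chart_of_accounts)
    return [l for l in lines if l["account"] not in valid]
```

Returning the offending records rather than a bare boolean mirrors the reject-and-regenerate loop described above: the engine needs to know which rows to resample.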
Benford's Law Compliance
VynFi achieves excellent Benford's Law conformity across all monetary fields, a critical quality metric for financial data realism.
How It Works
The engine calibrates first-digit frequencies to match the expected distribution p(d) = log10(1 + 1/d). During validation, MAD is computed as the average absolute difference between observed and expected first-digit proportions. A MAD below 0.006 is classified as "close conformity" per Nigrini's threshold table. VynFi consistently achieves this benchmark across all sectors and monetary value columns.
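The MAD computation described above can be written out in a few lines of stdlib Python (illustrative code, not the engine's implementation):

```python
import math
import random
from collections import Counter

def first_digit(x):
    """Leading significant digit of a nonzero number."""
    x = abs(x)
    while x < 1:
        x *= 10
    while x >= 10:
        x /= 10
    return int(x)

def benford_mad(values):
    """Mean absolute deviation from Benford's p(d) = log10(1 + 1/d)."""
    expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
    counts = Counter(first_digit(v) for v in values if v != 0)
    n = sum(counts.values())
    return sum(abs(counts.get(d, 0) / n - expected[d]) for d in range(1, 10)) / 9
```

A quick sanity check: amounts whose logarithms are uniformly distributed (e.g. `10 ** random.uniform(0, 4)`) conform to Benford's Law exactly, so their MAD should land well under the 0.006 close-conformity threshold.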
Fingerprint System
VynFi Fingerprints capture the statistical DNA of a real dataset without storing any actual records. Upload your data to create a .dsf fingerprint, then use it to generate unlimited synthetic data that matches your production distributions.
Fingerprint Details
| Attribute | Detail |
|---|---|
| Format | .dsf (DataSynth Fingerprint) |
| Structure | ZIP archive containing schema, distributions, and correlation matrices |
| Encryption | AES-256-GCM with per-fingerprint key wrapping |
| Licensing | Fingerprints are licensed per-organization with usage metering |
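Since a .dsf file is a ZIP archive, its outer structure can be handled with standard tooling. The sketch below is hypothetical: the member name `schema.json` is an assumption for illustration (the real archive layout is not documented here), and real fingerprints are AES-256-GCM encrypted, which this sketch skips entirely:

```python
import io
import json
import zipfile

def read_fingerprint_schema(dsf_bytes):
    """Open a .dsf archive (plain ZIP here) and parse its schema member.

    ASSUMPTION: the member name 'schema.json' is illustrative only;
    decryption of the real format is intentionally omitted.
    """
    with zipfile.ZipFile(io.BytesIO(dsf_bytes)) as zf:
        return json.loads(zf.read("schema.json"))
```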
Quality Evaluation
Every generated dataset is scored across three dimensions to ensure it meets production-grade quality standards.
Fidelity
Measures how closely the synthetic data mirrors the statistical properties of real-world financial data. Evaluated using KS tests, Wasserstein distance, and correlation matrix similarity.
Utility
Assesses whether the synthetic data produces equivalent results when used for downstream tasks (model training, analytics, testing). Measured via train-on-synthetic/test-on-real benchmarks.
Privacy
Verifies that no individual record in the synthetic dataset can be linked back to a real entity. Assessed with membership inference attacks and nearest-neighbor distance ratios.
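The fidelity metrics listed above are standard statistical tests. As one concrete example, the two-sample Kolmogorov-Smirnov statistic (the maximum gap between two empirical CDFs) fits in a few lines of stdlib Python; this is a generic implementation, not VynFi's scoring code:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max vertical gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in set(a) | set(b):
        fa = bisect.bisect_right(a, x) / len(a)
        fb = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d
```

A statistic near 0 means the synthetic and real marginals are nearly indistinguishable; near 1 means they barely overlap.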
Credit Formula
Credits consumed per request are calculated deterministically so you always know the cost before generating.
Formula
credits = rows × base_rate × sector_mult × label_mult

Base Rates
| Data Type | Rate | Unit |
|---|---|---|
| Journal entries | 1 credit | per row |
| Chart of accounts | 0.5 credits | per account |
| Master data | 1 credit | per record |
| Document flow chain | 5 credits | per chain |
| Intercompany matched pairs | 8 credits | per pair |
| Full P2P cycle | 10 credits | per cycle |
| Banking/KYC profile | 3 credits | per customer |
| OCEL 2.0 event log | 2 credits | per event |
| Audit workpaper package | 15 credits | per engagement |
Worked Example
Generate 10,000 journal entries for a curated banking sector pack with anomaly labels:
- rows = 10,000
- base_rate = 1 credit/row (journal entries)
- sector_mult = 1.5× (curated sector pack)
- label_mult = 1.3× (anomaly labels)

credits = 10,000 × 1 × 1.5 × 1.3 = 19,500 credits
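Because the formula is deterministic, cost can be pre-computed client-side. A minimal sketch (the function name is illustrative, not part of any VynFi SDK):

```python
import math

def estimate_credits(rows, base_rate, sector_mult=1.0, label_mult=1.0):
    """credits = rows x base_rate x sector_mult x label_mult."""
    return rows * base_rate * sector_mult * label_mult
```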