
Methodology

How VynFi generates statistically faithful synthetic financial data

The DataSynth Engine

VynFi is powered by DataSynth, a Rust engine purpose-built for high-throughput synthetic financial data generation.

Rust crates: 16
Throughput: 100K+ rows/sec
Language: Rust
License: Proprietary

Architecture Layers

1. API Layer: Axum HTTP server with rate limiting and auth middleware
2. Orchestration: job queue management, credit estimation, and webhook dispatch
3. Generation Core: schema resolution, distribution sampling, and correlation injection
4. Output Layer: format serialization (JSON, CSV, Parquet) and compression

Generation Pipeline

Every generation request flows through a 5-step pipeline that transforms schema definitions into statistically faithful datasets.

1. Schema Selection: resolves the target sector and table definitions. Loads column schemas, data types, and constraint rules from the catalog registry.

2. Distribution Sampling: generates base values using statistical distributions calibrated to real-world financial data. Supports normal, log-normal, Poisson, and custom empirical distributions.

3. Correlation Injection: applies cross-column and cross-table correlations. Ensures debits balance credits, foreign keys resolve correctly, and temporal sequences are coherent.

4. Anomaly Insertion: optionally injects realistic anomalies (duplicate entries, round-number bias, off-hours transactions) with configurable frequency and labels.

5. Validation: runs Benford compliance checks, referential integrity validation, and statistical quality scoring before returning the final dataset.
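Step 2 can be illustrated in a few lines. This is a minimal Python sketch of distribution sampling, not VynFi's Rust implementation; the log-normal parameters here are made-up placeholders, not VynFi's calibrated values.

```python
import numpy as np

def sample_amounts(n, mean_log=6.0, sigma_log=1.2, seed=42):
    """Draw base transaction amounts from a log-normal distribution.

    mean_log and sigma_log are illustrative only; a real engine would
    load calibrated parameters per sector and column.
    """
    rng = np.random.default_rng(seed)
    # Log-normal draws are strictly positive, a natural fit for amounts
    return np.round(rng.lognormal(mean=mean_log, sigma=sigma_log, size=n), 2)

amounts = sample_amounts(1_000)
```

Later pipeline stages would then reshape these base values (correlation injection) and optionally perturb them (anomaly insertion) before validation.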

Statistical Models

The generation core uses several statistical models to produce realistic financial data.

Real-World Calibration

VynFi's distributions and statistical parameters are derived from extensive analysis of real-world financial data.

155 Real-World Datasets: analyzed across 10 industry sectors for distribution calibration and statistical benchmarking.

364M Journal Entries: in the calibration corpus used to derive realistic financial patterns and temporal dynamics.

2.4B Line Items: processed to build inter-table correlation models and cross-entity relationship graphs.

Copula Families

VynFi uses 5 copula families to model complex dependencies between financial variables. Each family captures different tail dependency and correlation structures.

Copula Family | Tail Dependency  | Use Case
Gaussian      | None (symmetric) | General-purpose modeling of smooth, symmetric correlations between financial variables
Clayton       | Lower tail       | Correlated loss events and downside risk scenarios where defaults tend to cluster
Gumbel        | Upper tail       | Extreme revenue spikes and co-movement in high-value transactions
Frank         | None (symmetric) | Weak to moderate dependencies without tail concentration; balanced risk profiles
Student-t     | Both tails       | Fat-tailed financial distributions with joint extreme events in stress-testing scenarios
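The Gaussian family is the simplest of the five to demonstrate. This sketch (Python with NumPy/SciPy, purely illustrative) draws correlated normals, maps them to uniforms via the probability integral transform, and imposes an arbitrary marginal; the correlation value and marginal are invented for the example.

```python
import numpy as np
from scipy import stats

def gaussian_copula_sample(n, rho, seed=0):
    """Draw n pairs of correlated uniforms via a bivariate Gaussian copula."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = np.random.default_rng(seed).multivariate_normal([0.0, 0.0], cov, size=n)
    return stats.norm.cdf(z)  # probability integral transform -> U(0, 1) margins

u = gaussian_copula_sample(50_000, rho=0.7)
# Feed the uniforms through any inverse CDF to impose marginals, e.g.
# correlated log-normal amounts across two columns:
amounts = stats.lognorm.ppf(u, s=1.2)
```

The Clayton, Gumbel, Frank, and Student-t families follow the same pattern but substitute a dependence structure with the tail behavior listed in the table above.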

Coherence Validators

15 coherence validators run on every generated dataset to ensure cross-table consistency and referential integrity.

#  | Validator                | Description
01 | Debit-Credit Balance     | Every journal entry sums to zero across debit and credit legs
02 | Trial Balance            | Total debits equal total credits across the general ledger
03 | Foreign Key Integrity    | All references resolve to valid records in related tables
04 | Temporal Ordering        | Document dates follow logical sequence (PO before GR before Invoice)
05 | Period Boundaries        | Entries fall within valid fiscal periods and calendar constraints
06 | Currency Consistency     | FX amounts reconcile with exchange rates and base currency
07 | Account Hierarchy        | Posted accounts exist in the chart of accounts and follow hierarchy rules
08 | Subledger Reconciliation | AR/AP/FA/INV subledger totals match corresponding GL control accounts
09 | Document Numbering       | Sequential document IDs with no gaps or duplicates within each series
10 | Tax Calculation          | Tax amounts match rate schedules for the jurisdiction and line items
11 | Intercompany Elimination | IC transaction pairs balance and eliminate correctly in consolidation
12 | Aging Consistency        | Receivable/payable aging buckets sum to outstanding balances
13 | Quantity-Value Match     | Inventory quantities times unit costs equal total values
14 | Approval Chain           | Transactions above threshold have the required authorization records
15 | Entity Cross-Reference   | Multi-entity datasets maintain consistent entity identifiers across all tables

Datasets that fail any validator are automatically rejected and regenerated. Validation results are included in the quality report for every job.
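To make the idea concrete, here is a toy version of validator 01 in Python. The record layout (entry ID plus signed amount, debits positive and credits negative) is an assumption for illustration, not VynFi's internal representation.

```python
from collections import defaultdict

def check_debit_credit_balance(lines, tol=0.005):
    """Validator 01 (sketch): every journal entry's lines must sum to zero.

    `lines` is a list of (entry_id, signed_amount) tuples. Returns the
    IDs of entries whose legs do not balance within `tol`.
    """
    totals = defaultdict(float)
    for entry_id, amount in lines:
        totals[entry_id] += amount
    return [entry_id for entry_id, total in totals.items() if abs(total) > tol]

bad = check_debit_credit_balance([
    ("JE-1", 100.00), ("JE-1", -100.00),   # balanced
    ("JE-2", 250.00), ("JE-2", -249.00),   # out of balance by 1.00
])
# bad == ["JE-2"]
```

A production validator would additionally carry currency, tolerance rules per ledger, and line-level diagnostics, but the core check is this aggregation.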

Benford's Law Compliance

VynFi achieves excellent Benford's Law conformity across all monetary fields, a critical quality metric for financial data realism.

Test Results

Mean Absolute Deviation (MAD): < 0.006
Nigrini Classification: Close Conformity
Chi-Squared Test: Pass (p > 0.05)

How It Works

The engine calibrates first-digit frequencies to match the expected distribution p(d) = log10(1 + 1/d). During validation, MAD is computed as the average absolute difference between observed and expected first-digit proportions. A MAD below 0.006 is classified as "close conformity" per Nigrini's threshold table. VynFi consistently achieves this benchmark across all sectors and monetary value columns.
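The MAD computation described above is straightforward to reproduce. This Python sketch (not VynFi's validator) computes Nigrini's MAD statistic; the sample data is generated to span several orders of magnitude, which makes it Benford-conforming by construction.

```python
import numpy as np

# Expected first-digit proportions: p(d) = log10(1 + 1/d)
BENFORD = np.log10(1 + 1 / np.arange(1, 10))

def benford_mad(values):
    """Mean absolute deviation between observed and expected
    first-digit proportions (Nigrini's MAD statistic)."""
    # Scientific-notation formatting puts the first significant digit first
    first_digits = np.array([int(f"{abs(v):e}"[0]) for v in values if v != 0])
    observed = np.bincount(first_digits, minlength=10)[1:] / first_digits.size
    return float(np.abs(observed - BENFORD).mean())

rng = np.random.default_rng(1)
sample = 10 ** rng.uniform(0, 4, size=100_000)  # spans 4 orders of magnitude
mad = benford_mad(sample)  # should land well under the 0.006 threshold
```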

Fingerprint System

VynFi Fingerprints capture the statistical DNA of a real dataset without storing any actual records. Upload your data to create a .dsf fingerprint, then use it to generate unlimited synthetic data that matches your production distributions.

Fingerprint Details

Format: .dsf (DataSynth Fingerprint)
Structure: ZIP archive containing schema, distributions, and correlation matrices
Encryption: AES-256-GCM with per-fingerprint key wrapping
Licensing: Fingerprints are licensed per-organization with usage metering

Quality Evaluation

Every generated dataset is scored across three dimensions to ensure it meets production-grade quality standards.

Statistical Fidelity

Measures how closely the synthetic data mirrors the statistical properties of real-world financial data. Evaluated using KS tests, Wasserstein distance, and correlation matrix similarity.

Functional Utility

Assesses whether the synthetic data produces equivalent results when used for downstream tasks (model training, analytics, testing). Measured via train-on-synthetic/test-on-real benchmarks.

Security & Privacy

Verifies that no individual record in the synthetic dataset can be linked back to a real entity. Uses membership inference attacks and nearest-neighbor distance ratios as empirical privacy tests.
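The fidelity metrics named above are standard and easy to demonstrate. This Python sketch compares two stand-in samples with SciPy's two-sample KS test and Wasserstein distance; the distributions are invented for the example and say nothing about VynFi's actual scores.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
real = rng.lognormal(6.0, 1.2, size=20_000)    # stand-in for real amounts
synth = rng.lognormal(6.0, 1.2, size=20_000)   # stand-in for synthetic amounts

# KS statistic: max gap between the two empirical CDFs (0 = identical)
ks_stat, ks_p = stats.ks_2samp(real, synth)

# Wasserstein (earth-mover's) distance, in the data's own units
w_dist = stats.wasserstein_distance(real, synth)
```

A fidelity score would typically aggregate such per-column metrics together with a comparison of the real and synthetic correlation matrices.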

Credit Formula

Credits consumed per request are calculated deterministically so you always know the cost before generating.

Formula

credits = rows x base_rate x sector_mult x label_mult

Base Rates

Data Type                  | Rate        | Unit
Journal entries            | 1 credit    | per row
Chart of accounts          | 0.5 credits | per account
Master data                | 1 credit    | per record
Document flow chain        | 5 credits   | per chain
Intercompany matched pairs | 8 credits   | per pair
Full P2P cycle             | 10 credits  | per cycle
Banking/KYC profile        | 3 credits   | per customer
OCEL 2.0 event log         | 2 credits   | per event
Audit workpaper package    | 15 credits  | per engagement

Worked Example

Generate 10,000 journal entries for a curated banking sector pack with anomaly labels:

rows        = 10,000
base_rate   = 1 credit/row (journal entries)
sector_mult = 1.5x (curated sector pack)
label_mult  = 1.3x (anomaly labels)

credits = 10,000 × 1 × 1.5 × 1.3 = 19,500 credits
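Because the formula is deterministic, it can be reproduced client-side to estimate cost before submitting a job. A minimal Python sketch (the function name and signature are hypothetical, not part of any VynFi SDK):

```python
def estimate_credits(rows, base_rate, sector_mult=1.0, label_mult=1.0):
    """credits = rows x base_rate x sector_mult x label_mult"""
    return rows * base_rate * sector_mult * label_mult

# The worked example above: 10,000 journal entries (1 credit/row),
# curated sector pack (1.5x), anomaly labels (1.3x)
cost = estimate_credits(10_000, 1, sector_mult=1.5, label_mult=1.3)  # 19500.0
```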