Building Financial AI Models? Here's Your Training Data Pipeline
Synthetic financial data beats anonymized real data for ML training. Benford compliance, balanced entries, ground-truth labels, and unlimited scale via API.
If you are building machine learning models for financial applications (fraud detection, anomaly scoring, transaction classification, revenue forecasting), you have a data problem. Real financial data is sensitive, expensive to license, and almost never comes with ground-truth labels. Anonymized datasets lose the statistical properties your model needs. And hand-curated datasets are too small to train anything meaningful.
This post walks through how to use VynFi as a training data pipeline for financial ML models. We will cover why synthetic data works, what statistical properties matter, and how to generate labeled datasets at the scale your models need.
Why Synthetic Beats Anonymized
Anonymization sounds like it should work: take real data, strip identifying information, and train on the result. In practice, anonymization degrades the exact properties that make financial data useful for training:
- Amount distributions get distorted: Bucketing, rounding, or adding noise to monetary amounts destroys the Benford's Law compliance that genuine transaction populations exhibit. Your model learns on distorted distributions.
- Temporal patterns disappear: Anonymizing dates by shifting or randomizing them eliminates the month-end spikes, quarter-end patterns, and day-of-week effects that are critical features for anomaly detection.
- Relationships break: Anonymizing accounts and counterparties independently breaks the referential integrity between journals, subledgers, and master data. Your model cannot learn cross-table patterns.
- Labels do not exist: The biggest problem. Real financial data almost never comes labeled. You do not know which transactions are fraudulent, which entries are errors, and which patterns are anomalies. Without labels, you are limited to unsupervised methods.
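To make the temporal point concrete, here is a small self-contained sketch (toy data, not VynFi output) showing how a random date shift erases a month-end closing spike:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy ledger: 80% of entries land on the last 3 days of January (a closing spike)
month_end = pd.date_range("2024-01-29", "2024-01-31", freq="D")
mid_month = pd.date_range("2024-01-05", "2024-01-25", freq="D")
dates = pd.Series(
    list(rng.choice(month_end, 800)) + list(rng.choice(mid_month, 200))
)

# "Anonymize" by shifting each date a random number of days
shifted = dates + pd.to_timedelta(rng.integers(-45, 45, size=len(dates)), unit="D")

print("month-end share, original:", dates.dt.is_month_end.mean())
print("month-end share, shifted: ", shifted.dt.is_month_end.mean())
```

The month-end feature that an anomaly detector would rely on is strong in the original series and near-random after shifting, even though each individual record still looks plausible.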
Synthetic data sidesteps all of these problems. The generation engine produces data with correct statistical properties from the start. Relationships are structurally guaranteed. And every anomalous record carries a ground-truth label.
Statistical Properties That Matter
VynFi's DataSynth engine is calibrated to produce data with the statistical properties that financial ML models depend on:
Benford's Law Compliance
The leading-digit distribution of transaction amounts follows Benford's Law. This matters because many fraud detection models use Benford deviation as a primary feature. If your training data does not follow Benford's Law, your model will either miss real anomalies or flag everything.
VynFi generates data that passes Benford's Law chi-squared tests at the 95% confidence level for all non-anomalous records. Injected anomalies intentionally deviate from Benford's Law at configurable severity levels, giving your model clean positive and negative examples.
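The test itself is easy to run on any batch you generate. A self-contained sketch using scipy, applied here to log-uniform amounts (which naturally follow Benford's Law) in place of real generated data:

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(42)

# Amounts spanning several orders of magnitude follow Benford's Law
amounts = 10 ** rng.uniform(0, 5, size=50_000)

# Leading digit via scientific notation: first character of e.g. "3.162278e+02"
digits = np.array([int(f"{a:e}"[0]) for a in amounts])
observed = np.bincount(digits, minlength=10)[1:]

# Benford expected counts: P(d) = log10(1 + 1/d), d = 1..9
expected = np.log10(1 + 1 / np.arange(1, 10)) * len(amounts)

stat, pvalue = chisquare(observed, f_exp=expected)
print(f"chi-squared = {stat:.2f}, p-value = {pvalue:.4f}")
```

A p-value above 0.05 means the observed leading-digit counts are consistent with Benford's Law at the 95% confidence level; running the same test on injected anomalies should produce a p-value near zero.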
Balanced Double-Entry Accounting
Every journal entry set is balanced: total debits equal total credits. This is fundamental to financial data but surprisingly hard to achieve with naive generation approaches. Models trained on unbalanced data learn spurious patterns.
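The invariant is cheap to verify on any batch: group lines by entry and confirm debits net against credits. A minimal pandas sketch (the toy data and column names are illustrative):

```python
import pandas as pd

# Toy journal lines: JE-001 is a two-line entry, JE-002 a three-line entry
lines = pd.DataFrame({
    "entry_id": ["JE-001", "JE-001", "JE-002", "JE-002", "JE-002"],
    "debit":    [500.00, 0.00, 1200.00, 0.00, 0.00],
    "credit":   [0.00, 500.00, 0.00, 700.00, 500.00],
})

totals = lines.groupby("entry_id")[["debit", "credit"]].sum()
imbalance = (totals["debit"] - totals["credit"]).abs()

# Every entry set must net to zero (small tolerance for float rounding)
unbalanced = imbalance[imbalance > 0.005]
print("unbalanced entries:", len(unbalanced))  # 0 for well-formed data
```

Running this check before training is a quick way to catch the unbalanced entries that naive generators (and many GAN-based approaches) tend to produce.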
Temporal Realism
Transaction timestamps reflect real-world patterns: higher volumes on business days, month-end spikes, quarter-end closing entries, and year-end adjustments. Anomalous entries often appear at unusual times (weekends, late at night, just before period close), and VynFi's fraud injection replicates these temporal signatures.
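As an illustration of the idea (a toy sampler, not VynFi's actual generation code), calendar realism can be approximated by weighting each candidate date before sampling:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

days = pd.date_range("2024-01-01", "2024-12-31", freq="D")

# Base weight 1; business days x3; last 3 days of each month x5 on top
weights = np.ones(len(days))
weights[days.dayofweek < 5] *= 3.0
weights[days.day >= days.days_in_month - 2] *= 5.0
weights /= weights.sum()

sampled = pd.Series(rng.choice(days, size=10_000, p=weights))
print("weekend share:  ", (sampled.dt.dayofweek >= 5).mean())
print("month-end share:", sampled.dt.is_month_end.mean())
```

The resulting series shows suppressed weekend volume and a pronounced month-end spike, which is exactly the shape an anomaly detector should see in its non-anomalous training data.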
Generating Labeled Training Data
Here is a complete Python example that generates a labeled fraud detection training set with 100,000 journal entries, a 2% fraud rate across three fraud types, and exports it as a Pandas DataFrame ready for model training:
```python
import numpy as np
import pandas as pd
import vynfi
from sklearn.model_selection import train_test_split

client = vynfi.Client(api_key="vf_live_abc123...")

# Generate labeled training data
job = client.generate(
    config={
        "sector": "banking",
        "tables": [{
            "name": "journal_entries",
            "rows": 100000,
        }],
        "fraudPacks": [
            "round_tripping",
            "ghost_vendors",
            "revenue_inflation",
        ],
        "fraudRate": 0.02,
        "exportFormat": "json",
    }
)
job.wait()

# Convert to DataFrame
df = pd.DataFrame(job.data["journal_entries"])

# Inspect class distribution
print(df["is_anomaly"].value_counts())
# False    98000
# True      2000
print(df[df["is_anomaly"]]["anomaly_type"].value_counts())
# round_tripping       680
# ghost_vendors        660
# revenue_inflation    660

# Feature engineering
df["amount"] = df["debit"] + df["credit"]
df["log_amount"] = np.log1p(df["amount"])
df["day_of_week"] = pd.to_datetime(df["date"]).dt.dayofweek
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
df["is_month_end"] = pd.to_datetime(df["date"]).dt.is_month_end.astype(int)
df["leading_digit"] = df["amount"].apply(
    lambda x: int(str(abs(x)).lstrip("0.")[0]) if x != 0 else 0
)

# Train/test split preserving class balance
X = df[["log_amount", "day_of_week", "is_weekend", "is_month_end", "leading_digit"]]
y = df["is_anomaly"].astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"Training set: {len(X_train)} rows, "
      f"{y_train.sum()} anomalies ({y_train.mean():.1%})")
print(f"Test set: {len(X_test)} rows, "
      f"{y_test.sum()} anomalies ({y_test.mean():.1%})")
```

Batch Generation for Large Datasets
Training serious models often requires millions of rows. VynFi supports async generation for large datasets, and you can run multiple jobs in parallel to build up your training corpus. Here is how to generate 1 million rows across 10 parallel jobs:
```python
import asyncio

import vynfi

client = vynfi.Client(api_key="vf_live_abc123...")

async def generate_batch(batch_id: int, rows: int):
    """Generate a single batch of training data."""
    job = await client.generate_async(
        config={
            "sector": "banking",
            "tables": [{
                "name": "journal_entries",
                "rows": rows,
            }],
            "fraudPacks": [
                "round_tripping",
                "ghost_vendors",
                "revenue_inflation",
            ],
            "fraudRate": 0.02,
            "exportFormat": "parquet",
            "seed": batch_id,  # Reproducible per batch
        }
    )
    await job.wait_async()
    await job.download(f"./training-data/batch_{batch_id:03d}.parquet")
    print(f"Batch {batch_id} complete: {rows} rows")

async def main():
    # 10 parallel batches of 100K rows each = 1M total
    tasks = [generate_batch(i, 100_000) for i in range(10)]
    await asyncio.gather(*tasks)
    print("All batches complete. 1M rows generated.")

asyncio.run(main())
```

Use the `seed` parameter to make each batch reproducible. This is critical for ML experiments: you can regenerate the exact same training data to reproduce results.
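One way to confirm that seeded batches really are byte-for-byte reproducible is to checksum the downloaded files across runs. A small stdlib helper (the paths in the comment are illustrative):

```python
import hashlib

def file_sha256(path: str) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large parquet files don't load into memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# After regenerating batch 0 with the same seed into a second directory:
# assert file_sha256("./training-data/batch_000.parquet") == \
#        file_sha256("./run2/batch_000.parquet")
```

Recording these digests alongside your experiment metadata gives you a cheap audit trail: if a digest ever changes for the same seed and config, you know the training data changed too.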
Controlling Fraud Scenarios
Different models need different training distributions. A general anomaly detector needs a low fraud rate with diverse fraud types. A specialized round-tripping classifier needs heavy representation of that specific pattern. VynFi gives you fine-grained control:
```python
# Scenario 1: General anomaly detection
# Low rate, diverse types, realistic class imbalance
general_config = {
    "fraudPacks": [
        "round_tripping",
        "ghost_vendors",
        "revenue_inflation",
        "expense_manipulation",
        "related_party_hidden",
    ],
    "fraudRate": 0.01,  # 1% - realistic for production data
}

# Scenario 2: Specialized round-tripping detector
# Higher rate of target fraud, fewer distractors
roundtrip_config = {
    "fraudPacks": ["round_tripping"],
    "fraudRate": 0.10,  # 10% for balanced training
}

# Scenario 3: Multi-label classification
# Multiple fraud types can co-occur on the same entity
multilabel_config = {
    "fraudPacks": [
        "revenue_inflation",
        "expense_manipulation",
        "ghost_vendors",
    ],
    "fraudRate": 0.05,
    "fraudOverlap": True,  # Allow multiple labels per entity
}
```

Quality Validation
Before feeding generated data into your training pipeline, validate its statistical properties. VynFi includes quality metrics with every generation job:
```python
# Access quality metrics from a completed job
metrics = job.quality_metrics

print(f"Benford chi-squared p-value: {metrics.benford_pvalue:.4f}")
print(f"Debit/credit balance: {metrics.balance_check}")
print(f"Referential integrity: {metrics.referential_integrity}")
print(f"Null rate: {metrics.null_rate:.4%}")
print(f"Duplicate rate: {metrics.duplicate_rate:.4%}")

# Example output:
# Benford chi-squared p-value: 0.8721
# Debit/credit balance: PASS
# Referential integrity: PASS
# Null rate: 0.0000%
# Duplicate rate: 0.0012%
```

A Benford p-value above 0.05 means the data passes the chi-squared goodness-of-fit test for Benford's Law at the 95% confidence level. VynFi typically achieves p-values above 0.80 for non-anomalous records.
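In an automated pipeline it is worth gating on these metrics so a bad batch fails fast instead of reaching training. A hypothetical gate, written against a plain dict for self-containment (the thresholds are illustrative, not VynFi defaults):

```python
def check_quality(metrics: dict) -> None:
    """Raise ValueError if a generated batch misses minimum quality thresholds."""
    if metrics["benford_pvalue"] < 0.05:
        raise ValueError(f"Benford test failed (p={metrics['benford_pvalue']:.4f})")
    if metrics["balance_check"] != "PASS":
        raise ValueError("unbalanced journal entries detected")
    if metrics["referential_integrity"] != "PASS":
        raise ValueError("broken cross-table relationships detected")
    if metrics["duplicate_rate"] > 0.001:
        raise ValueError(f"duplicate rate too high: {metrics['duplicate_rate']:.4%}")

# Values mirror the example job output above
check_quality({
    "benford_pvalue": 0.8721,
    "balance_check": "PASS",
    "referential_integrity": "PASS",
    "duplicate_rate": 0.000012,
})
print("batch accepted")
```

Raising an exception (rather than logging a warning) is deliberate: a training run on a bad batch wastes far more time than a failed generation step.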
Comparison with Alternatives
Here is how VynFi compares to other approaches for generating financial ML training data:
- GANs/VAEs on real data: Requires access to real data in the first place. Output often fails basic accounting invariants (unbalanced entries). No ground-truth labels.
- Rule-based generators: Produce data that is too clean. Models overfit to the generation rules instead of learning generalizable patterns.
- Public datasets (e.g., IEEE-CIS): Limited to specific domains, fixed size, no configurability. Often over-studied, leading to inflated benchmark scores that do not generalize.
- VynFi: Configurable fraud types and rates, statistical fidelity, ground-truth labels, unlimited scale, and proper accounting invariants. Purpose-built for financial ML.
Getting Started
Install the Python SDK and generate your first training batch in under a minute:
```bash
pip install vynfi

# Set your API key
export VYNFI_API_KEY="vf_live_abc123..."

# Quick test: generate 10K labeled journal entries
python -c "
import vynfi
client = vynfi.Client()
result = client.generate.quick(
    sector='banking',
    tables=[{'name': 'journal_entries', 'rows': 10000}],
    fraud_packs=['round_tripping'],
    fraud_rate=0.02,
    format='json',
)
print(f'Generated {result.metadata.row_count} rows')
print(f'Anomalies: {sum(1 for r in result.data.journal_entries if r.is_anomaly)}')
"
```

The Free tier gives you 10,000 credits per month. For ML workloads that require larger volumes, the Developer tier at $49/month provides 100,000 credits, enough for millions of training rows per month. The Scale tier at $499/month unlocks concurrent batch generation and priority processing for production training pipelines.