Building Financial AI Models? Here's Your Training Data Pipeline
Synthetic financial data beats anonymized real data for ML training. Benford compliance, balanced entries, ground-truth labels, and unlimited scale via API.
If you are building machine learning models for financial applications (fraud detection, anomaly scoring, transaction classification, revenue forecasting), you have a data problem. Real financial data is sensitive, expensive to license, and almost never comes with ground-truth labels. Anonymized datasets lose the statistical properties your model needs. And hand-curated datasets are too small to train anything meaningful.
This post walks through how to use VynFi as a training data pipeline for financial ML models. We will cover why synthetic data works, what statistical properties matter, and how to generate labeled datasets at the scale your models need.
Why Synthetic Beats Anonymized
Anonymization sounds like it should work: take real data, strip identifying information, and train on the result. In practice, anonymization degrades the exact properties that make financial data useful for training:
- Amount distributions get distorted: Bucketing, rounding, or adding noise to monetary amounts destroys the Benford's Law compliance that genuine transaction populations exhibit. Your model learns on distorted distributions.
- Temporal patterns disappear: Anonymizing dates by shifting or randomizing them eliminates the month-end spikes, quarter-end patterns, and day-of-week effects that are critical features for anomaly detection.
- Relationships break: Anonymizing accounts and counterparties independently breaks the referential integrity between journals, subledgers, and master data. Your model cannot learn cross-table patterns.
- Labels do not exist: The biggest problem. Real financial data almost never comes labeled. You do not know which transactions are fraudulent, which entries are errors, and which patterns are anomalies. Without labels, you are limited to unsupervised methods.
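To make the temporal point concrete, here is a small self-contained sketch (toy data, not VynFi output) showing how a random date shift erases a month-end closing spike:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy ledger: 80% of entries land on the last 3 days of January (a closing spike)
month_end = pd.date_range("2024-01-29", "2024-01-31", freq="D")
mid_month = pd.date_range("2024-01-05", "2024-01-25", freq="D")
dates = pd.Series(
    list(rng.choice(month_end, 800)) + list(rng.choice(mid_month, 200))
)

# "Anonymize" by shifting each date a random number of days
shifted = dates + pd.to_timedelta(rng.integers(-45, 45, size=len(dates)), unit="D")

print("month-end share, original:", dates.dt.is_month_end.mean())
print("month-end share, shifted: ", shifted.dt.is_month_end.mean())
```

The month-end feature that an anomaly detector would rely on is strong in the original series and near-random after shifting, even though each individual record still looks plausible.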
Synthetic data sidesteps all of these problems. The generation engine produces data with correct statistical properties from the start. Relationships are structurally guaranteed. And every anomalous record carries a ground-truth label.
Statistical Properties That Matter
VynFi's DataSynth engine is calibrated to produce data with the statistical properties that financial ML models depend on:
Benford's Law Compliance
The leading-digit distribution of transaction amounts follows Benford's Law. This matters because many fraud detection models use Benford deviation as a primary feature. If your training data does not follow Benford's Law, your model will either miss real anomalies or flag everything.
VynFi generates data that passes Benford's Law chi-squared tests at the 95% confidence level for all non-anomalous records. Injected anomalies intentionally deviate from Benford's Law at configurable severity levels, giving your model clean positive and negative examples.
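The test itself is easy to run on any batch you generate. A self-contained sketch using scipy, applied here to log-uniform amounts (which naturally follow Benford's Law) in place of real generated data:

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(42)

# Amounts spanning several orders of magnitude follow Benford's Law
amounts = 10 ** rng.uniform(0, 5, size=50_000)

# Leading digit via scientific notation: first character of e.g. "3.162278e+02"
digits = np.array([int(f"{a:e}"[0]) for a in amounts])
observed = np.bincount(digits, minlength=10)[1:]

# Benford expected counts: P(d) = log10(1 + 1/d), d = 1..9
expected = np.log10(1 + 1 / np.arange(1, 10)) * len(amounts)

stat, pvalue = chisquare(observed, f_exp=expected)
print(f"chi-squared = {stat:.2f}, p-value = {pvalue:.4f}")
```

A p-value above 0.05 means the observed leading-digit counts are consistent with Benford's Law at the 95% confidence level; running the same test on injected anomalies should produce a p-value near zero.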
Balanced Double-Entry Accounting
Every journal entry set is balanced: total debits equal total credits. This is fundamental to financial data but surprisingly hard to achieve with naive generation approaches. Models trained on unbalanced data learn spurious patterns.
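The invariant is cheap to verify on any batch: group lines by entry and confirm debits net against credits. A minimal pandas sketch (the toy data and column names are illustrative):

```python
import pandas as pd

# Toy journal lines: JE-001 is a two-line entry, JE-002 a three-line entry
lines = pd.DataFrame({
    "entry_id": ["JE-001", "JE-001", "JE-002", "JE-002", "JE-002"],
    "debit":    [500.00, 0.00, 1200.00, 0.00, 0.00],
    "credit":   [0.00, 500.00, 0.00, 700.00, 500.00],
})

totals = lines.groupby("entry_id")[["debit", "credit"]].sum()
imbalance = (totals["debit"] - totals["credit"]).abs()

# Every entry set must net to zero (small tolerance for float rounding)
unbalanced = imbalance[imbalance > 0.005]
print("unbalanced entries:", len(unbalanced))  # 0 for well-formed data
```

Running this check before training is a quick way to catch the unbalanced entries that naive generators (and many GAN-based approaches) tend to produce.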
Temporal Realism
Transaction timestamps reflect real-world patterns: higher volumes on business days, month-end spikes, quarter-end closing entries, and year-end adjustments. Anomalous entries often appear at unusual times (weekends, late at night, just before period close), and VynFi's fraud injection replicates these temporal signatures.
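As an illustration of the idea (a toy sampler, not VynFi's actual generation code), calendar realism can be approximated by weighting each candidate date before sampling:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

days = pd.date_range("2024-01-01", "2024-12-31", freq="D")

# Base weight 1; business days x3; last 3 days of each month x5 on top
weights = np.ones(len(days))
weights[days.dayofweek < 5] *= 3.0
weights[days.day >= days.days_in_month - 2] *= 5.0
weights /= weights.sum()

sampled = pd.Series(rng.choice(days, size=10_000, p=weights))
print("weekend share:  ", (sampled.dt.dayofweek >= 5).mean())
print("month-end share:", sampled.dt.is_month_end.mean())
```

The resulting series shows suppressed weekend volume and a pronounced month-end spike, which is exactly the shape an anomaly detector should see in its non-anomalous training data.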
Generating Labeled Training Data
Here is a complete Python example that generates a labeled fraud detection training set with 100,000 journal entries, a 2% fraud rate across three fraud types, and exports it as a Pandas DataFrame ready for model training:
```python
import numpy as np
import pandas as pd
import vynfi
from sklearn.model_selection import train_test_split

client = vynfi.Client(api_key="vf_live_abc123...")

# Generate labeled training data
job = client.generate(
    config={
        "sector": "banking",
        "tables": [{
            "name": "journal_entries",
            "rows": 100000,
        }],
        "fraudPacks": [
            "round_tripping",
            "ghost_vendors",
            "revenue_inflation",
        ],
        "fraudRate": 0.02,
        "exportFormat": "json",
    }
)
job.wait()

# Convert to DataFrame
df = pd.DataFrame(job.data["journal_entries"])

# Inspect class distribution
print(df["is_anomaly"].value_counts())
# False    98000
# True      2000
print(df[df["is_anomaly"]]["anomaly_type"].value_counts())
# round_tripping       680
# ghost_vendors        660
# revenue_inflation    660

# Feature engineering
df["amount"] = df["debit"] + df["credit"]
df["log_amount"] = np.log1p(df["amount"])
df["day_of_week"] = pd.to_datetime(df["date"]).dt.dayofweek
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
df["is_month_end"] = pd.to_datetime(df["date"]).dt.is_month_end.astype(int)
df["leading_digit"] = df["amount"].apply(
    lambda x: int(str(abs(x)).lstrip("0.")[0]) if x != 0 else 0
)

# Train/test split preserving class balance
X = df[["log_amount", "day_of_week", "is_weekend", "is_month_end", "leading_digit"]]
y = df["is_anomaly"].astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"Training set: {len(X_train)} rows, "
      f"{y_train.sum()} anomalies ({y_train.mean():.1%})")
print(f"Test set: {len(X_test)} rows, "
      f"{y_test.sum()} anomalies ({y_test.mean():.1%})")
```

Batch Generation for Large Datasets
Training serious models often requires millions of rows. VynFi supports async generation for large datasets, and you can run multiple jobs in parallel to build up your training corpus. Here is how to generate 1 million rows across 10 parallel jobs:
```python
import asyncio

import vynfi

client = vynfi.Client(api_key="vf_live_abc123...")

async def generate_batch(batch_id: int, rows: int):
    """Generate a single batch of training data."""
    job = await client.generate_async(
        config={
            "sector": "banking",
            "tables": [{
                "name": "journal_entries",
                "rows": rows,
            }],
            "fraudPacks": [
                "round_tripping",
                "ghost_vendors",
                "revenue_inflation",
            ],
            "fraudRate": 0.02,
            "exportFormat": "parquet",
            "seed": batch_id,  # Reproducible per batch
        }
    )
    await job.wait_async()
    await job.download(f"./training-data/batch_{batch_id:03d}.parquet")
    print(f"Batch {batch_id} complete: {rows} rows")

async def main():
    # 10 parallel batches of 100K rows each = 1M total
    tasks = [generate_batch(i, 100_000) for i in range(10)]
    await asyncio.gather(*tasks)
    print("All batches complete. 1M rows generated.")

asyncio.run(main())
```

Use the `seed` parameter to make each batch reproducible. This is critical for ML experiments: you can regenerate the exact same training data to reproduce results.
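One way to confirm that seeded batches really are byte-for-byte reproducible is to checksum the downloaded files across runs. A small stdlib helper (the paths in the comment are illustrative):

```python
import hashlib

def file_sha256(path: str) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large parquet files don't load into memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# After regenerating batch 0 with the same seed into a second directory:
# assert file_sha256("./training-data/batch_000.parquet") == \
#        file_sha256("./run2/batch_000.parquet")
```

Recording these digests alongside your experiment metadata gives you a cheap audit trail: if a digest ever changes for the same seed and config, you know the training data changed too.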
Controlling Fraud Scenarios
Different models need different training distributions. A general anomaly detector needs a low fraud rate with diverse fraud types. A specialized round-tripping classifier needs heavy representation of that specific pattern. VynFi gives you fine-grained control:
```python
# Scenario 1: General anomaly detection
# Low rate, diverse types, realistic class imbalance
general_config = {
    "fraudPacks": [
        "round_tripping",
        "ghost_vendors",
        "revenue_inflation",
        "expense_manipulation",
        "related_party_hidden",
    ],
    "fraudRate": 0.01,  # 1% - realistic for production data
}

# Scenario 2: Specialized round-tripping detector
# Higher rate of target fraud, fewer distractors
roundtrip_config = {
    "fraudPacks": ["round_tripping"],
    "fraudRate": 0.10,  # 10% for balanced training
}

# Scenario 3: Multi-label classification
# Multiple fraud types can co-occur on the same entity
multilabel_config = {
    "fraudPacks": [
        "revenue_inflation",
        "expense_manipulation",
        "ghost_vendors",
    ],
    "fraudRate": 0.05,
    "fraudOverlap": True,  # Allow multiple labels per entity
}
```

Quality Validation
Before feeding generated data into your training pipeline, validate its statistical properties. VynFi includes quality metrics with every generation job:
```python
# Access quality metrics from a completed job
metrics = job.quality_metrics

print(f"Benford chi-squared p-value: {metrics.benford_pvalue:.4f}")
print(f"Debit/credit balance: {metrics.balance_check}")
print(f"Referential integrity: {metrics.referential_integrity}")
print(f"Null rate: {metrics.null_rate:.4%}")
print(f"Duplicate rate: {metrics.duplicate_rate:.4%}")

# Example output:
# Benford chi-squared p-value: 0.8721
# Debit/credit balance: PASS
# Referential integrity: PASS
# Null rate: 0.0000%
# Duplicate rate: 0.0012%
```

A Benford p-value above 0.05 means the data passes the chi-squared goodness-of-fit test for Benford's Law at the 95% confidence level. VynFi typically achieves p-values above 0.80 for non-anomalous records.
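In an automated pipeline it is worth gating on these metrics so a bad batch fails fast instead of reaching training. A hypothetical gate, written against a plain dict for self-containment (the thresholds are illustrative, not VynFi defaults):

```python
def check_quality(metrics: dict) -> None:
    """Raise ValueError if a generated batch misses minimum quality thresholds."""
    if metrics["benford_pvalue"] < 0.05:
        raise ValueError(f"Benford test failed (p={metrics['benford_pvalue']:.4f})")
    if metrics["balance_check"] != "PASS":
        raise ValueError("unbalanced journal entries detected")
    if metrics["referential_integrity"] != "PASS":
        raise ValueError("broken cross-table relationships detected")
    if metrics["duplicate_rate"] > 0.001:
        raise ValueError(f"duplicate rate too high: {metrics['duplicate_rate']:.4%}")

# Values mirror the example job output above
check_quality({
    "benford_pvalue": 0.8721,
    "balance_check": "PASS",
    "referential_integrity": "PASS",
    "duplicate_rate": 0.000012,
})
print("batch accepted")
```

Raising an exception (rather than logging a warning) is deliberate: a training run on a bad batch wastes far more time than a failed generation step.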
Comparison with Alternatives
Here is how VynFi compares to other approaches for generating financial ML training data:
- GANs/VAEs on real data: Requires access to real data in the first place. Output often fails basic accounting invariants (unbalanced entries). No ground-truth labels.
- Rule-based generators: Produce data that is too clean. Models overfit to the generation rules instead of learning generalizable patterns.
- Public datasets (e.g., IEEE-CIS): Limited to specific domains, fixed size, no configurability. Often over-studied, leading to inflated benchmark scores that do not generalize.
- VynFi: Configurable fraud types and rates, statistical fidelity, ground-truth labels, unlimited scale, and proper accounting invariants. Purpose-built for financial ML.
Getting Started
Install the Python SDK and generate your first training batch in under a minute:
```bash
pip install vynfi

# Set your API key
export VYNFI_API_KEY="vf_live_abc123..."

# Quick test: generate 10K labeled journal entries
python -c "
import vynfi
client = vynfi.Client()
result = client.generate.quick(
    sector='banking',
    tables=[{'name': 'journal_entries', 'rows': 10000}],
    fraud_packs=['round_tripping'],
    fraud_rate=0.02,
    format='json',
)
print(f'Generated {result.metadata.row_count} rows')
print(f'Anomalies: {sum(1 for r in result.data.journal_entries if r.is_anomaly)}')
"
```

The Free tier gives you 10,000 credits per month. For ML workloads that require larger volumes, the Developer tier at $49/month provides 100,000 credits, enough for millions of training rows per month. The Scale tier at $499/month unlocks concurrent batch generation and priority processing for production training pipelines.