130+ Fraud Scenarios: Building Better Fraud Detection Models
How VynFi generates labeled fraud training data with 130+ anomaly subtypes, multi-stage fraud schemes, and ground-truth labels across all five knowledge dimensions.
If you are building a fraud detection model, you have a data problem. Not a data volume problem. A data label problem. Real fraud is rare (typically 0.1 to 2 percent of transactions), unevenly distributed, and almost never comprehensively labeled. The fraud that gets caught is labeled after the fact. The fraud that does not get caught is not labeled at all. And the fraud that was prevented never appears in the data.
This creates a fundamental training data gap. You cannot build a reliable classifier on a dataset where most of the positive class is missing and the labels you do have are biased toward the easiest-to-detect schemes. The result is models that catch the fraud patterns your existing controls already catch, while missing the sophisticated schemes that actually matter.
This post draws on the anomaly injection framework described in "DataSynth: Reference Knowledge Graphs for Enterprise Audit Analytics through Synthetic Data Generation with Provable Statistical Properties" by the VynFi research team (April 2026, under peer review).
The Labeled Data Scarcity Problem
Consider what a supervised fraud detection model needs to train effectively: a large corpus of transactions with binary or multi-class labels indicating whether each transaction is fraudulent and, ideally, what type of fraud it represents. In practice, here is what you typically have:
- Class imbalance: Fraud typically represents less than 1% of transactions. Trained naively, standard classifiers default to the majority class and miss most of the fraud.
- Incomplete labels: Only detected fraud is labeled. Undetected fraud sits in your training set labeled as legitimate, teaching your model that those patterns are normal.
- Label latency: Fraud is often discovered months or years after it occurs. Your most recent data has the fewest labels.
- Label bias: Detection bias means your labels over-represent simple schemes (duplicate payments, round-number fraud) and under-represent complex schemes (vendor kickbacks, revenue manipulation).
- Privacy restrictions: You cannot share labeled fraud data across organizations because it contains sensitive financial information about real entities.
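The class-imbalance point is easy to see with a toy calculation. A sketch below (synthetic numbers, not real data) shows the accuracy paradox: at a 0.5% fraud rate, a baseline model that never flags anything still scores above 99% accuracy while catching zero fraud.

```python
# Illustration of the class-imbalance problem: with a 0.5% fraud rate,
# a classifier that never flags anything still looks highly "accurate".
import random

random.seed(42)
N = 100_000
FRAUD_RATE = 0.005

labels = [1 if random.random() < FRAUD_RATE else 0 for _ in range(N)]
predictions = [0] * N  # the "always legitimate" baseline model

accuracy = sum(p == y for p, y in zip(predictions, labels)) / N
recall = sum(p == 1 and y == 1 for p, y in zip(predictions, labels)) / max(sum(labels), 1)

print(f"accuracy: {accuracy:.3f}")    # high, despite being useless
print(f"fraud recall: {recall:.3f}")  # 0.0 -- no fraud caught
```

This is why raw accuracy is a meaningless metric here, and why recall and precision on the fraud class are the numbers that matter.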
VynFi's Anomaly Injection Framework
VynFi's DataSynth engine takes a fundamentally different approach to fraud data. Instead of relying on historical labels, it generates synthetic transactions with fraud injected by construction. Every anomalous record has a ground-truth label because the engine created the anomaly deliberately, with full knowledge of what it is, how severe it is, and how it relates to surrounding transactions.
The framework includes over 130 anomaly subtypes organized across five knowledge dimensions:
- Temporal anomalies: Weekend postings, off-hours entries, holiday transactions, backdating, future-dating, period-end clustering, unusual transaction frequencies.
- Amount anomalies: Round numbers, just-below-threshold values, Benford violations, duplicate amounts, outlier magnitudes, split transactions designed to evade controls, structuring patterns.
- Relationship anomalies: Ghost vendors, missing approval chains, circular transaction references, orphan entries with no matching counterparty, self-dealing patterns between related entities.
- Pattern anomalies: Duplicate payments, sequential invoice numbers from different vendors, round-trip payment flows, gradual escalation patterns, clustering behaviors, layering sequences.
- Structural anomalies: Missing required fields, schema violations, referential integrity breaks, encoding inconsistencies, metadata that contradicts transaction content.
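To make "fraud injected by construction" concrete, here is a minimal sketch of one amount subtype, a just-below-threshold payment, with its ground-truth label attached at creation time. The function name, label schema, and threshold are illustrative, not the actual DataSynth implementation.

```python
# Minimal sketch of anomaly injection by construction: rewrite a
# transaction's amount to sit just under an approval threshold and
# attach a ground-truth label describing exactly what was injected.
import random

APPROVAL_THRESHOLD = 10_000.00

def inject_below_threshold(txn: dict, rng: random.Random) -> dict:
    """Return a copy of txn with a just-below-threshold amount and a label."""
    anomalous = dict(txn)
    anomalous["amount"] = round(APPROVAL_THRESHOLD - rng.uniform(1, 250), 2)
    anomalous["label"] = {
        "anomaly_type": "amount_just_below_threshold",
        "severity": "medium",
        "difficulty": "easy",
        "confidence": 0.9,  # such amounts can occur legitimately, hence < 1.0
    }
    return anomalous

rng = random.Random(7)
txn = {"id": "T-0001", "vendor": "V-042", "amount": 4_310.55}
flagged = inject_below_threshold(txn, rng)
print(flagged["amount"], flagged["label"]["anomaly_type"])
```

Because the label is written at the moment the anomaly is created, there is no detection bias: every injected record is known, regardless of how subtle it is.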
Multi-Stage Fraud Schemes
Simple anomaly injection produces isolated suspicious transactions. Real fraud is different. It evolves over time through coordinated multi-stage schemes where individual transactions may appear normal but the pattern across transactions reveals the scheme.
VynFi models complex fraud schemes as state machines that progress through defined stages. Each stage generates transactions that are individually plausible but collectively form a detectable pattern. Here are three examples:
Vendor Kickback Scheme
A vendor kickback scheme in VynFi progresses through stages: vendor onboarding with minimal due diligence, initial legitimate transactions to establish a baseline, gradual introduction of inflated invoices or fictitious line items, split payments to keep individual amounts below approval thresholds, and periodic payments to an employee's related entity. The engine generates the full transaction chain with timestamps, amounts, and entity relationships that mirror how real kickback schemes operate.
Ghost Employee Scheme
Ghost employee fraud involves creating fictitious employees on the payroll. VynFi models the lifecycle: creation of an employee record with plausible but fabricated details, bank account setup (often sharing attributes with an existing employee), regular payroll disbursements, absence of typical employee activity like expense reports or time entries, and eventual escalation as the perpetrator adds more ghosts or increases pay rates.
Revenue Manipulation Scheme
Revenue manipulation schemes involve premature recognition, channel stuffing, or round-trip transactions. VynFi generates these as multi-quarter patterns: Q4 revenue spikes from channel stuffing followed by Q1 returns, bill-and-hold arrangements with unusual terms, side agreements that alter the substance of transactions, and progressive escalation as targets increase year over year.
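The state-machine idea behind these schemes can be sketched in a few lines. The stage names below follow the vendor kickback description above; the class, fields, and escalation rule are illustrative assumptions, not the engine's internals.

```python
# Sketch of a multi-stage fraud scheme as a state machine: each stage
# emits individually plausible transactions, and scheme_id/sequence
# fields tie them together into a detectable pattern.
from dataclasses import dataclass, field
from itertools import count

STAGES = [
    "onboarding",
    "baseline_transactions",
    "inflated_invoices",
    "split_payments",
    "kickback_payments",
]

@dataclass
class KickbackScheme:
    scheme_id: str
    stage_index: int = 0
    records: list = field(default_factory=list)
    _seq: count = field(default_factory=count)

    def advance(self, n_txns: int, base_amount: float) -> None:
        """Emit n_txns transactions for the current stage, then move on."""
        stage = STAGES[self.stage_index]
        inflation = 1.0 + 0.15 * self.stage_index  # escalation over stages
        for _ in range(n_txns):
            self.records.append({
                "scheme_id": self.scheme_id,  # scheme-membership label
                "sequence": next(self._seq),  # position in the scheme
                "stage": stage,
                "amount": round(base_amount * inflation, 2),
            })
        self.stage_index = min(self.stage_index + 1, len(STAGES) - 1)

scheme = KickbackScheme(scheme_id="S-001")
for stage_size in (1, 5, 4, 6, 3):
    scheme.advance(stage_size, base_amount=2_000.0)

print(len(scheme.records), scheme.records[-1]["stage"])
```

No single record here looks alarming; the signal lives in the stage progression and escalating amounts across the `scheme_id` group, which is exactly what sequence-aware detectors need to learn.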
Ground-Truth Labels Across Five Dimensions
Every anomalous record VynFi generates carries labels across five dimensions, giving you rich training signal beyond a simple fraud/not-fraud binary:
- Anomaly type: The specific subtype (e.g., 'vendor_kickback_inflated_invoice' or 'ghost_employee_payroll_disbursement').
- Severity: Financial impact level (low, medium, high, critical) based on the magnitude of the fraudulent amount relative to normal activity.
- Difficulty: How hard the anomaly is to detect (easy, moderate, hard, expert). Easy anomalies violate obvious rules. Expert-level anomalies require cross-referencing multiple data sources.
- Confidence: The certainty that the record is truly anomalous (0.0 to 1.0). Some injected anomalies mimic patterns that could also occur legitimately, and the confidence score reflects this ambiguity.
- Scheme membership: For multi-stage fraud, which scheme the record belongs to and its position in the sequence. This lets you train models that detect fraud patterns, not just individual suspicious transactions.
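One plausible way to consume these five dimensions during training is to use the anomaly type as the class label and fold severity and confidence into a sample weight. The record layout and weight table below are assumptions for illustration; only the dimension names come from the framework.

```python
# Turning the five label dimensions into training targets: class label
# plus a sample weight that emphasizes severe, high-confidence anomalies.
record = {
    "anomaly_type": "ghost_employee_payroll_disbursement",
    "severity": "high",
    "difficulty": "hard",
    "confidence": 0.85,
    "scheme_id": "S-114",
    "scheme_position": 7,
}

SEVERITY_WEIGHT = {"low": 0.5, "medium": 1.0, "high": 2.0, "critical": 4.0}

def to_training_target(rec: dict) -> tuple[str, float]:
    """Map a labeled record to (class, sample_weight)."""
    weight = SEVERITY_WEIGHT[rec["severity"]] * rec["confidence"]
    return rec["anomaly_type"], weight

cls, w = to_training_target(record)
print(cls, w)
```

Down-weighting low-confidence labels this way keeps ambiguous injected anomalies from dominating the loss the same way unambiguous ones do.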
Generating a Training Set with Specific Fraud Profiles
Here is how you would generate a fraud detection training dataset using the VynFi API. This example creates a dataset with a 2% overall fraud rate, weighted toward vendor-related and revenue manipulation schemes:
curl -X POST https://api.vynfi.com/v1/generate \
  -H "Authorization: Bearer vf_live_abc123..." \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "sector": "manufacturing",
      "rows": 100000,
      "companies": 3,
      "periods": 12,
      "fraudPacks": [
        "vendor_kickback",
        "ghost_employee",
        "revenue_manipulation",
        "duplicate_payment"
      ],
      "fraudRate": 0.02,
      "fraudWeights": {
        "vendor_kickback": 0.35,
        "ghost_employee": 0.15,
        "revenue_manipulation": 0.35,
        "duplicate_payment": 0.15
      },
      "multiStageFraud": true,
      "exportFormat": "parquet",
      "includeLabels": true
    }
  }'

The response includes a dataset where 2% of records are anomalous, distributed according to your specified weights. Multi-stage fraud schemes span multiple records linked by scheme identifiers. Labels are included as additional columns in the output, so you can use the dataset for supervised learning without any preprocessing.
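Before submitting a request like the curl example above, it is worth sanity-checking the config: the fraud weights should sum to 1.0, and together with `fraudRate` they imply concrete per-scheme row counts. The snippet below does that check locally (the config keys mirror the request body; nothing here calls the API).

```python
# Sanity-check a generation config: weights must sum to 1.0, and
# rows * fraudRate * weight gives the expected rows per scheme.
import json
import math

config = {
    "sector": "manufacturing",
    "rows": 100_000,
    "fraudRate": 0.02,
    "fraudWeights": {
        "vendor_kickback": 0.35,
        "ghost_employee": 0.15,
        "revenue_manipulation": 0.35,
        "duplicate_payment": 0.15,
    },
}

assert math.isclose(sum(config["fraudWeights"].values()), 1.0)

total_fraud = round(config["rows"] * config["fraudRate"])  # 2,000 anomalous rows
per_scheme = {k: round(total_fraud * w) for k, w in config["fraudWeights"].items()}

print(total_fraud)
print(json.dumps(per_scheme, indent=2))
```

Catching a weight typo locally is cheaper than discovering a skewed scheme mix after generating and labeling 100,000 rows.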
Practical Tips for Model Training
- Start with a high fraud rate (5-10%) for initial model development, then reduce to realistic rates (0.5-2%) for final evaluation.
- Use the difficulty dimension to create progressive training curricula: train on easy anomalies first, then fine-tune on harder ones.
- Generate multiple datasets with different fraud mixes to evaluate model robustness across scheme types.
- Use scheme membership labels to train sequence models (LSTMs, Transformers) that detect fraud patterns across multiple transactions.
- Combine VynFi data with your real (limited) labeled data for transfer learning approaches.
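The difficulty-based curriculum in the tips above amounts to ordering training batches from easy to expert, so the model learns obvious rule violations before subtle cross-source ones. A minimal sketch, assuming each record carries the difficulty label dimension (the record layout is illustrative):

```python
# Difficulty-based curriculum: sort labeled records so training sees
# easy anomalies first and expert-level anomalies last.
DIFFICULTY_ORDER = {"easy": 0, "moderate": 1, "hard": 2, "expert": 3}

records = [
    {"id": 1, "difficulty": "expert"},
    {"id": 2, "difficulty": "easy"},
    {"id": 3, "difficulty": "hard"},
    {"id": 4, "difficulty": "easy"},
    {"id": 5, "difficulty": "moderate"},
]

# Python's sort is stable, so records within a difficulty keep their order.
curriculum = sorted(records, key=lambda r: DIFFICULTY_ORDER[r["difficulty"]])
print([r["id"] for r in curriculum])  # [2, 4, 5, 3, 1]
```

In practice you would sort (or stage) whole datasets rather than individual records, training to convergence on each difficulty tier before fine-tuning on the next.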
VynFi's Free tier (10,000 credits/month) is enough to generate several training datasets for experimentation. For production model training at scale, the Team and Scale tiers provide the volume needed for robust training and evaluation.