fraud-detection · behavioral-fidelity · synthetic-data · benchmarks · machine-learning

Synthetic data can pass every quality test and still break your fraud detector

A third-party benchmark (Sajja, 2026) shows generators with near-perfect statistical fidelity destroy the temporal, velocity, and graph signals fraud detection actually relies on. Why it happens — and why forward generation is the fix.

VynFi Research · Founder & Lead Researcher · June 7, 2026 · 9 min read

You validated your synthetic data. Benford's law looked clean, the pairwise correlations matched, and a classifier trained on the synthetic data scored a downstream AUROC close to the real-data baseline. So you trained a fraud detector on it, tuned the thresholds, and shipped. In production it fires at the wrong rate on real fraud. The data passed every test you ran — the tests were measuring the wrong thing.

This post is grounded in an independent benchmark — Bhavana Sajja, "Synthetic Tabular Generators Fail to Preserve Behavioral Fraud Patterns: A Benchmark on Temporal, Velocity, and Multi-Account Signals," arXiv:2604.13125 (April 2026). It is third-party work, not VynFi's, and its evaluation framework is open source.

Two dimensions everyone measures — and one nobody does

Synthetic-data evaluation today measures two things. Statistical fidelity: do the marginal distributions and pairwise correlations match the real data? And downstream utility: does a model trained on synthetic data generalize to real data — the train-on-synthetic, test-on-real (TSTR) protocol? Both are necessary. Both are insufficient for fraud.

Fraud detection is a behavioral problem. Production systems do not flag a transaction because its amount is unusual in isolation — they flag sequences. A card running three transactions in sixty seconds. A cluster of accounts registered within hours that share the same device or IP. An amount that is ten times a customer's thirty-day median. These temporal, velocity, and graph signals are the operational basis of fraud detection. Neither a Benford check nor a TSTR AUROC tests whether your synthetic data preserves any of them.
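As a rough illustration of the signals described above, the sketch below computes two per-card features: how many transactions fall in a trailing sixty-second window, and how each amount compares with the card's trailing thirty-day median. The function names, the closed window convention, and the sample data are assumptions made for illustration; real production rules will differ.

```python
from bisect import bisect_left
from statistics import median


def trailing_counts(timestamps, window_s=60):
    """For each event, count this card's events in the closed window [t - window_s, t].

    `timestamps` must be sorted ascending (seconds). Hypothetical helper.
    """
    counts = []
    for i, t in enumerate(timestamps):
        first = bisect_left(timestamps, t - window_s, 0, i + 1)
        counts.append(i - first + 1)
    return counts


def amount_vs_trailing_median(timestamps, amounts, lookback_s=30 * 86400):
    """Ratio of each amount to the median of the card's earlier amounts in the lookback.

    Returns None where there is no usable history. Hypothetical helper.
    """
    ratios = []
    for i, t in enumerate(timestamps):
        first = bisect_left(timestamps, t - lookback_s, 0, i)
        history = amounts[first:i]  # strictly earlier events only
        base = median(history) if history else 0
        ratios.append(amounts[i] / base if base > 0 else None)
    return ratios


# One card's history: three transactions inside a minute, then a quiet period.
print(trailing_counts([0, 20, 50, 4000]))  # [1, 2, 3, 1]

day = 86400
print(amount_vs_trailing_median([0, day, 2 * day, 3 * day], [20, 30, 25, 250]))
# [None, 1.5, 1.0, 10.0]
```

Both features depend on the order and spacing of events within one card, which is exactly the information a per-row check such as Benford conformity never looks at.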

The benchmark: high fidelity, catastrophic behavioral loss

Sajja formalizes a taxonomy of four behavioral fraud patterns — P1 inter-event timing, P2 burst structure and active lifetime, P3 multi-account shared-infrastructure graph motifs, and P4 velocity-rule trigger rates — and a degradation ratio metric anchored to the real-data noise floor (1.0 means the synthetic data matches real variability; k means k-times worse). Four generators are benchmarked — CTGAN, TVAE, GaussianCopula, and TabularARGN — on the IEEE-CIS Fraud and Amazon Fraud datasets.
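To make the idea of a noise-floor-anchored ratio concrete, here is a minimal sketch under stated assumptions. It is not the paper's exact formula: it assumes the noise floor is the distance between two disjoint real-data splits, and it uses a simple quantile-grid distance that was chosen here for illustration. The names `quantile_distance` and `degradation_ratio` are hypothetical.

```python
from statistics import quantiles


def quantile_distance(a, b, n=20):
    """Mean absolute difference between the n-quantile cut points of two samples.

    An illustrative distance, assumed for this sketch only.
    """
    qa, qb = quantiles(a, n=n), quantiles(b, n=n)
    return sum(abs(x - y) for x, y in zip(qa, qb)) / len(qa)


def degradation_ratio(real_a, real_b, synthetic, n=20):
    """Distance real-vs-synthetic divided by distance real-vs-real (the noise floor).

    1.0 means the synthetic sample is as far from real data as another real
    split is; k means k times farther.
    """
    noise_floor = quantile_distance(real_a, real_b, n)
    if noise_floor == 0:
        raise ValueError("real splits are identical; noise floor is zero")
    return quantile_distance(real_a, synthetic, n) / noise_floor


# Toy inter-event gaps (seconds): a second real split shifted by 1, synthetic shifted by 10.
real_a = list(range(1, 101))
real_b = [x + 1 for x in real_a]
synthetic = [x + 10 for x in real_a]
print(round(degradation_ratio(real_a, real_b, synthetic), 6))  # 10.0
```

The useful property is the anchoring: a ratio near 1.0 says the synthetic data is no further from real data than real data is from itself, so larger values can be read directly as multiples of ordinary sampling noise.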

The headline result: CTGAN scores the second-highest TSTR AUROC (0.798, near the 0.903 real-data baseline) yet the worst P3 graph-motif degradation in the benchmark — 99.7×. GaussianCopula has the lowest TSTR AUROC (0.523) but a better P3 score (81.6×). No statistical-fidelity or downstream-utility score predicts behavioral fidelity in any consistent direction.

The pattern holds across datasets. On IEEE-CIS (P1, P2, P4), composite degradation ratios range from 24.4× (TVAE, after a conditional-sampling correction) to 39.0× (GaussianCopula), so all four generators fail severely relative to the real-data noise floor. On the Amazon Fraud dataset (P3 graph motifs), the row-independent generators land at 81.6×–99.7×; only TabularARGN's autoregressive architecture does better, and even then only to 17.2×.

Crucially, the paper proves this is not a tuning problem. Row-independent generators, the dominant paradigm, are shown to be structurally incapable of reproducing P3 graph motifs (Proposition 1), and to produce non-positive within-entity inter-event-time autocorrelation (Proposition 2). For any generator that samples rows independently, the positive burst fingerprint of a fraud sequence is unachievable regardless of network architecture, training-data size, or post-processing.
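The statistic behind Proposition 2 can be illustrated with a small experiment. The sketch below is not the paper's proof or its evaluation code: it builds one hand-made bursty sequence, then draws the same number of timestamps independently and uniformly over the same span (an assumed stand-in for what independent row sampling does to within-entity timing), and compares the lag-1 autocorrelation of the inter-event gaps.

```python
import random
from itertools import accumulate


def lag1_autocorr(xs):
    """Lag-1 autocorrelation of a sequence (0.0 if the sequence is constant)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


def gaps(timestamps):
    """Inter-event times of one entity, after sorting its timestamps."""
    ordered = sorted(timestamps)
    return [b - a for a, b in zip(ordered, ordered[1:])]


# Bursty entity: runs of 10-second gaps alternating with runs of one-hour gaps.
bursty_gaps = ([10] * 5 + [3600] * 5) * 20
bursty_times = [0] + list(accumulate(bursty_gaps))

# Same number of events, but each timestamp drawn independently (assumption).
rng = random.Random(0)
span = bursty_times[-1]
independent_times = [rng.uniform(0, span) for _ in bursty_times]

bursty_r = lag1_autocorr(gaps(bursty_times))
independent_r = lag1_autocorr(gaps(independent_times))
print(f"bursty: {bursty_r:.3f}  independent: {independent_r:.3f}")
```

In the bursty sequence short gaps tend to follow short gaps, so the autocorrelation is clearly positive (0.605 for this construction). With independently drawn timestamps that memory is gone and the value sits near zero.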

Why this breaks your detector in production

The consequence is operational, not academic. Take a velocity rule that flags any card with more than three transactions in one hour. In CTGAN synthetic data, the paper measures that rule firing at an absolute rate 0.36 points lower than in real fraud data. A threshold tuned to minimize false positives on the synthetic data is therefore miscalibrated when deployed against real fraud: the rule will trigger at materially higher rates in production than you planned for. The same disconnect applies to graph-based detectors trained on synthetic device-and-IP structure that bears no relationship to real fraud-ring topology.
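The velocity-rule comparison described above can be sketched as follows. This assumes the trigger rate is measured per card (the fraction of cards that ever trip the rule) and that the one-hour window is inclusive; the paper may define either differently. The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def rule_trigger_rate(transactions, max_txns=3, window_s=3600):
    """Fraction of cards with more than `max_txns` transactions in any `window_s` window.

    `transactions` is an iterable of (card_id, timestamp_seconds) pairs.
    """
    by_card = defaultdict(list)
    for card_id, ts in transactions:
        by_card[card_id].append(ts)
    if not by_card:
        return 0.0
    triggered = 0
    for times in by_card.values():
        times.sort()
        # max_txns + 1 events fit in the window if the span from event i
        # to event i + max_txns is at most window_s.
        if any(times[i + max_txns] - times[i] <= window_s
               for i in range(len(times) - max_txns)):
            triggered += 1
    return triggered / len(by_card)


# Toy data: card A bursts in the "real" set but is spread out in the "synthetic" set.
real = [("A", 0), ("A", 600), ("A", 1200), ("A", 1800), ("B", 0), ("B", 7200)]
synthetic = [("A", 0), ("A", 4000), ("A", 8000), ("A", 12000), ("B", 0), ("B", 7200)]

real_rate = rule_trigger_rate(real)
synthetic_rate = rule_trigger_rate(synthetic)
print(real_rate, synthetic_rate, real_rate - synthetic_rate)  # 0.5 0.0 0.5
```

Running the same rule on both datasets before tuning any threshold is cheap, and a gap between the two rates is a direct estimate of how far off the production alert volume will be.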

So you can ship a model that looked excellent on every offline metric and underperforms exactly where it matters — on the behavioral signals that distinguish a card tester from an ordinary cardholder.

Forward generation — and measuring what matters

The root cause is the modeling paradigm. A row-independent generator learns each column's marginal and the pairwise correlations between columns, then samples each transaction independently. Within-entity sequences — bursts, velocity, shared infrastructure — are never represented, so they cannot be reproduced. VynFi takes the opposite approach. It generates forward from a fully specified process: entities, accounts, and counterparties exist first, and transactions are produced as sequences with real recurring structure, burst dynamics, and multi-account topology. The behavioral signals are present by construction, every one carrying a ground-truth label.
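To show what "entities first, then sequences" means in practice, here is a deliberately tiny toy. It is not VynFi's engine and makes no claim about how that engine works; every name, parameter, and distribution below is an assumption chosen for illustration. Accounts and devices are created up front, a fraud ring shares one device, and each account then emits its own time-ordered sequence with a label attached at generation time.

```python
import random
from dataclasses import dataclass


@dataclass
class Account:
    account_id: str
    device_id: str
    is_fraud: bool


def generate(n_legit=5, ring_size=3, seed=0):
    """Toy forward generator: build entities, then emit labeled sequences per entity."""
    rng = random.Random(seed)
    # Step 1: entities exist before any transaction does.
    accounts = [Account(f"acct-{i}", f"dev-{i}", False) for i in range(n_legit)]
    accounts += [Account(f"ring-{i}", "dev-shared", True) for i in range(ring_size)]

    # Step 2: each account produces its own ordered sequence.
    transactions = []
    for acct in accounts:
        t = rng.uniform(0, 86400)
        n_events = 6 if acct.is_fraud else 3
        for _ in range(n_events):
            # Fraud accounts burst (seconds apart); legitimate ones are hours apart.
            t += rng.uniform(5, 60) if acct.is_fraud else rng.uniform(3600, 6 * 3600)
            transactions.append({
                "account_id": acct.account_id,
                "device_id": acct.device_id,
                "timestamp": t,
                "amount": round(rng.uniform(5, 200), 2),
                "is_fraud": acct.is_fraud,  # ground-truth label known by construction
            })
    return accounts, transactions


accounts, transactions = generate()
print(len(accounts), len(transactions))  # 8 33
```

Because the burst and the shared device are properties of the entities rather than of individual rows, they survive into the output without the generator having to learn them from data.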

And VynFi measures behavioral fidelity rather than assuming it. The engine evaluates the same P1–P4 family on what it generates, so behavioral fidelity is a number on the dataset's Fidelity Report Card — alongside Benford conformity, coherence validators, and ground-truth label provenance — not a claim you have to take on faith.

Want the receipts? Read the benchmark (arXiv:2604.13125), then generate a labeled fraud dataset with your free signup credits — the Proof Sheet reports the behavioral signals alongside the statistical and coherence checks.

Four things to do before you trust synthetic fraud data

Stop treating TSTR AUROC as sufficient. It averages over all transactions equally and does not test velocity-rule calibration, sequence-level anomaly scoring, or graph ring detection.
Measure velocity-rule trigger rates on your synthetic data and compare them to real fraud. If they differ, every threshold you tune offline will be miscalibrated in production.
Check graph-motif structure (shared device/IP subgraphs), not just node and edge counts. Row-independent generators cannot reproduce it.
Prefer generators that model sequences and entities, not independent rows — and that report a behavioral-fidelity number you can inspect.
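The graph-motif check in the list above can be sketched with a simple count of how many accounts share each device. This is a minimal illustration, assuming a motif is "one device used by at least two accounts"; the benchmark's P3 definition may be richer, and the function name and toy links are hypothetical.

```python
from collections import Counter, defaultdict


def shared_device_motifs(links, min_accounts=2):
    """Histogram of shared-device group sizes.

    `links` is an iterable of (account_id, device_id) pairs. The result maps
    group size -> number of devices shared by exactly that many accounts.
    """
    accounts_by_device = defaultdict(set)
    for account_id, device_id in links:
        accounts_by_device[device_id].add(account_id)
    return Counter(
        len(accts) for accts in accounts_by_device.values()
        if len(accts) >= min_accounts
    )


# Both toy graphs have 4 accounts, 2 devices, and 4 edges.
real_links = [("a1", "d1"), ("a2", "d1"), ("a3", "d1"), ("a4", "d2")]
synthetic_links = [("a1", "d1"), ("a2", "d1"), ("a3", "d2"), ("a4", "d2")]

print(shared_device_motifs(real_links))       # Counter({3: 1})
print(shared_device_motifs(synthetic_links))  # Counter({2: 2})
```

The two toy graphs are indistinguishable by node and edge counts, yet one contains a three-account ring and the other contains two pairs, which is why counting motifs rather than totals matters.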
