Build a Fraud Detector in 30 Minutes with Python
Generate fully labeled fraud data, engineer features, train a RandomForest classifier, and compare it against rule-based audit analytics — all in one notebook session.
Labeled fraud data is the most valuable and most inaccessible resource in financial machine learning. In the real world, obtaining ground-truth fraud labels requires forensic investigation, whistleblower disclosures, or regulatory enforcement actions. A single confirmed fraud case can take years to surface and involves highly restricted evidentiary records that almost never reach data science teams.
This is the ground truth impossibility: you cannot build a supervised fraud detector without labeled examples, and you cannot get labeled examples without catching fraudsters first. VynFi breaks that cycle. Every journal entry it generates carries an is_fraud boolean and a fraud_type string — ground-truth labels that are structurally guaranteed, not inferred.
This tutorial walks through the complete workflow from the VynFi fraud detection lab notebook: generate paired clean and fraud-injected datasets, flatten them into a usable DataFrame, engineer detection features, train a RandomForest classifier, evaluate it with a confusion matrix, and compare it against five classic rule-based audit analytics tests.
**Update (2026-04-19, DataSynth 3.1.1):** Fraud-labeled entries now carry **verified behavioural signal**: weekend posting shows a ×32 lift on fraud vs. non-fraud populations, round-dollar amounts ×170, and post-close dates ×3,106. Feature-importance scores on those signals jump from ~0 (3.0.x) to roughly 5–15% of total RF importance. The new `is_fraud_propagated` header flag distinguishes scheme-level fraud (document ring fan-out) from direct line-level injection; train a stratified detector via the SDK's `client.jobs.fraud_split(job_id)` or the ml_training_pipeline.py example. For the canonical feature set, see behavioral_fraud_patterns.py, which verifies the lift end-to-end against a fresh job.
Generate Clean and Fraud-Injected Datasets
The lab uses two generation jobs from the same sector and schema — a clean baseline with no fraud, and a fraud-injected dataset with revenue_fraud at a 5% injection rate. Running them side by side lets you measure how fraud distorts statistical properties and gives your classifier unambiguous positive and negative examples.
```python
import os

import vynfi

client = vynfi.VynFi(api_key=os.environ["VYNFI_API_KEY"])

base_config = {
    "sector": "retail",
    "country": "US",
    "accountingFramework": "us_gaap",
    "rows": 1000,
    "companies": 5,
    "periods": 3,
    "periodLength": "monthly",
    "processModels": ["o2c", "p2p"],
    "exportFormat": "json",
}

# Clean baseline — no fraud injected
clean_job = client.jobs.generate_config(
    config={**base_config, "fraudPacks": [], "fraudRate": 0.0}
)
print(f"Clean job submitted: {clean_job.id}")

# Fraud dataset — revenue_fraud at 5%
fraud_job = client.jobs.generate_config(
    config={**base_config, "fraudPacks": ["revenue_fraud"], "fraudRate": 0.05}
)
print(f"Fraud job submitted: {fraud_job.id}")

# Wait for both
clean_done = client.jobs.wait(clean_job.id, timeout=600.0)
fraud_done = client.jobs.wait(fraud_job.id, timeout=600.0)

clean_archive = client.jobs.download_archive(clean_done.id)
fraud_archive = client.jobs.download_archive(fraud_done.id)
```

Flatten to a Training DataFrame
The archive stores journal entries as header-plus-lines documents. The flatten_entries function merges each header's fraud labels down onto every line item, producing one row per line with is_fraud and fraud_type as columns. The resulting DataFrame is your training corpus.
```python
import pandas as pd


def flatten_entries(entries, dataset_label=""):
    """Flatten journal entry headers + lines into a single DataFrame."""
    rows = []
    for entry in entries:
        hdr = entry["header"]
        header = {
            "entry_id": hdr.get("entry_id") or hdr.get("id"),
            "date": hdr.get("posting_date") or hdr.get("date"),
            "is_manual": hdr.get("is_manual", False),
            "is_post_close": hdr.get("is_post_close", False),
            "is_fraud": hdr.get("is_fraud", False),
            "fraud_type": hdr.get("fraud_type"),
            "created_by": hdr.get("created_by", ""),
            "approved_by": hdr.get("approved_by", ""),
            "document_type": hdr.get("document_type", ""),
        }
        for line in entry.get("lines", []):
            # Coerce to numeric and treat missing or unparseable values as zero
            debit = pd.to_numeric(line.get("debit_amount", 0), errors="coerce")
            credit = pd.to_numeric(line.get("credit_amount", 0), errors="coerce")
            debit = float(debit) if pd.notna(debit) else 0.0
            credit = float(credit) if pd.notna(credit) else 0.0
            rows.append({
                **header,
                "account_code": line.get("account_code") or line.get("gl_account", ""),
                "debit": debit,
                "credit": credit,
                "amount": debit + credit,
            })
    df = pd.DataFrame(rows)
    df["dataset"] = dataset_label
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    return df


clean_entries = clean_archive.json("journal_entries.json")
fraud_entries = fraud_archive.json("journal_entries.json")

clean_df = flatten_entries(clean_entries, "clean")
fraud_df = flatten_entries(fraud_entries, "fraud")

print(f"Clean rows: {len(clean_df):,}")
print(f"Fraud rows: {len(fraud_df):,}")

# Verify labels
fraud_rate = fraud_df.groupby("entry_id")["is_fraud"].first().mean()
print(f"Observed fraud entry rate: {fraud_rate:.1%}")
```

The clean dataset should show exactly zero fraud-labeled entries. If it does not, the generation config has overlapping fraud packs — check that fraudRate is explicitly set to 0.0 on the clean job.
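The 3.1.1 update note above also mentions an `is_fraud_propagated` header flag. As a minimal sketch, assuming your archive was produced on DataSynth 3.1.1 or later and every header carries the flag, you can pull it into the flattened frame and split direct injections from scheme-level fan-out:

```python
# Sketch only: assumes DataSynth 3.1.1+ archives whose headers expose
# is_fraud_propagated. Row order below mirrors flatten_entries (one row per line).
fraud_df["is_fraud_propagated"] = [
    entry["header"].get("is_fraud_propagated", False)
    for entry in fraud_entries
    for _ in entry.get("lines", [])
]

direct_fraud = fraud_df[fraud_df["is_fraud"] & ~fraud_df["is_fraud_propagated"]]
scheme_fraud = fraud_df[fraud_df["is_fraud_propagated"]]
print(f"Direct injections: {len(direct_fraud):,}  Propagated (scheme-level): {len(scheme_fraud):,}")
```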
Feature Engineering
Good fraud detection depends on features that capture signals auditors look for manually. The engineer_features function adds five categories: amount features (log scale, round-number flag, z-score), timing features (weekend and month-end flags, day of week), account features (frequency, unusual debit-credit combo rarity), process flags (manual entry, post-close), and document type encoding.
```python
import numpy as np


def engineer_features(df):
    df = df.copy()

    # Amount features
    df["log_amount"] = np.log1p(df["amount"])
    df["is_round"] = ((df["amount"] % 1000 == 0) & (df["amount"] > 0)).astype(int)
    mu, sigma = df["amount"].mean(), df["amount"].std()
    df["amount_zscore"] = (df["amount"] - mu) / sigma if sigma > 0 else 0.0

    # Timing features
    if df["date"].notna().any():
        df["day_of_week"] = df["date"].dt.dayofweek
        df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
        df["is_month_end"] = df["date"].dt.is_month_end.astype(int)
    else:
        df[["day_of_week", "is_weekend", "is_month_end"]] = 0

    # Account frequency (rare accounts = higher risk)
    acct_freq = df["account_code"].value_counts()
    df["account_frequency"] = df["account_code"].map(acct_freq)

    # Unusual account combination rarity
    entry_combos = df.groupby("entry_id")["account_code"].apply(
        lambda x: "|".join(sorted(x.unique()))
    )
    combo_freq = entry_combos.value_counts()
    df["account_combo_rarity"] = df["entry_id"].map(entry_combos).map(combo_freq)
    df["unusual_combo"] = (df["account_combo_rarity"] <= 2).astype(int)

    # Process flags
    df["is_manual_int"] = df["is_manual"].astype(int)
    df["is_post_close_int"] = df["is_post_close"].astype(int)
    df["doc_type_encoded"] = df["document_type"].astype("category").cat.codes

    return df


fraud_df = engineer_features(fraud_df)
```

Train a RandomForest Classifier
With labeled features ready, training is straightforward. The fraud detection notebook uses a RandomForest with class_weight='balanced' to handle the class imbalance without oversampling. The model is evaluated on a held-out test set, and feature importance shows which signals matter most.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, precision_recall_fscore_support

FEATURE_COLS = [
    "log_amount", "is_round", "amount_zscore",
    "day_of_week", "is_weekend", "is_month_end",
    "account_frequency", "unusual_combo",
    "is_manual_int", "is_post_close_int", "doc_type_encoded",
]

X = fraud_df[FEATURE_COLS].fillna(0)
y = fraud_df["is_fraud"].astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(
    n_estimators=200,
    class_weight="balanced",
    max_depth=12,
    random_state=42,
    n_jobs=-1,
)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=["Legitimate", "Fraud"]))
```

On a typical 5% fraud-rate dataset you will see precision around 70-75% and recall around 80-85% for the fraud class. The most important features are usually log_amount (large entries are suspicious), is_post_close (post-close entries bypass controls), and is_round (round numbers are a classic red flag). is_weekend also ranks highly because fraudulent manual entries often appear outside business hours.
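To check which of those signals the forest actually leans on, you can rank its impurity-based importances; a minimal sketch using the `clf` and `FEATURE_COLS` defined above and scikit-learn's standard `feature_importances_` attribute:

```python
# Rank impurity-based feature importances from the trained RandomForest
importances = pd.Series(clf.feature_importances_, index=FEATURE_COLS)
print(importances.sort_values(ascending=False).round(3))
```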
Benford's Law Analysis
Benford's Law states that in naturally occurring financial data, the leading digit '1' appears about 30.1% of the time, '2' about 17.6%, and so on. Fabricated numbers tend to follow more uniform distributions. The lab measures the mean absolute deviation (MAD) from Benford's expected distribution for fraud versus legitimate entries separately.
```python
BENFORD = {
    1: 0.301, 2: 0.176, 3: 0.125, 4: 0.097, 5: 0.079,
    6: 0.067, 7: 0.058, 8: 0.051, 9: 0.046,
}


def leading_digit_dist(amounts):
    pos = amounts[amounts > 0]
    digits = pos.apply(lambda x: int(str(x).lstrip("0.")[0]) if x > 0 else 0)
    digits = digits[digits > 0]
    freq = digits.value_counts().sort_index() / len(digits)
    return freq.reindex(range(1, 10), fill_value=0.0)


legit = fraud_df[~fraud_df["is_fraud"]]
fraud = fraud_df[fraud_df["is_fraud"]]

dist_legit = leading_digit_dist(legit["amount"])
dist_fraud = leading_digit_dist(fraud["amount"])

expected = pd.Series(BENFORD)
mad_legit = (dist_legit - expected).abs().mean()
mad_fraud = (dist_fraud - expected).abs().mean()

print(f"Benford MAD -- Legitimate: {mad_legit:.4f}")
print(f"Benford MAD -- Fraud: {mad_fraud:.4f}")
print(f"Fraud deviates {mad_fraud / mad_legit:.1f}x more than legitimate entries")
```

In the lab's output, legitimate entries typically show a Benford MAD around 0.004 — well within the range you would see from real financial data. Fraud-injected entries show MAD around 0.020-0.030, roughly 5-7x higher. This confirms that VynFi's fraud injection meaningfully breaks the Benford distribution, giving your model a real signal to learn from.
Rule-Based Detection Comparison
Before ML, auditors used — and still use — rule-based analytics to flag suspicious entries. These rules are interpretable, auditable, and often required by regulators. The lab measures five classic rules against the same labeled dataset, giving you a direct comparison with the ML model.
```python
# Five classic audit analytics rules
fraud_df["rule_round"] = (fraud_df["amount"] % 1000 == 0) & (fraud_df["amount"] > 0)
fraud_df["rule_weekend"] = fraud_df["is_weekend"] == 1
fraud_df["rule_duplicate"] = fraud_df.groupby(
    ["date", "account_code", "amount"]
)["amount"].transform("size") > 1
fraud_df["rule_unusual"] = fraud_df["unusual_combo"] == 1
fraud_df["rule_post_close"] = fraud_df["is_post_close"].astype(bool)

rules = {
    "Round Amount (mod 1000)": "rule_round",
    "Weekend Posting": "rule_weekend",
    "Duplicate Entry": "rule_duplicate",
    "Unusual Account Combo": "rule_unusual",
    "Post-Close Entry": "rule_post_close",
}

total_fraud = fraud_df["is_fraud"].sum()

print(f"{'Rule':<30} {'Flagged':>8} {'True Pos':>10} {'Precision':>10} {'Recall':>8}")
for name, col in rules.items():
    flagged = fraud_df[col].sum()
    true_pos = fraud_df[fraud_df[col] & fraud_df["is_fraud"]].shape[0]
    prec = true_pos / flagged if flagged > 0 else 0.0
    rec = true_pos / total_fraud if total_fraud > 0 else 0.0
    print(f"{name:<30} {flagged:>8} {true_pos:>10} {prec:>10.1%} {rec:>8.1%}")
```

Individual rules tend to achieve 40-60% recall at low precision (8-15%). The ML model achieves comparable recall with substantially higher precision because it combines all signals simultaneously. The real insight is that labeled data makes this comparison possible — without ground truth, you would have no way to know which rules are actually catching fraud versus generating noise.
Use the score_detector helper from the lab notebook to run a standardized precision/recall/F1 comparison across all five fraud templates (revenue_fraud, vendor_kickback, payroll_ghost, management_override, comprehensive). Scheme-level measurement is only possible with labeled data.
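The real `score_detector` ships with the notebook; the sketch below is only an approximation of what such a helper does, built on scikit-learn's `precision_recall_fscore_support`, so treat its signature as illustrative rather than the notebook's actual API:

```python
from sklearn.metrics import precision_recall_fscore_support


def score_detector(y_true, y_pred, label=""):
    """Illustrative stand-in for the notebook helper: precision/recall/F1 per detector."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0
    )
    print(f"{label:<30} precision={p:.1%} recall={r:.1%} f1={f1:.2f}")
    return {"detector": label, "precision": p, "recall": r, "f1": f1}


# Score the ML model and one audit rule against the same ground-truth labels
score_detector(y_test, y_pred, "RandomForest")
score_detector(fraud_df["is_fraud"], fraud_df["rule_round"], "Round Amount rule")
```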
Building a Reusable Fraud Test Suite
VynFi provides system templates pre-configured for each fraud scheme. Running all five templates produces a detection scorecard that measures how your rules and model perform against each scheme independently. This kind of scheme-level benchmarking is the gold standard for fraud analytics validation, and it is only achievable with labeled synthetic data; a sketch of the benchmark loop follows the template list below.
- fraud-revenue-manipulation: fictitious sales, channel stuffing, premature recognition
- fraud-procurement: inflated vendor invoices, shell company payments
- fraud-payroll-ghost: ghost employees, phantom hours, inflated benefits
- fraud-management-override: post-close adjustments, bypassed approvals, round-number overrides
- fraud-multi-scheme: all four combined at varying rates
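A per-scheme scorecard can be assembled by regenerating data one pack at a time and scoring the same model against each. The sketch below reuses `generate_config`, `flatten_entries`, `engineer_features`, and the `score_detector` stand-in from earlier; the non-revenue pack names are an assumption that the template list maps one-to-one onto `fraudPacks` values.

```python
# Hypothetical pack slugs: only "revenue_fraud" is confirmed earlier in this tutorial
PACKS = ["revenue_fraud", "vendor_kickback", "payroll_ghost", "management_override"]

scorecard = []
for pack in PACKS:
    job = client.jobs.generate_config(
        config={**base_config, "fraudPacks": [pack], "fraudRate": 0.05}
    )
    done = client.jobs.wait(job.id, timeout=600.0)
    entries = client.jobs.download_archive(done.id).json("journal_entries.json")
    df = engineer_features(flatten_entries(entries, pack))
    # Caveat: doc_type_encoded category codes are fit per dataset; a production
    # pipeline would share one encoder across training and scoring data.
    preds = clf.predict(df[FEATURE_COLS].fillna(0))
    scorecard.append(score_detector(df["is_fraud"].astype(int), preds, pack))

print(pd.DataFrame(scorecard))
```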
Next Steps
The 30-minute lab gets you to a working classifier. From here, common improvements include swapping RandomForest for XGBoost or LightGBM, using SMOTE to address class imbalance, adding text features from entry descriptions, and building a real-time scoring pipeline that feeds flagged entries into a case management system. The labeled data from VynFi works the same way at any scale — from a 10K-row development set to a 1M-row production training corpus.
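The gradient-boosting swap, for example, is close to a drop-in change; a minimal sketch, assuming the xgboost package is installed and reusing the train/test split from above:

```python
from xgboost import XGBClassifier

# scale_pos_weight offsets the class imbalance, playing the role that
# class_weight="balanced" played for the RandomForest
scale = (y_train == 0).sum() / max((y_train == 1).sum(), 1)

xgb = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.1,
    scale_pos_weight=scale,
    eval_metric="logloss",
    n_jobs=-1,
    random_state=42,
)
xgb.fit(X_train, y_train)
print(classification_report(y_test, xgb.predict(X_test), target_names=["Legitimate", "Fraud"]))
```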
The full notebook is available at 03_fraud_detection_lab.ipynb in the VynFi Python SDK repository. It includes SOD violation analysis, confidence threshold optimization, and a multi-template test suite runner.
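Confidence threshold optimization, for instance, amounts to sweeping the classifier's fraud-probability cutoff instead of accepting the default 0.5; a minimal sketch on the held-out split from earlier (the notebook's own version may differ):

```python
import numpy as np
from sklearn.metrics import f1_score

# Sweep the fraud-probability cutoff and keep the threshold with the best F1
proba = clf.predict_proba(X_test)[:, 1]
thresholds = np.linspace(0.1, 0.9, 17)
best = max(thresholds, key=lambda t: f1_score(y_test, proba >= t))
print(f"Best threshold: {best:.2f}  F1 at best threshold: {f1_score(y_test, proba >= best):.2f}")
```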
**Ready-to-use datasets** on Hugging Face (regenerated with DataSynth 3.1.1): VynFi/vynfi-journal-entries-1m (2.1M line items, manufacturing, ~7% fraud with propagation labels), VynFi/vynfi-audit-p2p (P2P document flow with is_fraud_propagated), and VynFi/vynfi-aml-100k (banking + AML labels). Load directly with `datasets.load_dataset("VynFi/<slug>")`.