Multi-class fraud typology: when `is_fraud` isn't enough for ML training data
DataSynth 5.27 added a `fraud_type` column on the je_network edge list — a fine-grained typology that goes beyond the binary `is_fraud` flag. Here's the ACFE-aligned taxonomy, why multi-class labels improve ML model training, and how VynFi surfaces it.
Binary fraud labels are 80% of what an ML pipeline needs and 100% of what most production datasets ship. The remaining 20% — knowing what *kind* of fraud — is where multi-class models earn their keep. Treating 'management override' and 'expense-reimbursement padding' as the same class trains a model that flags both, but doesn't help an investigator triage what to look at first. DS 5.27 fills the gap.
**TL;DR** — The je_network edge list (the graph view of debit/credit flows VynFi surfaces in the JE network visualizer) now carries a `fraud_type` column. Empty string on non-fraud edges; one of the ACFE-aligned typology codes on fraud edges. Five concrete codes ship: `management_override`, `revenue_recognition`, `expense_fictitious`, `journal_entry_manipulation`, `kickback_scheme`. The portal surfaces it in the JE-network edge detail panel and in the edge-tooltip when present.
The ACFE alignment
The Association of Certified Fraud Examiners (ACFE) Report to the Nations taxonomy is the industry-standard framework for classifying occupational fraud. It splits into three top-level branches — Financial Statement Fraud, Asset Misappropriation, Corruption — with 20+ sub-schemes. DS 5.27's fraud_type codes map to the most-frequent fraud schemes by occurrence rate:
- **`management_override`** — Financial Statement Fraud / Override of Controls. The auditor's persistent worry. SAS 99 / AS 2401 specifically require auditors to address this risk. Generated as anomalous postings that bypass standard segregation-of-duties.
- **`revenue_recognition`** — Financial Statement Fraud / Improper Revenue Recognition. Premature recognition, channel-stuffing, round-tripping. Most-litigated fraud type per ACFE.
- **`expense_fictitious`** — Asset Misappropriation / Fraudulent Disbursements / Expense Reimbursements. Padded expense reports, fictitious vendor invoices.
- **`journal_entry_manipulation`** — Financial Statement Fraud / Improper Journal Entries. Year-end adjustments without business purpose, manual JEs to round-figure amounts, postings outside business hours.
- **`kickback_scheme`** — Corruption / Bribery / Kickbacks. Vendor relationships with above-market pricing, recurring round-figure payments to a single counterparty.
Non-fraud JE edges carry an empty string in `fraud_type`. Every fraud edge carries one of the typology codes, so an empty string always means "not fraud" rather than "fraud of unspecified type".
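Before training on the column, it's worth checking that contract on your own export. The sketch below is a minimal example, not DataSynth code: only `is_fraud` and `fraud_type` are column names from this post, while the `edge_id` column, the CSV layout, and the `true`/`false` encoding are assumptions.

```python
import csv
import io

# The five typology codes listed above.
FRAUD_TYPES = {
    "management_override",
    "revenue_recognition",
    "expense_fictitious",
    "journal_entry_manipulation",
    "kickback_scheme",
}

# Hypothetical sample export. `edge_id` and the "true"/"false" encoding
# are assumptions made for this example.
SAMPLE = """edge_id,is_fraud,fraud_type
1,false,
2,true,kickback_scheme
3,true,management_override
"""


def check_labels(rows):
    """Return the edge_ids whose fraud_type disagrees with is_fraud."""
    bad = []
    for row in rows:
        is_fraud = row["is_fraud"].strip().lower() == "true"
        fraud_type = row["fraud_type"]
        # Fraud edges need a known code; non-fraud edges need an empty string.
        ok = (fraud_type in FRAUD_TYPES) if is_fraud else (fraud_type == "")
        if not ok:
            bad.append(row["edge_id"])
    return bad


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
violations = check_labels(rows)
print(violations)
```

An empty list means every edge respects the empty-string-or-code rule; anything else is worth investigating before the labels reach a training job.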
Why multi-class matters
Three concrete reasons binary `is_fraud` is insufficient for production ML:
1. Triage prioritisation
An investigator with a hundred flagged transactions needs to know which to look at first. A model that says 'edge 47291 is fraud, confidence 0.91' is useful. A model that says 'edge 47291 is `management_override`, confidence 0.91' tells the investigator to escalate immediately (override fraud is high-impact, often-litigated). A model that says 'edge 47291 is `expense_fictitious`, confidence 0.91' lets the investigator route to a junior team for receipt verification.
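As a rough illustration of that routing, here is a minimal sketch. The two explicit routes mirror the examples above; the queue names, the default route, the numeric priorities, and the extra edge ids are assumptions for illustration, not VynFi behaviour.

```python
# Only the two explicit routes come from the text; the default queue and
# the numeric priorities are assumptions for illustration.
ROUTES = {
    "management_override": ("escalate_immediately", 0),
    "expense_fictitious": ("junior_receipt_verification", 2),
}
DEFAULT_ROUTE = ("standard_review", 1)


def triage(predictions):
    """Order (edge_id, fraud_type, confidence) tuples for an investigator.

    Lower priority numbers come first; ties break on higher confidence.
    """
    routed = []
    for edge_id, fraud_type, confidence in predictions:
        queue, priority = ROUTES.get(fraud_type, DEFAULT_ROUTE)
        routed.append((priority, -confidence, edge_id, queue))
    routed.sort()
    return [(edge_id, queue) for _, _, edge_id, queue in routed]


predictions = [
    (47291, "expense_fictitious", 0.91),
    (47292, "management_override", 0.88),  # hypothetical edge
    (47293, "kickback_scheme", 0.95),      # hypothetical edge
]
work_queue = triage(predictions)
print(work_queue)
```

With a binary label, all three edges would land in one undifferentiated pile sorted by confidence alone; with the typology, the override edge jumps the queue even though its confidence is the lowest.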
2. Imbalanced-class learning
Different fraud schemes occur at radically different prevalences. ACFE reports `expense_fictitious` at ~5x the frequency of `management_override` but only ~0.1x the median dollar impact. A binary classifier applies the same loss to both schemes, so the high-volume class dominates training. A multi-class classifier with class weights (or focal loss) tuned to the impact-weighted prevalence learns the rare-but-expensive schemes properly.
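One way to build such weights is sketched below, under stated assumptions: it takes the common "balanced" inverse-frequency heuristic (`n_samples / (n_classes * count)`) and multiplies it by a relative-impact factor. The counts and impact factors are made-up numbers chosen to match the ~5x and ~0.1x ratios above, and the function name is hypothetical.

```python
from collections import Counter


def impact_weighted_class_weights(labels, relative_impact):
    """Inverse-frequency class weights scaled by relative dollar impact.

    Weights are normalised so that their mean is 1.0.
    """
    counts = Counter(labels)
    n_total = len(labels)
    n_classes = len(counts)
    raw = {
        cls: (n_total / (n_classes * n)) * relative_impact[cls]
        for cls, n in counts.items()
    }
    mean = sum(raw.values()) / n_classes
    return {cls: w / mean for cls, w in raw.items()}


# Illustrative numbers only: 5x the frequency, 0.1x the impact.
labels = ["expense_fictitious"] * 500 + ["management_override"] * 100
relative_impact = {"expense_fictitious": 1.0, "management_override": 10.0}

weights = impact_weighted_class_weights(labels, relative_impact)
print(weights)
```

In this toy setup a misclassified `management_override` edge costs 50x as much as a misclassified `expense_fictitious` edge (5x from rarity, 10x from impact). The resulting dictionary can be passed to whatever per-class weighting hook your training framework exposes.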
3. Model evaluation by type
Aggregate AUC tells you the model works in expectation. Per-class precision/recall tells you whether the model is missing `management_override` (the SAS 99 requirement) while excelling at `expense_fictitious` (the auditor's daily work). Models tend to ship into production with high aggregate metrics and silent failures on the highest-impact classes; multi-class labels make those failures visible.
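A toy example makes the point concrete. The sketch below computes per-class precision and recall from scratch on made-up predictions, using plain accuracy as a stand-in for the aggregate metric (AUC would need scores rather than hard labels).

```python
def per_class_precision_recall(y_true, y_pred):
    """Return {class: (precision, recall)} for hard multi-class predictions."""
    report = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        report[cls] = (precision, recall)
    return report


# Made-up labels: 8 expense cases, 2 override cases, one override missed.
y_true = ["expense_fictitious"] * 8 + ["management_override"] * 2
y_pred = ["expense_fictitious"] * 9 + ["management_override"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = per_class_precision_recall(y_true, y_pred)
print(accuracy)
print(report)
```

Accuracy comes out at 0.9, which looks healthy, while recall on `management_override` is only 0.5: half of the highest-impact cases are missed, and the aggregate number gives no hint of it.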
Where the labels come from
Inside DataSynth, the je_network edge list is built from the JE balance graph. When a JE has been flagged `is_fraud = true` by the upstream fraud-injection logic, the fraud-type lookup chooses one of the typology codes based on the scheme that triggered the flag. The mapping is deterministic for a given seed and configuration — so the same fraud injection produces the same label across reruns.
The fraud scenarios shipped in `datasynth-config` (`fraud_scenarios.yaml`) carry per-scenario typology hints. A retail `shrinkage_fraud` scenario, for example, generates `expense_fictitious` typology labels on the JEs it injects.
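To show what "deterministic for a given seed and configuration" means in practice, here is a hedged sketch. It is not DataSynth's implementation: the function, the in-memory lookup table, and the `fraud_rate` parameter are hypothetical, and the only pairing taken from this post is `shrinkage_fraud` to `expense_fictitious`.

```python
import random

# Hypothetical stand-in for per-scenario typology hints. Only the
# shrinkage_fraud pairing is stated in the text.
SCENARIO_TYPOLOGY = {
    "shrinkage_fraud": "expense_fictitious",
}


def label_edges(edge_ids, scenario, fraud_rate, seed):
    """Assign a fraud_type per edge: "" for non-fraud, a code for fraud."""
    # A local RNG keeps the result a pure function of the arguments.
    rng = random.Random(seed)
    labels = {}
    for edge_id in edge_ids:
        is_fraud = rng.random() < fraud_rate
        labels[edge_id] = SCENARIO_TYPOLOGY[scenario] if is_fraud else ""
    return labels


first = label_edges(range(1000), "shrinkage_fraud", fraud_rate=0.05, seed=42)
second = label_edges(range(1000), "shrinkage_fraud", fraud_rate=0.05, seed=42)
print(first == second)
```

The design point is that randomness decides which JEs get injected, while the typology itself is a plain lookup on the scenario, so reruns with the same seed and configuration reproduce the same labels.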
How VynFi surfaces it
The portal's JE-network visualizer (the D3 force-directed graph in `/dashboard/groups/{id}/runs/{run_id}`) renders fraud edges in red. As of the DS 5.27 adoption (PR #2), the visualizer also:
- Surfaces a `Fraud type` field in the clicked-edge detail panel when both `is_fraud` and `fraud_type` are present. Non-fraud edges hide the field (rather than showing a blank '—').
- Includes the typology in the edge tooltip: `FRAUD (management_override)` instead of just `FRAUD`. Helps an investigator hover-triage without opening the panel.
- Uses a column-name-resolving parser, so additive columns from future DS releases stay non-breaking (regression-pinned).
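The idea behind the last two bullets can be sketched in a few lines. This is an illustrative Python version, not the portal's code: the CSV layout, the `source`/`target`/`future_col` columns, the `true`/`false` encoding, and the function names are all assumptions. Columns are resolved by header name rather than position, so files with or without `fraud_type`, or with extra columns, parse the same way.

```python
import csv
import io


def parse_edges(text):
    """Parse an edge list by header name, tolerating missing or extra columns."""
    edges = []
    for row in csv.DictReader(io.StringIO(text)):
        edges.append({
            "is_fraud": (row.get("is_fraud") or "").strip().lower() == "true",
            # Absent column (older exports) falls back to the empty string.
            "fraud_type": row.get("fraud_type") or "",
        })
    return edges


def tooltip(edge):
    """Build the tooltip text: typology in parentheses when present."""
    if not edge["is_fraud"]:
        return ""
    if edge["fraud_type"]:
        return f"FRAUD ({edge['fraud_type']})"
    return "FRAUD"


# Hypothetical exports: one without fraud_type, one with reordered
# and additional columns.
OLD = "source,target,is_fraud\nA,B,true\n"
NEW = (
    "source,target,fraud_type,is_fraud,future_col\n"
    "A,B,management_override,true,x\n"
    "C,D,,false,y\n"
)

old_tips = [tooltip(e) for e in parse_edges(OLD)]
new_tips = [tooltip(e) for e in parse_edges(NEW)]
print(old_tips, new_tips)
```

A positional parser would have broken on the second file the moment `fraud_type` was inserted ahead of `is_fraud`; resolving by name makes the new column purely additive.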
For ML pipelines that train against VynFi data directly (HF datasets `vynfi/je-fraud-gnn`, etc.), the `fraud_type` column is now a first-class feature on every dataset card we publish.
What's next
Two extensions planned for the next round (5.30+):
- **Per-process fraud rate overrides** — different business processes (P2P, O2C, payroll) have radically different baseline fraud rates. The 5.30 lever lets you tune per-process rates in the API surface, so a fraud-detection model can be trained on a P2P-heavy dataset with the corresponding fraud distribution.
- **Fraud-scheme co-occurrence** — real fraud cases often involve multiple typologies on the same chain (kickback + journal-entry manipulation, e.g.). The single-typology label is a simplification; a multi-label extension is on the roadmap if customers ask for it.