Realistic Synthetic Financial Transactions for Anti-Money Laundering
Models
- URL: http://arxiv.org/abs/2306.16424v3
- Date: Thu, 25 Jan 2024 11:25:09 GMT
- Title: Realistic Synthetic Financial Transactions for Anti-Money Laundering
Models
- Authors: Erik Altman, Jovan Blanuša, Luc von Niederhäusern, Béni
Egressy, Andreea Anghel, Kubilay Atasu
- Abstract summary: Money laundering is the movement of illicit funds to conceal their origins.
The UN estimates that 2-5% of global GDP, or $0.8-$2.0 trillion, is laundered globally each year.
This paper contributes a synthetic financial transaction dataset generator and a set of synthetically generated AML datasets.
- Score: 2.3802629107286046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the widespread digitization of finance and the increasing popularity of
cryptocurrencies, the sophistication of fraud schemes devised by cybercriminals
is growing. Money laundering -- the movement of illicit funds to conceal their
origins -- can cross bank and national boundaries, producing complex
transaction patterns. The UN estimates 2-5% of global GDP or $0.8 - $2.0
trillion dollars are laundered globally each year. Unfortunately, real data to
train machine learning models to detect laundering is generally not available,
and previous synthetic data generators have had significant shortcomings. A
realistic, standardized, publicly-available benchmark is needed for comparing
models and for the advancement of the area.
To this end, this paper contributes a synthetic financial transaction dataset
generator and a set of synthetically generated AML (Anti-Money Laundering)
datasets. We have calibrated this agent-based generator to match real
transactions as closely as possible and made the datasets public. We describe
the generator in detail and demonstrate how the datasets generated can help
compare different machine learning models in terms of their AML abilities. In a
key way, using synthetic data in these comparisons can be even better than
using real data: the ground truth labels are complete, whilst many laundering
transactions in real data are never detected.
Related papers
- Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
- How Realistic Is Your Synthetic Data? Constraining Deep Generative Models for Tabular Data [57.97035325253996]
We show how Constrained Deep Generative Models (C-DGMs) can be transformed into realistic synthetic data models.
C-DGMs are able to exploit the background knowledge expressed by the constraints to outperform their standard counterparts.
arXiv Detail & Related papers (2024-02-07T13:22:05Z)
- Towards a Foundation Purchasing Model: Pretrained Generative Autoregression on Transaction Sequences [0.0]
We present a generative pretraining method that can be used to obtain contextualised embeddings of financial transactions.
We additionally perform large-scale pretraining of an embedding model using a corpus of data from 180 issuing banks containing 5.1 billion transactions.
arXiv Detail & Related papers (2024-01-03T09:32:48Z)
- FinDiff: Diffusion Models for Financial Tabular Data Generation [5.824064631226058]
FinDiff is a diffusion model designed to generate real-world financial data for a variety of regulatory downstream tasks.
It is evaluated against state-of-the-art baseline models using three real-world financial datasets.
arXiv Detail & Related papers (2023-09-04T09:30:15Z)
- From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition [64.59093444558549]
We propose a simple, easy-to-implement, two-step training pipeline that we call From Fake to Real.
By training on real and synthetic data separately, FFR does not expose the model to the statistical differences between real and synthetic data.
Our experiments show that FFR improves worst group accuracy over the state-of-the-art by up to 20% over three datasets.
arXiv Detail & Related papers (2023-08-08T19:52:28Z)
- Synthetic Demographic Data Generation for Card Fraud Detection Using GANs [4.651915393462367]
We build a deep-learning Generative Adversarial Network (GAN), called DGGAN, which will be used for demographic data generation.
Our model generates samples during model training, which we found important to overcome class imbalance issues.
arXiv Detail & Related papers (2023-06-29T17:08:57Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Mechanisms that Incentivize Data Sharing in Federated Learning [90.74337749137432]
We show how a naive scheme leads to catastrophic levels of free-riding where the benefits of data sharing are completely eroded.
We then introduce accuracy shaping based mechanisms to maximize the amount of data generated by each agent.
arXiv Detail & Related papers (2022-07-10T22:36:52Z)
- Anti-Money Laundering Alert Optimization Using Machine Learning with Graphs [0.769672852567215]
Money laundering is a global problem that concerns legitimizing proceeds from serious felonies (1.7-4 trillion euros annually).
We propose a machine learning triage model, which complements the rule-based system and learns to predict the risk of an alert accurately.
We validate our model on a real-world banking dataset and show how the triage model can reduce the number of false positives by 80% while detecting over 90% of true positives.
arXiv Detail & Related papers (2021-12-14T16:12:30Z)
- Generating synthetic transactional profiles [0.0]
In this paper, we generate synthetic transactional profiles using machine learning techniques.
We measured data utility by calculating common insights used by the banking industry on both the original and the synthetic data-set.
arXiv Detail & Related papers (2021-10-28T18:52:04Z)
- DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks [71.6879432974126]
We introduce DECAF: a GAN-based fair synthetic data generator for tabular data.
We show that DECAF successfully removes undesired bias and is capable of generating high-quality synthetic data.
We provide theoretical guarantees on the generator's convergence and the fairness of downstream models.
arXiv Detail & Related papers (2021-10-25T12:39:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.