Hybrid Data can Enhance the Utility of Synthetic Data for Training Anti-Money Laundering Models
- URL: http://arxiv.org/abs/2509.18499v1
- Date: Tue, 23 Sep 2025 01:03:23 GMT
- Title: Hybrid Data can Enhance the Utility of Synthetic Data for Training Anti-Money Laundering Models
- Authors: Rachel Chung, Pratyush Nidhi Sharma, Mikko Siponen, Rohit Vadodaria, Luke Smith
- Abstract summary: A major issue for developing such models is the lack of access to training data due to privacy and confidentiality concerns. This article proposes the use of hybrid datasets to augment the utility of synthetic datasets by incorporating publicly available, easily accessible, and real-world features.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Money laundering is a critical global issue for financial institutions. Automated Anti-money laundering (AML) models, like Graph Neural Networks (GNN), can be trained to identify illicit transactions in real time. A major issue for developing such models is the lack of access to training data due to privacy and confidentiality concerns. Synthetically generated data that mimics the statistical properties of real data but preserves privacy and confidentiality has been proposed as a solution. However, training AML models on purely synthetic datasets presents its own set of challenges. This article proposes the use of hybrid datasets to augment the utility of synthetic datasets by incorporating publicly available, easily accessible, and real-world features. These additions demonstrate that hybrid datasets not only preserve privacy but also improve model utility, offering a practical pathway for financial institutions to enhance AML systems.
Related papers
- Generative Models for Synthetic Data: Transforming Data Mining in the GenAI Era [49.46005489386284]
This tutorial introduces the foundations and latest advances in synthetic data generation. Attendees will gain actionable insights into leveraging generative synthetic data to enhance data mining research and practice.
arXiv Detail & Related papers (2025-08-27T05:04:07Z) - PuckTrick: A Library for Making Synthetic Data More Realistic [46.198289193451146]
We introduce Pucktrick, a Python library designed to systematically contaminate synthetic datasets by introducing controlled errors. We evaluate the impact of systematic data contamination on model performance. Our findings demonstrate that ML models trained on contaminated synthetic data outperform those trained on purely synthetic, error-free data.
arXiv Detail & Related papers (2025-06-23T10:51:45Z) - The Canary's Echo: Auditing Privacy Risks of LLM-Generated Synthetic Text [23.412546862849396]
We assume an adversary has access to some synthetic data generated by a Large Language Model (LLM). We design membership inference attacks (MIAs) that target the training data used to fine-tune the LLM that is then used to synthesize data. We find that canaries crafted for model-based MIAs are sub-optimal for privacy auditing when only synthetic data is released.
arXiv Detail & Related papers (2025-02-19T15:30:30Z) - SafeSynthDP: Leveraging Large Language Models for Privacy-Preserving Synthetic Data Generation Using Differential Privacy [0.0]
We investigate the capability of Large Language Models (LLMs) to generate synthetic datasets with Differential Privacy (DP) mechanisms. Our approach incorporates DP-based noise injection methods, including Laplace and Gaussian distributions, into the data generation process. We then evaluate the utility of these DP-enhanced synthetic datasets by comparing the performance of ML models trained on them against models trained on the original data.
arXiv Detail & Related papers (2024-12-30T01:10:10Z) - Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - FinDiff: Diffusion Models for Financial Tabular Data Generation [5.824064631226058]
FinDiff is a diffusion model designed to generate real-world financial data for a variety of regulatory downstream tasks.
It is evaluated against state-of-the-art baseline models using three real-world financial datasets.
arXiv Detail & Related papers (2023-09-04T09:30:15Z) - The Use of Synthetic Data to Train AI Models: Opportunities and Risks for Sustainable Development [0.6906005491572401]
This paper investigates the policies governing the creation, utilization, and dissemination of synthetic data.
A well-crafted synthetic data policy must strike a balance between privacy concerns and the utility of data.
arXiv Detail & Related papers (2023-08-31T23:18:53Z) - Realistic Synthetic Financial Transactions for Anti-Money Laundering Models [2.3802629107286046]
Money laundering is the movement of illicit funds to conceal their origins.
The UN estimates that 2-5% of global GDP, or $0.8 - $2.0 trillion, is laundered globally each year.
This paper contributes a synthetic financial transaction dataset generator and a set of synthetically generated AML datasets.
arXiv Detail & Related papers (2023-06-22T10:32:51Z) - Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z) - Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.