Tide: A Customisable Dataset Generator for Anti-Money Laundering Research
- URL: http://arxiv.org/abs/2603.01863v1
- Date: Mon, 02 Mar 2026 13:44:18 GMT
- Title: Tide: A Customisable Dataset Generator for Anti-Money Laundering Research
- Authors: Montijn van den Beukel, Jože Martin Rožanec, Ana-Lucia Varbanescu
- Abstract summary: We present Tide, an open-source synthetic dataset generator. It produces graph-based financial networks incorporating money laundering patterns. Tide enables reproducible, customisable dataset generation tailored to specific research needs.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The lack of accessible transactional data significantly hinders machine learning research for Anti-Money Laundering (AML). Privacy and legal concerns prevent the sharing of real financial data, while existing synthetic generators focus on simplistic structural patterns and neglect the temporal dynamics (timing and frequency) that characterise sophisticated laundering schemes. We present Tide, an open-source synthetic dataset generator that produces graph-based financial networks incorporating money laundering patterns defined by both structural and temporal characteristics. Tide enables reproducible, customisable dataset generation tailored to specific research needs. We release two reference datasets with varying illicit ratios (LI: 0.10%, HI: 0.19%), alongside the implementation of state-of-the-art detection models. Evaluation across these datasets reveals condition-dependent model rankings: LightGBM achieves the highest PR-AUC (78.05) in the low illicit ratio condition, while XGBoost performs best (85.12) at higher fraud prevalence. These divergent rankings demonstrate that the reference datasets can meaningfully differentiate model capabilities across operational conditions. Tide provides the research community with a configurable benchmark that exposes meaningful performance variation across model architectures, advancing the development of robust AML detection methods.
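As a rough illustration of why PR-AUC is reported at such low illicit ratios, the sketch below computes average precision, one common estimator of PR-AUC, on a toy label set with a 0.10% illicit ratio. The labels, scores, and the `average_precision` helper are hypothetical and are not taken from Tide or its reference datasets; the abstract does not state which PR-AUC estimator the authors use.

```python
def average_precision(labels, scores):
    """Mean of the precision values at each rank where a positive item appears.

    Simplification (assumption): tied scores are ranked in input order rather
    than grouped into a single threshold.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank items from highest to lowest score; sorted() is stable on ties.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / total_pos


# Toy data (hypothetical): 2000 transactions, 2 illicit, i.e. a 0.10% ratio.
labels = [0] * 2000
scores = [0.1] * 2000
labels[10], scores[10] = 1, 0.9   # illicit, ranked first
labels[20], scores[20] = 1, 0.6   # illicit, ranked fourth
scores[30] = 0.8                  # legitimate, ranked second
scores[40] = 0.7                  # legitimate, ranked third

prevalence = sum(labels) / len(labels)
ap = average_precision(labels, scores)
print(f"illicit ratio: {prevalence:.4f}, average precision: {ap:.2f}")
```

The precision of an uninformative ranking is roughly the illicit ratio itself, so at 0.10% prevalence even a moderate PR-AUC represents a large lift over chance, whereas accuracy would be near 100% for a model that flags nothing.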
Related papers
- Synthetic Financial Data Generation for Enhanced Financial Modelling [0.0]
This paper presents a unified multi-criteria evaluation framework for synthetic financial data. Using historical S&P 500 daily data, we evaluate fidelity (Maximum Mean Discrepancy, MMD), temporal structure (autocorrelation and volatility clustering), and practical utility in downstream tasks. We articulate practical guidelines for selecting generative models according to application needs and computational constraints.
arXiv Detail & Related papers (2025-12-25T21:43:16Z)
- Dynamic Evaluation for Oversensitivity in LLMs [68.27609301865174]
Oversensitivity occurs when language models defensively reject prompts that are actually benign. This behavior not only disrupts user interactions but also obscures the boundary between harmful and harmless content. Existing benchmarks rely on static datasets that degrade over time as models evolve.
arXiv Detail & Related papers (2025-10-21T18:33:47Z)
- Estimating Time Series Foundation Model Transferability via In-Context Learning [74.65355820906355]
Time series foundation models (TSFMs) offer strong zero-shot forecasting via large-scale pre-training. Fine-tuning remains critical for boosting performance in domains with limited public data. We introduce TimeTic, a transferability estimation framework that recasts model selection as an in-context-learning problem.
arXiv Detail & Related papers (2025-09-28T07:07:13Z)
- MPOCryptoML: Multi-Pattern based Off-Chain Crypto Money Laundering Detection [2.2530496464901106]
We propose MPOCryptoML to effectively detect multiple laundering patterns in cryptocurrency transactions. MPOCryptoML includes the development of a multi-source Personalized PageRank algorithm to identify random laundering patterns. We show consistent performance gains, with improvements up to 9.13% in precision, up to 10.16% in recall, up to 7.63% in F1-score, and up to 10.19% in accuracy.
arXiv Detail & Related papers (2025-08-18T06:06:32Z)
- Evaluating Privacy-Utility Tradeoffs in Synthetic Smart Grid Data [9.927400227483428]
We conduct a comparative evaluation of four synthetic data generation methods. We assess classification utility, distribution fidelity, and privacy leakage. These findings highlight the potential of structured generative models for developing privacy-preserving, data-driven energy systems.
arXiv Detail & Related papers (2025-05-20T10:46:29Z)
- CBW: Towards Dataset Ownership Verification for Speaker Verification via Clustering-based Backdoor Watermarking [85.68235482145091]
Large-scale speech datasets have become valuable intellectual property. We propose a novel dataset ownership verification method. Our approach introduces a clustering-based backdoor watermark (CBW). We conduct extensive experiments on benchmark datasets, verifying the effectiveness and robustness of our method against potential adaptive attacks.
arXiv Detail & Related papers (2025-03-02T02:02:57Z)
- Model Stealing Attack against Graph Classification with Authenticity, Uncertainty and Diversity [80.16488817177182]
GNNs are vulnerable to the model stealing attack, a nefarious endeavor geared towards duplicating the target model via query permissions.
We introduce three model stealing attacks to adapt to different actual scenarios.
arXiv Detail & Related papers (2023-12-18T05:42:31Z)
- Test-Time Adaptation Induces Stronger Accuracy and Agreement-on-the-Line [65.14099135546594]
Recent test-time adaptation (TTA) methods drastically strengthen the ACL and AGL trends in models, even in shifts where models showed very weak correlations before.
Our results show that by combining TTA with AGL-based estimation methods, we can estimate the OOD performance of models with high precision for a broader set of distribution shifts.
arXiv Detail & Related papers (2023-10-07T23:21:25Z)
- Realistic Synthetic Financial Transactions for Anti-Money Laundering Models [2.3802629107286046]
Money laundering is the movement of illicit funds to conceal their origins.
The UN estimates that 2-5% of global GDP, or $0.8-$2.0 trillion, is laundered globally each year.
This paper contributes a synthetic financial transaction dataset generator and a set of synthetically generated AML datasets.
arXiv Detail & Related papers (2023-06-22T10:32:51Z)
- CausalAgents: A Robustness Benchmark for Motion Forecasting using Causal Relationships [8.679073301435265]
We construct a new benchmark for evaluating and improving model robustness by applying perturbations to existing data.
We use these labels to perturb the data by deleting non-causal agents from the scene.
Under non-causal perturbations, we observe a 25-38% relative change in minADE as compared to the original.
arXiv Detail & Related papers (2022-07-07T21:28:23Z)
- Generative Modeling Helps Weak Supervision (and Vice Versa) [87.62271390571837]
We propose a model fusing weak supervision and generative adversarial networks.
It captures discrete variables in the data alongside the weak supervision derived label estimate.
It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels.
arXiv Detail & Related papers (2022-03-22T20:24:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.