Evaluating Transfer Learning Methods on Real-World Data Streams: A Case Study in Financial Fraud Detection
- URL: http://arxiv.org/abs/2508.02702v1
- Date: Tue, 29 Jul 2025 14:12:21 GMT
- Title: Evaluating Transfer Learning Methods on Real-World Data Streams: A Case Study in Financial Fraud Detection
- Authors: Ricardo Ribeiro Pereira, Jacopo Bono, Hugo Ferreira, Pedro Ribeiro, Carlos Soares, Pedro Bizarro
- Abstract summary: When the available data for a target domain is limited, transfer learning (TL) methods can be used to develop models on related data-rich domains. We propose a data manipulation framework that simulates varying data availability scenarios over time. We demonstrate the usefulness of the proposed framework by performing a case study on a proprietary real-world suite of card payment datasets.
- Score: 4.689506737427387
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: When the available data for a target domain is limited, transfer learning (TL) methods can be used to develop models on related data-rich domains, before deploying them on the target domain. However, these TL methods are typically designed with specific, static assumptions on the amount of available labeled and unlabeled target data. This is in contrast with many real world applications, where the availability of data and corresponding labels varies over time. Since the evaluation of the TL methods is typically also performed under the same static data availability assumptions, this would lead to unrealistic expectations concerning their performance in real world settings. To support a more realistic evaluation and comparison of TL algorithms and models, we propose a data manipulation framework that (1) simulates varying data availability scenarios over time, (2) creates multiple domains through resampling of a given dataset and (3) introduces inter-domain variability by applying realistic domain transformations, e.g., creating a variety of potentially time-dependent covariate and concept shifts. These capabilities enable simulation of a large number of realistic variants of the experiments, in turn providing more information about the potential behavior of algorithms when deployed in dynamic settings. We demonstrate the usefulness of the proposed framework by performing a case study on a proprietary real-world suite of card payment datasets. Given the confidential nature of the case study, we also illustrate the use of the framework on the publicly available Bank Account Fraud (BAF) dataset. By providing a methodology for evaluating TL methods over time and in realistic data availability scenarios, our framework facilitates understanding of the behavior of models and algorithms. This leads to better decision making when deploying models for new domains in real-world environments.
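The three capabilities named in the abstract (resampling one dataset into several domains, applying covariate and concept shifts, and varying data availability over time) can be illustrated with a small sketch. This is not the authors' framework: every function name, the row fields `t`, `amount` and `label`, the toy data, and the fixed label delay are assumptions made purely for illustration.

```python
import random

def make_domains(rows, n_domains, size, seed=0):
    """Create several domains by resampling one dataset with replacement."""
    rng = random.Random(seed)
    return [[dict(rng.choice(rows)) for _ in range(size)]
            for _ in range(n_domains)]

def covariate_shift(domain, feature, scale, offset):
    """Rescale and offset one feature; labels are left untouched."""
    return [{**r, feature: r[feature] * scale + offset} for r in domain]

def concept_shift(domain, flip_rate, start_time, seed=0):
    """Flip binary labels with probability flip_rate from start_time onward."""
    rng = random.Random(seed)
    out = []
    for r in domain:
        label = r["label"]
        if r["t"] >= start_time and rng.random() < flip_rate:
            label = 1 - label
        out.append({**r, "label": label})
    return out

def available_at(domain, now, label_delay):
    """Split rows observed by `now` into labeled and still-unlabeled sets."""
    seen = [r for r in domain if r["t"] <= now]
    labeled = [r for r in seen if r["t"] + label_delay <= now]
    unlabeled = [{k: v for k, v in r.items() if k != "label"}
                 for r in seen if r["t"] + label_delay > now]
    return labeled, unlabeled

# Hypothetical toy data: one transaction per time step, about 5% positives.
rng = random.Random(42)
source = [{"t": t, "amount": rng.uniform(1, 500),
           "label": int(rng.random() < 0.05)} for t in range(100)]

domains = make_domains(source, n_domains=3, size=50, seed=1)
target = concept_shift(
    covariate_shift(domains[0], "amount", scale=1.5, offset=10.0),
    flip_rate=0.2, start_time=60, seed=2)

# Labeled data grows over time while recent rows remain unlabeled.
for now in (20, 50, 80):
    labeled, unlabeled = available_at(target, now, label_delay=15)
    print(now, len(labeled), len(unlabeled))
```

Evaluating a TL method at several values of `now`, rather than at one fixed snapshot, is the kind of over-time comparison the abstract argues for; a real study would replace the toy transformations with domain-appropriate ones.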
Related papers
- Model-Free Counterfactual Subset Selection at Scale [11.646993755965006]
Streaming explanations offer adaptive, real-time insights without requiring persistent storage of the entire dataset. Our algorithm operates efficiently in streaming settings, maintaining $O(\log k)$ update complexity per item. Empirical evaluations on both real-world and synthetic datasets demonstrate superior performance over baseline methods.
arXiv Detail & Related papers (2025-02-12T11:48:15Z) - Generate to Discriminate: Expert Routing for Continual Learning [59.71853576559306]
Generate to Discriminate (G2D) is a continual learning method that leverages synthetic data to train a domain-discriminator. We observe that G2D outperforms competitive domain-incremental learning methods on tasks in both vision and language modalities.
arXiv Detail & Related papers (2024-12-22T13:16:28Z) - Testing Generalizability in Causal Inference [3.547529079746247]
No formal procedure exists for statistically evaluating generalizability in machine learning algorithms. We propose a systematic framework for statistically evaluating the generalizability of high-dimensional causal inference models.
arXiv Detail & Related papers (2024-11-05T11:44:00Z) - SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities [55.87169702896249]
Unsupervised Domain Adaptation (DA) consists of adapting a model trained on a labeled source domain to perform well on an unlabeled target domain with some data distribution shift. We present a complete and fair evaluation of existing shallow algorithms, including reweighting, mapping, and subspace alignment. Our benchmark highlights the importance of realistic validation and provides practical guidance for real-life applications.
arXiv Detail & Related papers (2024-07-16T12:52:29Z) - Cross-user activity recognition via temporal relation optimal transport [0.0]
Current research on human activity recognition (HAR) mainly assumes that training and testing data are drawn from the same distribution to achieve a generalised model.
We propose the temporal relation optimal transport (TROT) method to utilise temporal relation and relax the i.i.d. assumption.
arXiv Detail & Related papers (2024-03-12T22:33:56Z) - SALUDA: Surface-based Automotive Lidar Unsupervised Domain Adaptation [62.889835139583965]
We introduce an unsupervised auxiliary task of learning an implicit underlying surface representation simultaneously on source and target data.
As both domains share the same latent representation, the model is forced to accommodate discrepancies between the two sources of data.
Our experiments demonstrate that our method achieves a better performance than the current state of the art, both in real-to-real and synthetic-to-real scenarios.
arXiv Detail & Related papers (2023-04-06T17:36:23Z) - One-Shot Domain Adaptive and Generalizable Semantic Segmentation with Class-Aware Cross-Domain Transformers [96.51828911883456]
Unsupervised sim-to-real domain adaptation (UDA) for semantic segmentation aims to improve the real-world test performance of a model trained on simulated data.
Traditional UDA often assumes that there are abundant unlabeled real-world data samples available during training for the adaptation.
We explore the one-shot unsupervised sim-to-real domain adaptation (OSUDA) and generalization problem, where only one real-world data sample is available.
arXiv Detail & Related papers (2022-12-14T15:54:15Z) - TAL: Two-stream Adaptive Learning for Generalizable Person Re-identification [115.31432027711202]
We argue that both domain-specific and domain-invariant features are crucial for improving the generalization ability of re-id models.
We name two-stream adaptive learning (TAL) to simultaneously model these two kinds of information.
Our framework can be applied to both single-source and multi-source domain generalization tasks.
arXiv Detail & Related papers (2021-11-29T01:27:42Z) - Inferring Latent Domains for Unsupervised Deep Domain Adaptation [54.963823285456925]
Unsupervised Domain Adaptation (UDA) refers to the problem of learning a model in a target domain where labeled data are not available.
This paper introduces a novel deep architecture which addresses the problem of UDA by automatically discovering latent domains in visual datasets.
We evaluate our approach on publicly available benchmarks, showing that it outperforms state-of-the-art domain adaptation methods.
arXiv Detail & Related papers (2021-03-25T14:33:33Z) - Machine Learning for Temporal Data in Finance: Challenges and Opportunities [0.0]
Temporal data are ubiquitous in the financial services (FS) industry.
But machine learning efforts often fail to account for the temporal richness of these data.
arXiv Detail & Related papers (2020-09-11T19:39:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.