FiFAR: A Fraud Detection Dataset for Learning to Defer
- URL: http://arxiv.org/abs/2312.13218v1
- Date: Wed, 20 Dec 2023 17:36:36 GMT
- Title: FiFAR: A Fraud Detection Dataset for Learning to Defer
- Authors: Jean V. Alves, Diogo Leitão, Sérgio Jesus, Marco O. P. Sampaio,
Pedro Saleiro, Mário A. T. Figueiredo, Pedro Bizarro
- Abstract summary: We introduce the Financial Fraud Alert Review dataset (FiFAR), a synthetic bank account fraud detection dataset.
FiFAR contains the predictions of a team of 50 highly complex synthetic fraud analysts, with varying biases and feature dependence.
We use our dataset to develop a capacity-aware L2D method and rejection learning approach under realistic data availability conditions.
- Score: 9.187694794359498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Public dataset limitations have significantly hindered the development and
benchmarking of learning to defer (L2D) algorithms, which aim to optimally
combine human and AI capabilities in hybrid decision-making systems. In such
systems, human availability and domain-specific concerns introduce
difficulties, while obtaining human predictions for training and evaluation is
costly. Financial fraud detection is a high-stakes setting where algorithms and
human experts often work in tandem; however, there are no publicly available
datasets for L2D concerning this important application of human-AI teaming. To
fill this gap in L2D research, we introduce the Financial Fraud Alert Review
Dataset (FiFAR), a synthetic bank account fraud detection dataset, containing
the predictions of a team of 50 highly complex and varied synthetic fraud
analysts, with varied bias and feature dependence. We also provide a realistic
definition of human work capacity constraints, an aspect of L2D systems that is
often overlooked, allowing for extensive testing of assignment systems under
real-world conditions. We use our dataset to develop a capacity-aware L2D
method and rejection learning approach under realistic data availability
conditions, and benchmark these baselines under an array of 300 distinct
testing scenarios. We believe that this dataset will serve as a pivotal
instrument in facilitating a systematic, rigorous, reproducible, and
transparent evaluation and comparison of L2D methods, thereby fostering the
development of more synergistic human-AI collaboration in decision-making
systems. The public dataset and detailed synthetic expert information are
available at: https://github.com/feedzai/fifar-dataset
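The capacity-aware deferral idea in the abstract can be illustrated with a minimal sketch: route the alerts the model is least confident about to human analysts, but never exceed the analysts' work capacity. The function name, scoring rule, and capacity numbers below are illustrative assumptions, not the FiFAR baselines themselves.

```python
# Hypothetical sketch of capacity-aware deferral: the alerts the model
# is least confident about are sent to human analysts, up to a fixed
# per-batch capacity; everything else is decided automatically.

def assign_alerts(model_scores, analyst_capacity):
    """model_scores: list of (alert_id, fraud_probability).
    analyst_capacity: total number of alerts humans can review.
    Returns (deferred_ids, auto_decided), where auto_decided maps
    alert_id -> model decision for alerts kept by the AI."""
    # Uncertainty is highest when the score is near 0.5.
    by_uncertainty = sorted(model_scores, key=lambda s: abs(s[1] - 0.5))
    deferred = [alert_id for alert_id, _ in by_uncertainty[:analyst_capacity]]
    auto = {alert_id: score >= 0.5
            for alert_id, score in by_uncertainty[analyst_capacity:]}
    return deferred, auto

scores = [("a1", 0.97), ("a2", 0.52), ("a3", 0.08), ("a4", 0.45)]
deferred, auto = assign_alerts(scores, analyst_capacity=2)
# The two most uncertain alerts (a2, a4) go to humans; a1 is flagged
# as fraud and a3 is approved automatically.
```

Real L2D methods replace the simple distance-to-0.5 uncertainty with learned models of both classifier and expert performance, but the capacity constraint enters the assignment step in essentially this way.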
Related papers
- Coverage-Constrained Human-AI Cooperation with Multiple Experts [21.247853435529446]
We propose the Coverage-constrained Learning to Defer and Complement with Specific Experts (CL2DC) method.
CL2DC makes final decisions through either AI prediction alone or by deferring to or complementing a specific expert.
It achieves superior performance compared to state-of-the-art HAI-CC methods.
arXiv Detail & Related papers (2024-11-18T19:06:01Z)
- Cost-Sensitive Learning to Defer to Multiple Experts with Workload Constraints [10.917274244918985]
Learning to defer aims to improve human-AI collaboration systems by learning how to defer decisions to humans when they are more likely to be correct than an ML classifier.
Existing research in L2D overlooks key real-world aspects that impede its practical adoption.
DeCCaF is a novel L2D approach, employing supervised learning to model the probability of human error under less restrictive data requirements.
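The idea summarized above, modeling the probability of human error with supervised learning and routing each case to the decision-maker least likely to err, can be sketched as follows. The per-expert, per-category error rates here are a deliberately trivial stand-in for DeCCaF's actual error model; all names are illustrative.

```python
# Hypothetical sketch of deferral based on a learned model of human
# error: estimate each expert's error rate from past decisions, then
# route a case to whichever decision-maker (AI or expert) has the
# lowest estimated error probability.
from collections import defaultdict

def fit_error_rates(history):
    """history: list of (expert_id, case_category, was_correct)."""
    counts = defaultdict(lambda: [0, 0])  # (errors, total) per key
    for expert, category, correct in history:
        stats = counts[(expert, category)]
        stats[0] += 0 if correct else 1
        stats[1] += 1
    return {key: errs / total for key, (errs, total) in counts.items()}

def route(case_category, experts, error_rates, ai_error):
    """Pick the AI or the expert with the lowest estimated error."""
    best = ("ai", ai_error)
    for expert in experts:
        # Unseen (expert, category) pairs fall back to a 50% error prior.
        err = error_rates.get((expert, case_category), 0.5)
        if err < best[1]:
            best = (expert, err)
    return best[0]
```

For example, if expert e1 was always correct on "card" cases and e2 always wrong, `route("card", ["e1", "e2"], rates, ai_error=0.1)` picks e1; with only e2 available, the case stays with the AI.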
arXiv Detail & Related papers (2024-03-11T16:57:20Z)
- DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights.
We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs.
Our DACO-RL algorithm is judged by human annotators to produce more helpful answers than the SFT model in 57.72% of cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Human-Centric Multimodal Machine Learning: Recent Advances and Testbed on AI-based Recruitment [66.91538273487379]
There is a certain consensus about the need to develop AI applications with a Human-Centric approach.
Human-Centric Machine Learning needs to be developed based on four main requirements: (i) utility and social good; (ii) privacy and data ownership; (iii) transparency and accountability; and (iv) fairness in AI-driven decision-making processes.
We study how current multimodal algorithms based on heterogeneous sources of information are affected by sensitive elements and inner biases in the data.
arXiv Detail & Related papers (2023-02-13T16:44:44Z)
- Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
- Efficient Realistic Data Generation Framework leveraging Deep Learning-based Human Digitization [0.0]
The proposed method takes as input real background images and populates them with human figures in various poses.
A benchmarking and evaluation in the corresponding tasks shows that synthetic data can be effectively used as a supplement to real data.
arXiv Detail & Related papers (2021-06-28T08:07:31Z)
- Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
- A Human-in-the-Loop Approach based on Explainability to Improve NTL Detection [0.12183405753834559]
This work explains our human-in-the-loop approach to mitigating problems in a real system that uses a supervised model to detect Non-Technical Losses (NTL).
This approach exploits human knowledge (e.g. from the data scientists or the company's stakeholders) and the information provided by explanatory methods to guide the system during the training process.
The results show that the derived prediction model is better in terms of accuracy, interpretability, robustness and flexibility.
arXiv Detail & Related papers (2020-09-28T16:04:07Z)
- Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.