Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML
Evaluation
- URL: http://arxiv.org/abs/2211.13358v2
- Date: Mon, 28 Nov 2022 11:17:46 GMT
- Title: Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML
Evaluation
- Authors: Sérgio Jesus, José Pombal, Duarte Alves, André Cruz, Pedro
Saleiro, Rita P. Ribeiro, João Gama, Pedro Bizarro
- Abstract summary: We present Bank Account Fraud (BAF), the first publicly available privacy-preserving, large-scale, realistic suite of tabular datasets.
BAF carries a set of challenges commonplace in real-world applications, including temporal dynamics and significant class imbalance.
We aim to provide the research community with a more realistic, complete, and robust test bed to evaluate novel and existing methods.
- Score: 3.737892247639591
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating new techniques on realistic datasets plays a crucial role in the
development of ML research and its broader adoption by practitioners. In recent
years, there has been a significant increase in publicly available unstructured
data resources for computer vision and NLP tasks. However, tabular data --
which is prevalent in many high-stakes domains -- has been lagging behind. To
bridge this gap, we present Bank Account Fraud (BAF), the first publicly
available privacy-preserving, large-scale, realistic suite of tabular datasets.
The suite was generated by applying state-of-the-art tabular data generation
techniques on an anonymized, real-world bank account opening fraud detection
dataset. This setting carries a set of challenges that are commonplace in
real-world applications, including temporal dynamics and significant class
imbalance. Additionally, to allow practitioners to stress test both performance
and fairness of ML methods, each dataset variant of BAF contains specific types
of data bias. With this resource, we aim to provide the research community with
a more realistic, complete, and robust test bed to evaluate novel and existing
methods.
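Under the significant class imbalance described above, plain accuracy is uninformative; fraud-detection work is commonly evaluated by the recall (true-positive rate) attained at a fixed false-positive-rate budget. A minimal sketch of that metric, assuming only per-instance fraud scores and binary labels (the 5% FPR budget and all names here are illustrative, not taken from the paper):

```python
import numpy as np

def tpr_at_fpr(y_true, scores, max_fpr=0.05):
    """Recall (TPR) at the most permissive threshold whose FPR stays <= max_fpr."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    neg, pos = (~y_true).sum(), y_true.sum()
    best_tpr = 0.0
    # Sweep candidate thresholds from highest score downward.
    for t in np.unique(scores)[::-1]:
        pred = scores >= t
        fpr = (pred & ~y_true).sum() / neg
        if fpr > max_fpr:
            break  # budget exhausted; keep the last feasible TPR
        best_tpr = (pred & y_true).sum() / pos
    return best_tpr

# Toy example: 2 fraud cases among 10 applications.
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
s = [0.9, 0.4, 0.8, 0.3, 0.2, 0.1, 0.05, 0.04, 0.03, 0.02]
print(tpr_at_fpr(y, s))  # 0.5: only the top-scored fraud case clears the FPR budget
```

With 8 negatives, a 5% FPR budget allows zero false positives here, so only the threshold at 0.9 is feasible and half the fraud cases are caught.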
Related papers
- Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks.
We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z)
- Class-Imbalanced-Aware Adaptive Dataset Distillation for Scalable Pretrained Model on Credit Scoring [10.737033782376905]
We present a novel framework for scaling up the application of large pretrained models on financial datasets.
We integrate the imbalance-aware techniques during dataset distillation, resulting in improved performance in financial datasets.
arXiv Detail & Related papers (2025-01-18T06:59:36Z)
- Benchmarking Table Comprehension In The Wild [9.224698222634789]
TableQuest is a new benchmark designed to evaluate the holistic table comprehension capabilities of Large Language Models (LLMs)
We experiment with 7 state-of-the-art models, and find that despite reasonable accuracy in locating facts, they often falter when required to execute more sophisticated reasoning or multi-step calculations.
arXiv Detail & Related papers (2024-12-13T05:52:37Z)
- Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
- Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science [17.282770819829913]
This research endeavors to apply Large Language Models (LLMs) towards addressing these predictive tasks.
Our research aims to mitigate this gap by compiling a comprehensive corpus of tables annotated with instructions and executing large-scale training of Llama-2.
arXiv Detail & Related papers (2024-03-29T14:41:21Z)
- A Dataset for the Validation of Truth Inference Algorithms Suitable for Online Deployment [76.04306818209753]
We introduce a substantial crowdsourcing annotation dataset collected from a real-world crowdsourcing platform.
This dataset comprises approximately two thousand workers, one million tasks, and six million annotations.
We evaluate the effectiveness of several representative truth inference algorithms on this dataset.
arXiv Detail & Related papers (2024-03-10T16:00:41Z)
- Towards Cross-Table Masked Pretraining for Web Data Mining [22.952238405240188]
We propose an innovative, generic, and efficient cross-table pretraining framework, dubbed as CM2.
Our experiments demonstrate CM2's state-of-the-art performance and validate that cross-table pretraining can enhance various downstream tasks.
arXiv Detail & Related papers (2023-07-10T02:27:38Z)
- Dataset Distillation: A Comprehensive Review [76.26276286545284]
Dataset distillation (DD) aims to derive a much smaller dataset of synthetic samples such that models trained on it perform comparably to models trained on the original dataset.
This paper gives a comprehensive review and summary of recent advances in DD and its application.
arXiv Detail & Related papers (2023-01-17T17:03:28Z)
- Is margin all you need? An extensive empirical study of active learning on tabular data [66.18464006872345]
We analyze the performance of a variety of active learning algorithms on 69 real-world datasets from the OpenML-CC18 benchmark.
Surprisingly, we find that the classical margin sampling technique matches or outperforms all others, including current state-of-art.
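Margin sampling, the classical technique highlighted above, ranks unlabeled points by the gap between their two highest predicted class probabilities and queries the points with the smallest gaps. A minimal sketch, assuming a matrix of per-class probabilities (the function name and toy values are illustrative):

```python
import numpy as np

def margin_sample(probs, k):
    """Indices of the k points with the smallest top-two class-probability margin."""
    probs = np.asarray(probs, dtype=float)
    # Sort each row's class probabilities descending; margin = best minus runner-up.
    sorted_probs = np.sort(probs, axis=1)[:, ::-1]
    margins = sorted_probs[:, 0] - sorted_probs[:, 1]
    return np.argsort(margins)[:k]

# Toy example: three points, two classes; points 1 and 2 are most ambiguous.
probs = [[0.9, 0.1], [0.55, 0.45], [0.6, 0.4]]
print(margin_sample(probs, 2))  # [1 2]
```

The margin is a cheap proxy for decision-boundary uncertainty, which is one plausible reason it remains competitive on tabular benchmarks.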
arXiv Detail & Related papers (2022-10-07T21:18:24Z)
- Super-App Behavioral Patterns in Credit Risk Models: Financial, Statistical and Regulatory Implications [110.54266632357673]
We present the impact of alternative data that originates from an app-based marketplace, in contrast to traditional bureau data, upon credit scoring models.
Our results, validated across two countries, show that these new sources of data are particularly useful for predicting financial behavior in low-wealth and young individuals.
arXiv Detail & Related papers (2020-05-09T01:32:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.