Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML Evaluation
- URL: http://arxiv.org/abs/2211.13358v2
- Date: Mon, 28 Nov 2022 11:17:46 GMT
- Authors: Sérgio Jesus, José Pombal, Duarte Alves, André Cruz, Pedro Saleiro, Rita P. Ribeiro, João Gama, Pedro Bizarro
- Abstract summary: We present Bank Account Fraud (BAF), the first publicly available privacy-preserving, large-scale, realistic suite of tabular datasets.
BAF poses challenges that are commonplace in real-world applications, including temporal dynamics and significant class imbalance.
We aim to provide the research community with a more realistic, complete, and robust test bed to evaluate novel and existing methods.
- Score: 3.737892247639591
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating new techniques on realistic datasets plays a crucial role in the
development of ML research and its broader adoption by practitioners. In recent
years, there has been a significant increase of publicly available unstructured
data resources for computer vision and NLP tasks. However, tabular data --
which is prevalent in many high-stakes domains -- has been lagging behind. To
bridge this gap, we present Bank Account Fraud (BAF), the first publicly
available privacy-preserving, large-scale, realistic suite of tabular datasets.
The suite was generated by applying state-of-the-art tabular data generation
techniques on an anonymized, real-world bank account opening fraud detection
dataset. This setting carries a set of challenges that are commonplace in
real-world applications, including temporal dynamics and significant class
imbalance. Additionally, to allow practitioners to stress test both performance
and fairness of ML methods, each dataset variant of BAF contains specific types
of data bias. With this resource, we aim to provide the research community with
a more realistic, complete, and robust test bed to evaluate novel and existing
methods.
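The evaluation concerns named in the abstract (temporal dynamics, class imbalance, and fairness) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper or the BAF datasets: the rows are synthetic, the tuple layout `(month, group, score, is_fraud)` is hypothetical, the scores stand in for the output of some already-trained model, and the 5% false-positive-rate budget and month-6 cutoff are arbitrary choices.

```python
import random


def make_rows(n=20000, seed=0):
    """Synthetic stand-in rows: (month, group, score, is_fraud). Not BAF data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        month = rng.randrange(8)              # hypothetical 8-month window
        group = rng.choice(["A", "B"])        # hypothetical protected attribute
        is_fraud = rng.random() < 0.01        # roughly 1% positives (imbalanced)
        score = rng.gauss(1.5 if is_fraud else 0.0, 1.0)  # stand-in model score
        rows.append((month, group, score, is_fraud))
    return rows


def temporal_split(rows, first_test_month):
    """Earlier months go to train, later months to test (no shuffling)."""
    train = [r for r in rows if r[0] < first_test_month]
    test = [r for r in rows if r[0] >= first_test_month]
    return train, test


def threshold_at_fpr(rows, target_fpr):
    """Pick a threshold so that at most target_fpr of negatives score above it."""
    neg_scores = sorted((r[2] for r in rows if not r[3]), reverse=True)
    k = int(target_fpr * len(neg_scores))     # number of allowed false positives
    return neg_scores[k]


def rates(rows, threshold):
    """Return (recall, false-positive rate) when flagging score > threshold."""
    pos = sum(1 for r in rows if r[3])
    neg = len(rows) - pos
    tp = sum(1 for r in rows if r[3] and r[2] > threshold)
    fp = sum(1 for r in rows if not r[3] and r[2] > threshold)
    recall = tp / pos if pos else float("nan")
    fpr = fp / neg if neg else float("nan")
    return recall, fpr


rows = make_rows()
train, test = temporal_split(rows, first_test_month=6)
thr = threshold_at_fpr(train, target_fpr=0.05)   # threshold chosen on the past only
recall, fpr = rates(test, thr)                   # performance on later months
fpr_by_group = {
    g: rates([r for r in test if r[1] == g], thr)[1] for g in ("A", "B")
}
print(f"recall={recall:.3f} fpr={fpr:.3f} fpr_by_group={fpr_by_group}")
```

Fixing the threshold on earlier months and reporting recall on later ones avoids leaking future information, and reporting a threshold-based rate rather than accuracy keeps the rare positive class visible. Comparing the false-positive rate across groups is one simple way to surface the kind of disparity a biased dataset variant might induce.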
Related papers
- Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.90769864167301]
Literature review tables are essential for summarizing and comparing collections of scientific papers.
We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers.
Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques.
arXiv Detail & Related papers (2025-04-14T14:52:28Z)
- LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation [49.898152180805454]
This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation.
LLM-TabFlow is a novel approach that captures complex inter-column relationships and compresses the data, while using score-based diffusion to model the distribution of the compressed data in latent space.
Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy.
arXiv Detail & Related papers (2025-03-04T00:47:52Z)
- FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting [58.70072722290475]
Financial time series (FinTS) record the behavior of human-brain-augmented decision-making.
FinTSB is a comprehensive and practical benchmark for financial time series forecasting.
arXiv Detail & Related papers (2025-02-26T05:19:16Z)
- Class-Imbalanced-Aware Adaptive Dataset Distillation for Scalable Pretrained Model on Credit Scoring [10.737033782376905]
We present a novel framework for scaling up the application of large pretrained models on financial datasets.
We integrate the imbalance-aware techniques during dataset distillation, resulting in improved performance in financial datasets.
arXiv Detail & Related papers (2025-01-18T06:59:36Z)
- Benchmarking Table Comprehension In The Wild [9.224698222634789]
TableQuest is a new benchmark designed to evaluate the holistic table comprehension capabilities of Large Language Models (LLMs).
We experiment with 7 state-of-the-art models, and find that despite reasonable accuracy in locating facts, they often falter when required to execute more sophisticated reasoning or multi-step calculations.
arXiv Detail & Related papers (2024-12-13T05:52:37Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Model (LLM) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
- Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science [17.910306140400046]
This research endeavors to apply Large Language Models (LLMs) towards addressing these predictive tasks.
Our research aims to mitigate this gap by compiling a comprehensive corpus of tables annotated with instructions and executing large-scale training of Llama-2.
arXiv Detail & Related papers (2024-03-29T14:41:21Z)
- A Dataset for the Validation of Truth Inference Algorithms Suitable for Online Deployment [76.04306818209753]
We introduce a substantial crowdsourcing annotation dataset collected from a real-world crowdsourcing platform.
This dataset comprises approximately two thousand workers, one million tasks, and six million annotations.
We evaluate the effectiveness of several representative truth inference algorithms on this dataset.
arXiv Detail & Related papers (2024-03-10T16:00:41Z)
- ACLSum: A New Dataset for Aspect-based Summarization of Scientific Publications [10.529898520273063]
ACLSum is a novel summarization dataset carefully crafted and evaluated by domain experts.
In contrast to previous datasets, ACLSum facilitates multi-aspect summarization of scientific papers.
arXiv Detail & Related papers (2024-03-08T13:32:01Z)
- Towards Cross-Table Masked Pretraining for Web Data Mining [22.952238405240188]
We propose an innovative, generic, and efficient cross-table pretraining framework, dubbed as CM2.
Our experiments demonstrate CM2's state-of-the-art performance and validate that cross-table pretraining can enhance various downstream tasks.
arXiv Detail & Related papers (2023-07-10T02:27:38Z)
- Dataset Distillation: A Comprehensive Review [76.26276286545284]
Dataset distillation (DD) aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset.
This paper gives a comprehensive review and summary of recent advances in DD and its application.
arXiv Detail & Related papers (2023-01-17T17:03:28Z)
- Is margin all you need? An extensive empirical study of active learning on tabular data [66.18464006872345]
We analyze the performance of a variety of active learning algorithms on 69 real-world datasets from the OpenML-CC18 benchmark.
Surprisingly, we find that the classical margin sampling technique matches or outperforms all others, including the current state of the art.
arXiv Detail & Related papers (2022-10-07T21:18:24Z)
- Deeply-Learned Generalized Linear Models with Missing Data [6.302686933168439]
We provide a formal treatment of missing data in the context of deeply learned generalized linear models.
We propose a new architecture, dlglm, that is able to flexibly account for both ignorable and non-ignorable patterns of missingness.
We conclude with a case study of a Bank Marketing dataset from the UCI Machine Learning Repository.
arXiv Detail & Related papers (2022-07-18T20:00:13Z)
- Super-App Behavioral Patterns in Credit Risk Models: Financial, Statistical and Regulatory Implications [110.54266632357673]
We present the impact of alternative data that originates from an app-based marketplace, in contrast to traditional bureau data, upon credit scoring models.
Our results, validated across two countries, show that these new sources of data are particularly useful for predicting financial behavior in low-wealth and young individuals.
arXiv Detail & Related papers (2020-05-09T01:32:03Z)