Comparative Analysis of Transformers for Modeling Tabular Data: A
Casestudy using Industry Scale Dataset
- URL: http://arxiv.org/abs/2311.14335v1
- Date: Fri, 24 Nov 2023 08:16:39 GMT
- Authors: Usneek Singh, Piyush Arora, Shamika Ganesan, Mohit Kumar, Siddhant
Kulkarni, Salil R. Joshi
- Abstract summary: The study conducts an extensive examination of various transformer-based models using both synthetic datasets and the default prediction Kaggle dataset (2022) from American Express.
The paper presents crucial insights into optimal data pre-processing, compares pre-training and direct supervised learning methods, discusses strategies for managing categorical and numerical features, and highlights trade-offs between computational resources and performance.
- Score: 1.0758036046280266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We perform a comparative analysis of transformer-based models designed for
modeling tabular data, specifically on an industry-scale dataset. While earlier
studies demonstrated promising outcomes on smaller public or synthetic
datasets, the effectiveness did not extend to larger industry-scale datasets.
The challenges identified include handling high-dimensional data, the necessity
for efficient pre-processing of categorical and numerical features, and
addressing substantial computational requirements.
To overcome the identified challenges, the study conducts an extensive
examination of various transformer-based models using both synthetic datasets
and the default prediction Kaggle dataset (2022) from American Express. The
paper presents crucial insights into optimal data pre-processing, compares
pre-training and direct supervised learning methods, discusses strategies for
managing categorical and numerical features, and highlights trade-offs between
computational resources and performance. Focusing on temporal financial data
modeling, the research aims to facilitate the systematic development and
deployment of transformer-based models in real-world scenarios, emphasizing
scalability.
Related papers
- Diffusion Models for Tabular Data Imputation and Synthetic Data Generation [3.667364190843767]
Diffusion models have emerged as powerful generative models capable of capturing complex data distributions.
In this paper, we propose a diffusion model for tabular data that introduces three key enhancements.
The conditioning attention mechanism is designed to improve the model's ability to capture the relationship between the condition and synthetic data.
The transformer layers help model interactions within the condition (encoder) or synthetic data (decoder), while dynamic masking enables our model to efficiently handle both missing data imputation and synthetic data generation tasks.
arXiv Detail & Related papers (2024-07-02T15:27:06Z)
- A Data-Centric Perspective on Evaluating Machine Learning Models for Tabular Data [9.57464542357693]
This paper demonstrates that model-centric evaluations are biased, as real-world modeling pipelines often require dataset-specific preprocessing and feature engineering.
We select 10 relevant datasets from Kaggle competitions and implement expert-level preprocessing pipelines for each dataset.
After dataset-specific feature engineering, model rankings change considerably, performance differences decrease, and the importance of model selection reduces.
arXiv Detail & Related papers (2024-07-02T09:54:39Z)
- Expansive Synthesis: Generating Large-Scale Datasets from Minimal Samples [13.053285552524052]
This paper introduces an innovative Expansive Synthesis model that generates high-fidelity datasets from minimal samples.
We validate our Expansive Synthesis by training classifiers on the generated datasets and comparing their performance to that of classifiers trained on larger, original datasets.
arXiv Detail & Related papers (2024-06-25T02:59:02Z)
- A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples.
Existing literature surveys each focus on only one specific data modality.
We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z)
- Best Practices and Lessons Learned on Synthetic Data for Language Models [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
- Comprehensive Exploration of Synthetic Data Generation: A Survey [4.485401662312072]
This work surveys 417 Synthetic Data Generation models over the last decade.
The findings reveal increased model performance and complexity, with neural network-based approaches prevailing.
Computer vision dominates, with GANs as primary generative models, while diffusion models, transformers, and RNNs compete.
arXiv Detail & Related papers (2024-01-04T20:23:51Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Does Synthetic Data Make Large Language Models More Efficient? [0.0]
This paper explores the nuances of synthetic data generation in NLP.
We highlight its advantages, including data augmentation potential and the introduction of structured variety.
We demonstrate the impact of template-based synthetic data on the performance of modern transformer models.
arXiv Detail & Related papers (2023-10-11T19:16:09Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)