Comparative Analysis of Transformers for Modeling Tabular Data: A
Casestudy using Industry Scale Dataset
- URL: http://arxiv.org/abs/2311.14335v1
- Date: Fri, 24 Nov 2023 08:16:39 GMT
- Authors: Usneek Singh, Piyush Arora, Shamika Ganesan, Mohit Kumar, Siddhant
Kulkarni, Salil R. Joshi
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We perform a comparative analysis of transformer-based models designed for
modeling tabular data, specifically on an industry-scale dataset. While earlier
studies demonstrated promising outcomes on smaller public or synthetic
datasets, the effectiveness did not extend to larger industry-scale datasets.
The challenges identified include handling high-dimensional data, the necessity
for efficient pre-processing of categorical and numerical features, and
addressing substantial computational requirements.
To overcome the identified challenges, the study conducts an extensive
examination of various transformer-based models using both synthetic datasets
and the default prediction Kaggle dataset (2022) from American Express. The
paper presents crucial insights into optimal data pre-processing, compares
pre-training and direct supervised learning methods, discusses strategies for
managing categorical and numerical features, and highlights trade-offs between
computational resources and performance. Focusing on temporal financial data
modeling, the research aims to facilitate the systematic development and
deployment of transformer-based models in real-world scenarios, emphasizing
scalability.
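To make the pre-processing discussion above concrete, the following is a minimal illustrative sketch (not taken from the paper) of one common strategy for feeding mixed tabular features to a transformer: categorical values are mapped through an embedding lookup table, while numerical values are z-score normalized. All names and parameters here are hypothetical, and the random table stands in for embeddings that would normally be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def preprocess_tabular(cat_column, num_column, d_embed=4):
    """Hypothetical pre-processing sketch for a tabular transformer:
    categorical values -> embedding-table lookups,
    numerical values -> z-score normalization."""
    # Build a vocabulary and a stand-in embedding table for the categorical column.
    vocab = {v: i for i, v in enumerate(sorted(set(cat_column)))}
    embed_table = rng.normal(size=(len(vocab), d_embed))  # would be learned in practice
    cat_tokens = embed_table[[vocab[v] for v in cat_column]]       # shape (n, d_embed)

    # Standardize the numerical column so scales are comparable across features.
    mu, sigma = np.mean(num_column), np.std(num_column)
    num_tokens = ((np.asarray(num_column) - mu) / sigma)[:, None]  # shape (n, 1)

    # Each row becomes a fixed-width feature vector (or, in richer setups,
    # a short token sequence) that the transformer attends over.
    return np.concatenate([cat_tokens, num_tokens], axis=1)

X = preprocess_tabular(["A", "B", "A", "C"], [10.0, 20.0, 30.0, 40.0])
print(X.shape)  # (4, 5): 4 embedding dims + 1 normalized numerical feature
```

Real systems studied in work like this also handle high-cardinality categories (hashing, frequency capping) and missing values, which this sketch omits for brevity.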
Related papers
- Meta-Statistical Learning: Supervised Learning of Statistical Inference
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks.
We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z)
- Generalized Factor Neural Network Model for High-dimensional Regression
We tackle the challenges of modeling high-dimensional data sets with latent low-dimensional structures hidden within complex, non-linear, and noisy relationships.
Our approach enables a seamless integration of concepts from non-parametric regression, factor models, and neural networks for high-dimensional regression.
arXiv Detail & Related papers (2025-02-16T23:13:55Z)
- Exploring the Landscape for Generative Sequence Models for Specialized Data Synthesis
This paper introduces a novel approach that leverages three generative models of varying complexity to synthesize Malicious Network Traffic.
Our approach transforms numerical data into text, re-framing data generation as a language modeling task.
Our method surpasses state-of-the-art generative models in producing high-fidelity synthetic data.
arXiv Detail & Related papers (2024-11-04T09:51:10Z)
- On the Diversity of Synthetic Data and its Impact on Training Large Language Models
Large Language Models (LLMs) have accentuated the need for diverse, high-quality pre-training data.
Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility.
We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages.
arXiv Detail & Related papers (2024-10-19T22:14:07Z)
- Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development
We present a new sandbox suite tailored for integrated data-model co-development.
This sandbox provides a feedback-driven experimental platform, enabling cost-effective and guided refinement of both data and models.
arXiv Detail & Related papers (2024-07-16T14:40:07Z)
- Diffusion Models for Tabular Data Imputation and Synthetic Data Generation
Diffusion models have emerged as powerful generative models capable of capturing complex data distributions.
In this paper, we propose a diffusion model for tabular data that introduces three key enhancements.
The conditioning attention mechanism is designed to improve the model's ability to capture the relationship between the condition and synthetic data.
The transformer layers help model interactions within the condition (encoder) or synthetic data (decoder), while dynamic masking enables our model to efficiently handle both missing data imputation and synthetic data generation tasks.
arXiv Detail & Related papers (2024-07-02T15:27:06Z)
- A Comprehensive Survey on Data Augmentation
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples.
Existing literature surveys focus only on specific data modalities.
We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z)
- Comprehensive Exploration of Synthetic Data Generation: A Survey
This work surveys 417 synthetic data generation models published over the last decade.
The findings reveal increased model performance and complexity, with neural network-based approaches prevailing.
Computer vision dominates, with GANs as primary generative models, while diffusion models, transformers, and RNNs compete.
arXiv Detail & Related papers (2024-01-04T20:23:51Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark
Synthetic data serves as an alternative data source for training machine learning models.
However, ensuring that synthetic data mirrors the complex nuances of real-world data remains challenging.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Does Synthetic Data Make Large Language Models More Efficient?
This paper explores the nuances of synthetic data generation in NLP.
We highlight its advantages, including data augmentation potential and the introduction of structured variety.
We demonstrate the impact of template-based synthetic data on the performance of modern transformer models.
arXiv Detail & Related papers (2023-10-11T19:16:09Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.