Related papers: Comparative Analysis of Transformers for Modeling Tabular Data: A Casestudy using Industry Scale Dataset

Comparative Analysis of Transformers for Modeling Tabular Data: A Casestudy using Industry Scale Dataset

URL: http://arxiv.org/abs/2311.14335v1
Date: Fri, 24 Nov 2023 08:16:39 GMT
Title: Comparative Analysis of Transformers for Modeling Tabular Data: A Casestudy using Industry Scale Dataset
Authors: Usneek Singh, Piyush Arora, Shamika Ganesan, Mohit Kumar, Siddhant Kulkarni, Salil R. Joshi
Abstract summary: The study conducts an extensive examination of various transformer-based models using both synthetic datasets and the default prediction Kaggle dataset (2022) from American Express. The paper presents crucial insights into optimal data pre-processing, compares pre-training and direct supervised learning methods, discusses strategies for managing categorical and numerical features, and highlights trade-offs between computational resources and performance.
Score: 1.0758036046280266
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We perform a comparative analysis of transformer-based models designed for modeling tabular data, specifically on an industry-scale dataset. While earlier studies demonstrated promising outcomes on smaller public or synthetic datasets, the effectiveness did not extend to larger industry-scale datasets. The challenges identified include handling high-dimensional data, the necessity for efficient pre-processing of categorical and numerical features, and addressing substantial computational requirements. To overcome the identified challenges, the study conducts an extensive examination of various transformer-based models using both synthetic datasets and the default prediction Kaggle dataset (2022) from American Express. The paper presents crucial insights into optimal data pre-processing, compares pre-training and direct supervised learning methods, discusses strategies for managing categorical and numerical features, and highlights trade-offs between computational resources and performance. Focusing on temporal financial data modeling, the research aims to facilitate the systematic development and deployment of transformer-based models in real-world scenarios, emphasizing scalability.

Related papers

Prototype-Guided Diffusion for Digital Pathology: Achieving Foundation Model Performance with Minimal Clinical Data [6.318463500874778]
We propose a prototype-guided diffusion model to generate high-fidelity synthetic pathology data at scale. Our approach ensures biologically and diagnostically meaningful variations in the generated data. We demonstrate that self-supervised features trained on our synthetic dataset achieve competitive performance despite using 60x-760x less data than models trained on large real-world datasets.
arXiv Detail & Related papers (2025-04-15T21:17:39Z)
Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
Empowering Time Series Analysis with Synthetic Data: A Survey and Outlook in the Era of Foundation Models [104.17057231661371]
Time series analysis is crucial for understanding dynamics of complex systems. Recent advances in foundation models have led to task-agnostic Time Series Foundation Models (TSFMs) and Large Language Model-based Time Series Models (TSLLMs) Their success depends on large, diverse, and high-quality datasets, which are challenging to build due to regulatory, diversity, quality, and quantity constraints. This survey provides a comprehensive review of synthetic data for TSFMs and TSLLMs, analyzing data generation strategies, their role in model pretraining, fine-tuning, and evaluation, and identifying future research directions.
arXiv Detail & Related papers (2025-03-14T13:53:46Z)
Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks. We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z)
Generalized Factor Neural Network Model for High-dimensional Regression [50.554377879576066]
We tackle the challenges of modeling high-dimensional data sets with latent low-dimensional structures hidden within complex, non-linear, and noisy relationships. Our approach enables a seamless integration of concepts from non-parametric regression, factor models, and neural networks for high-dimensional regression.
arXiv Detail & Related papers (2025-02-16T23:13:55Z)
Exploring the Landscape for Generative Sequence Models for Specialized Data Synthesis [0.0]
This paper introduces a novel approach that leverages three generative models of varying complexity to synthesize Malicious Network Traffic. Our approach transforms numerical data into text, re-framing data generation as a language modeling task. Our method surpasses state-of-the-art generative models in producing high-fidelity synthetic data.
arXiv Detail & Related papers (2024-11-04T09:51:10Z)
On the Diversity of Synthetic Data and its Impact on Training Large Language Models [34.00031258223175]
Large Language Models (LLMs) have accentuated the need for diverse, high-quality pre-training data. Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility. We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages.
arXiv Detail & Related papers (2024-10-19T22:14:07Z)
Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development [67.55944651679864]
We present a new sandbox suite tailored for integrated data-model co-development. This sandbox provides a feedback-driven experimental platform, enabling cost-effective and guided refinement of both data and models.
arXiv Detail & Related papers (2024-07-16T14:40:07Z)
Diffusion Models for Tabular Data Imputation and Synthetic Data Generation [3.667364190843767]
Diffusion models have emerged as powerful generative models capable of capturing complex data distributions. In this paper, we propose a diffusion model for tabular data that introduces three key enhancements. The conditioning attention mechanism is designed to improve the model's ability to capture the relationship between the condition and synthetic data. The transformer layers help model interactions within the condition (encoder) or synthetic data (decoder), while dynamic masking enables our model to efficiently handle both missing data imputation and synthetic data generation tasks.
arXiv Detail & Related papers (2024-07-02T15:27:06Z)
A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples. Existing literature surveys only focus on a certain type of specific modality data. We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z)
Comprehensive Exploration of Synthetic Data Generation: A Survey [4.485401662312072]
This work surveys 417 Synthetic Data Generation models over the last decade. The findings reveal increased model performance and complexity, with neural network-based approaches prevailing. Computer vision dominates, with GANs as primary generative models, while diffusion models, transformers, and RNNs compete.
arXiv Detail & Related papers (2024-01-04T20:23:51Z)
Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models. ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
Does Synthetic Data Make Large Language Models More Efficient? [0.0]
This paper explores the nuances of synthetic data generation in NLP. We highlight its advantages, including data augmentation potential and the introduction of structured variety. We demonstrate the impact of template-based synthetic data on the performance of modern transformer models.
arXiv Detail & Related papers (2023-10-11T19:16:09Z)
TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets. We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models. We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods. We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments. The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples. We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models. We also observe span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.