Rethinking Data Augmentation for Tabular Data in Deep Learning
- URL: http://arxiv.org/abs/2305.10308v2
- Date: Mon, 22 May 2023 13:02:40 GMT
- Title: Rethinking Data Augmentation for Tabular Data in Deep Learning
- Authors: Soma Onishi and Shoya Meguro
- Abstract summary: Tabular data is the most widely used data format in machine learning (ML).
Recent literature reports that self-supervised learning with Transformer-based models outperforms tree-based methods.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tabular data is the most widely used data format in machine learning (ML).
While tree-based methods outperform DL-based methods in supervised learning,
recent literature reports that self-supervised learning with Transformer-based
models outperforms tree-based methods. In the existing literature on
self-supervised learning for tabular data, contrastive learning is the
predominant method. In contrastive learning, data augmentation is important to
generate different views. However, data augmentation for tabular data has been
difficult due to the unique structure and high complexity of tabular data. In
addition, existing methods propose three main components together: model
structure, self-supervised learning method, and data augmentation.
Previous works have therefore compared performance without isolating these
components, so it is not clear how each component affects the actual
performance.
In this study, we focus on data augmentation to address these issues. We
propose a novel data augmentation method, $\textbf{M}$ask $\textbf{T}$oken
$\textbf{R}$eplacement ($\texttt{MTR}$), which replaces a portion of each
tokenized column with the mask token; $\texttt{MTR}$ takes advantage of the
properties of the Transformer, which is becoming the predominant DL-based
architecture for tabular data, to perform data augmentation on each column
embedding. Through experiments with 13 diverse public datasets in both
supervised and self-supervised learning scenarios, we show that $\texttt{MTR}$
achieves competitive performance against existing data augmentation methods and
improves model performance. In addition, we discuss specific scenarios in which
$\texttt{MTR}$ is most effective and identify the scope of its application. The
code is available at https://github.com/somaonishi/MTR/.
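As a concrete illustration, the following is a minimal PyTorch sketch of MTR-style masking, assuming FT-Transformer-style per-column token embeddings. The function name, signature, and masking rate are illustrative assumptions rather than the authors' API; the reference implementation lives in the repository linked above.

```python
# Minimal sketch of MTR-style column masking (illustrative assumptions,
# not the authors' reference implementation -- see the linked repository).
import torch

def mask_token_replacement(tokens: torch.Tensor,
                           mask_token: torch.Tensor,
                           p: float = 0.15) -> torch.Tensor:
    """Replace a random portion of column-token embeddings with [MASK].

    tokens:     (batch, n_columns, d) column embeddings from a feature tokenizer
    mask_token: (d,) learnable [MASK] embedding
    p:          probability that a given column token is replaced
    """
    batch, n_cols, d = tokens.shape
    # Bernoulli draw per (sample, column): True means "replace with [MASK]"
    replace = torch.rand(batch, n_cols, device=tokens.device) < p
    return torch.where(replace.unsqueeze(-1),
                       mask_token.expand(batch, n_cols, d),
                       tokens)

# Two independent calls give two augmented views for contrastive learning:
# view_a = mask_token_replacement(tokens, mask_token)
# view_b = mask_token_replacement(tokens, mask_token)
```

Because the mask positions are drawn independently on each call, applying the function twice to the same batch produces two distinct views of each sample, as contrastive self-supervised objectives require.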
Related papers
- $\texttt{dattri}$: A Library for Efficient Data Attribution [7.803566162554017]
Data attribution methods aim to quantify the influence of individual training samples on the predictions of artificial intelligence (AI) models.
Despite a surge of newly developed data attribution methods, a comprehensive library that facilitates the development, benchmarking, and deployment of these methods has been lacking.
In this work, we introduce $\texttt{dattri}$, an open-source data attribution library that addresses these needs.
arXiv Detail & Related papers (2024-10-06T17:18:09Z)
- TabEBM: A Tabular Data Augmentation Method with Distinct Class-Specific Energy-Based Models [10.88959673845634]
TabEBM is a class-conditional generative method using Energy-Based Models (EBMs).
Our experiments show that TabEBM generates synthetic data with higher quality and better statistical fidelity than existing methods.
arXiv Detail & Related papers (2024-09-24T14:25:59Z)
- Tabular Transfer Learning via Prompting LLMs [52.96022335067357]
We propose a novel framework, Prompt to Transfer (P2T), that utilizes unlabeled (or heterogeneous) source data with large language models (LLMs).
P2T identifies a column feature in a source dataset that is strongly correlated with a target-task feature and uses it to create examples relevant to the target task, which serve as pseudo-demonstrations for prompts.
arXiv Detail & Related papers (2024-08-09T11:30:52Z)
- A Closer Look at Deep Learning on Tabular Data [52.50778536274327]
Tabular data is prevalent across various domains in machine learning.
Deep Neural Network (DNN)-based methods have shown promising performance comparable to tree-based ones.
arXiv Detail & Related papers (2024-07-01T04:24:07Z)
- TabMDA: Tabular Manifold Data Augmentation for Any Classifier using Transformers with In-context Subsetting [23.461204546005387]
TabMDA is a novel method for manifold data augmentation on tabular data.
It exploits a pre-trained in-context model, such as TabPFN, to map the data into an embedding space.
We evaluate TabMDA on five standard classifiers and observe significant performance improvements across various datasets.
arXiv Detail & Related papers (2024-06-03T21:51:13Z)
- Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation learning approach for data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
arXiv Detail & Related papers (2023-12-07T18:51:41Z)
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- PTab: Using the Pre-trained Language Model for Modeling Tabular Data [5.791972449406902]
Recent studies show that neural-based models are effective in learning contextual representations for tabular data.
We propose a novel framework, PTab, which uses a pre-trained language model to model tabular data.
Our method achieves a better average AUC score in supervised settings than state-of-the-art baselines.
arXiv Detail & Related papers (2022-09-15T08:58:42Z)
- DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification [56.817386699291305]
This paper proposes a simple yet effective data augmentation approach termed DoubleMix.
DoubleMix first generates several perturbed samples for each training sample.
It then uses the perturbed data and the original data to carry out a two-step interpolation in the hidden space of neural models.
arXiv Detail & Related papers (2022-09-12T15:01:04Z)
- SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning [5.5616364225463055]
We introduce a new framework, Subsetting features of Tabular data (SubTab).
We argue that reconstructing the data from a subset of its features, rather than from a corrupted version, in an autoencoder setting can better capture the underlying representation; a simplified sketch of this idea follows the list.
arXiv Detail & Related papers (2021-10-08T20:11:09Z)
- Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations.
Our framework preserves the relations between samples well.
By seeking to embed samples into a subspace, we show that our method can address the large-scale and out-of-sample problems.
arXiv Detail & Related papers (2020-07-11T10:57:45Z)
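For the SubTab entry above, here is a simplified sketch of the subset-reconstruction idea under stated assumptions: random column dropout stands in for SubTab's actual feature subsetting scheme, and all names are illustrative rather than taken from the paper's code.

```python
# Simplified sketch of the SubTab idea: reconstruct the full row from a
# subset of its features. SubTab itself partitions features into subsets;
# random column dropout is used here as an illustrative stand-in.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubsetAutoencoder(nn.Module):
    """Autoencoder trained to reconstruct a full row from a feature subset."""

    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, x: torch.Tensor, keep_ratio: float = 0.75) -> torch.Tensor:
        # Keep a random subset of columns; the decoder must reconstruct
        # the complete feature vector from what remains.
        keep = (torch.rand_like(x) < keep_ratio).float()
        return self.decoder(self.encoder(x * keep))

model = SubsetAutoencoder(n_features=10)
x = torch.randn(32, 10)
loss = F.mse_loss(model(x), x)  # target is the full, uncorrupted row
```

The key distinction drawn in the summary is that the reconstruction target is the clean, complete row rather than a corrupted version of it.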
This list is automatically generated from the titles and abstracts of the papers on this site.