Rethinking Data Augmentation for Tabular Data in Deep Learning
- URL: http://arxiv.org/abs/2305.10308v2
- Date: Mon, 22 May 2023 13:02:40 GMT
- Title: Rethinking Data Augmentation for Tabular Data in Deep Learning
- Authors: Soma Onishi and Shoya Meguro
- Abstract summary: Tabular data is the most widely used data format in machine learning (ML).
Recent literature reports that self-supervised learning with Transformer-based models outperforms tree-based methods.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tabular data is the most widely used data format in machine learning (ML).
While tree-based methods outperform DL-based methods in supervised learning,
recent literature reports that self-supervised learning with Transformer-based
models outperforms tree-based methods. In the existing literature on
self-supervised learning for tabular data, contrastive learning is the
predominant method. In contrastive learning, data augmentation is important to
generate different views. However, data augmentation for tabular data has been
difficult due to the unique structure and high complexity of tabular data. In
addition, existing methods propose three main components together: model
structure, self-supervised learning method, and data augmentation.
Previous works have therefore compared performance without isolating these
components, so it is not clear how each component affects the actual
performance.
In this study, we focus on data augmentation to address these issues. We
propose a novel data augmentation method, $\textbf{M}$ask $\textbf{T}$oken
$\textbf{R}$eplacement ($\texttt{MTR}$), which replaces a portion of each
tokenized column with the mask token; $\texttt{MTR}$ takes advantage of the
properties of the Transformer, which is becoming the predominant DL-based
architecture for tabular data, to perform data augmentation on each column
embedding. Through experiments with 13 diverse public datasets in both
supervised and self-supervised learning scenarios, we show that $\texttt{MTR}$
achieves competitive performance against existing data augmentation methods and
improves model performance. In addition, we discuss specific scenarios in which
$\texttt{MTR}$ is most effective and identify the scope of its application. The
code is available at https://github.com/somaonishi/MTR/.
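As a concrete illustration, the following is a minimal PyTorch sketch of MTR-style masking, assuming FT-Transformer-style per-column token embeddings. The function name, signature, and masking rate are illustrative assumptions rather than the authors' API; the reference implementation lives in the repository linked above.

```python
# Minimal sketch of MTR-style column masking (illustrative assumptions,
# not the authors' reference implementation -- see the linked repository).
import torch

def mask_token_replacement(tokens: torch.Tensor,
                           mask_token: torch.Tensor,
                           p: float = 0.15) -> torch.Tensor:
    """Replace a random portion of column-token embeddings with [MASK].

    tokens:     (batch, n_columns, d) column embeddings from a feature tokenizer
    mask_token: (d,) learnable [MASK] embedding
    p:          probability that a given column token is replaced
    """
    batch, n_cols, d = tokens.shape
    # Bernoulli draw per (sample, column): True means "replace with [MASK]"
    replace = torch.rand(batch, n_cols, device=tokens.device) < p
    return torch.where(replace.unsqueeze(-1),
                       mask_token.expand(batch, n_cols, d),
                       tokens)

# Two independent calls give two augmented views for contrastive learning:
# view_a = mask_token_replacement(tokens, mask_token)
# view_b = mask_token_replacement(tokens, mask_token)
```

Because the mask positions are drawn independently on each call, applying the function twice to the same batch produces two distinct views of each sample, as contrastive self-supervised objectives require.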
Related papers
- $\texttt{dattri}$: A Library for Efficient Data Attribution [7.803566162554017]
Data attribution methods aim to quantify the influence of individual training samples on the predictions of artificial intelligence (AI) models.
Despite a surge of newly developed data attribution methods, a comprehensive library that facilitates the development, benchmarking, and deployment of these methods has been lacking.
In this work, we introduce $\texttt{dattri}$, an open-source data attribution library that addresses these needs.
arXiv Detail & Related papers (2024-10-06T17:18:09Z)
- TabEBM: A Tabular Data Augmentation Method with Distinct Class-Specific Energy-Based Models [10.88959673845634]
TabEBM is a class-conditional generative method using Energy-Based Models (EBMs).
Our experiments show that TabEBM generates synthetic data with higher quality and better statistical fidelity than existing methods.
arXiv Detail & Related papers (2024-09-24T14:25:59Z)
- Tabular Transfer Learning via Prompting LLMs [52.96022335067357]
We propose a novel framework, Prompt to Transfer (P2T), that utilizes unlabeled (or heterogeneous) source data with large language models (LLMs).
P2T identifies a column feature in a source dataset that is strongly correlated with a target-task feature and uses it to create examples relevant to the target task, which serve as pseudo-demonstrations for prompts.
arXiv Detail & Related papers (2024-08-09T11:30:52Z)
- A Closer Look at Deep Learning on Tabular Data [52.50778536274327]
Tabular data is prevalent across various domains in machine learning.
Deep Neural Network (DNN)-based methods have shown promising performance comparable to tree-based ones.
arXiv Detail & Related papers (2024-07-01T04:24:07Z)
- TabMDA: Tabular Manifold Data Augmentation for Any Classifier using Transformers with In-context Subsetting [23.461204546005387]
TabMDA is a novel method for manifold data augmentation on tabular data.
It exploits a pre-trained in-context model, such as TabPFN, to map the data into an embedding space.
We evaluate TabMDA on five standard classifiers and observe significant performance improvements across various datasets.
arXiv Detail & Related papers (2024-06-03T21:51:13Z)
- Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation learning approach for data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
arXiv Detail & Related papers (2023-12-07T18:51:41Z)
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- PTab: Using the Pre-trained Language Model for Modeling Tabular Data [5.791972449406902]
Recent studies show that neural-based models are effective in learning contextual representations for tabular data.
We propose a novel framework, PTab, which uses a pre-trained language model to model tabular data.
Our method achieves a better average AUC score in supervised settings than state-of-the-art baselines.
arXiv Detail & Related papers (2022-09-15T08:58:42Z)
- DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification [56.817386699291305]
This paper proposes a simple yet effective data augmentation approach termed DoubleMix.
DoubleMix first generates several perturbed samples for each training sample.
It then uses the perturbed data and the original data to carry out a two-step interpolation in the hidden space of neural models.
arXiv Detail & Related papers (2022-09-12T15:01:04Z)
- SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning [5.5616364225463055]
We introduce a new framework, Subsetting features of Tabular data (SubTab).
We argue that reconstructing the data from a subset of its features, rather than from a corrupted version, in an autoencoder setting can better capture the underlying representation; a simplified sketch of this idea follows the list.
arXiv Detail & Related papers (2021-10-08T20:11:09Z)
- Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations.
Our framework preserves the relations between samples well.
By seeking to embed samples into a subspace, we show that our method can address the large-scale and out-of-sample problems.
arXiv Detail & Related papers (2020-07-11T10:57:45Z)
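For the SubTab entry above, here is a simplified sketch of the subset-reconstruction idea under stated assumptions: random column dropout stands in for SubTab's actual feature subsetting scheme, and all names are illustrative rather than taken from the paper's code.

```python
# Simplified sketch of the SubTab idea: reconstruct the full row from a
# subset of its features. SubTab itself partitions features into subsets;
# random column dropout is used here as an illustrative stand-in.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubsetAutoencoder(nn.Module):
    """Autoencoder trained to reconstruct a full row from a feature subset."""

    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, x: torch.Tensor, keep_ratio: float = 0.75) -> torch.Tensor:
        # Keep a random subset of columns; the decoder must reconstruct
        # the complete feature vector from what remains.
        keep = (torch.rand_like(x) < keep_ratio).float()
        return self.decoder(self.encoder(x * keep))

model = SubsetAutoencoder(n_features=10)
x = torch.randn(32, 10)
loss = F.mse_loss(model(x), x)  # target is the full, uncorrupted row
```

The key distinction drawn in the summary is that the reconstruction target is the clean, complete row rather than a corrupted version of it.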
This list is automatically generated from the titles and abstracts of the papers on this site.