SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning
- URL: http://arxiv.org/abs/2110.04361v1
- Date: Fri, 8 Oct 2021 20:11:09 GMT
- Title: SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning
- Authors: Talip Ucar, Ehsan Hajiramezanali, Lindsay Edwards
- Abstract summary: We introduce a new framework, Subsetting features of Tabular data (SubTab).
We argue that reconstructing the data from the subset of its features rather than its corrupted version in an autoencoder setting can better capture its underlying representation.
- Score: 5.5616364225463055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning has been shown to be very effective in learning
useful representations, and yet much of the success is achieved in data types
such as images, audio, and text. The success is mainly enabled by taking
advantage of spatial, temporal, or semantic structure in the data through
augmentation. However, such structure may not exist in tabular datasets
commonly used in fields such as healthcare, making it difficult to design an
effective augmentation method, and hindering a similar progress in tabular data
setting. In this paper, we introduce a new framework, Subsetting features of
Tabular data (SubTab), that turns the task of learning from tabular data into a
multi-view representation learning problem by dividing the input features into
multiple subsets. We argue that reconstructing the data from the subset of its
features rather than its corrupted version in an autoencoder setting can better
capture its underlying latent representation. In this framework, the joint
representation can be expressed as the aggregate of latent variables of the
subsets at test time, which we refer to as collaborative inference. Our
experiments show that the SubTab achieves the state of the art (SOTA)
performance of 98.31% on MNIST in tabular setting, on par with CNN-based SOTA
models, and surpasses existing baselines on three other real-world datasets by
a significant margin.
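The abstract's core recipe — split the columns into subsets, train autoencoders that reconstruct the full feature vector from each subset, and average the subset latents at test time (collaborative inference) — can be sketched as follows. This is a minimal NumPy illustration with untrained linear maps, not the authors' implementation; all names below (`SubsetAutoencoder`, `collaborative_inference`, etc.) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_features(X, n_subsets):
    # Divide columns into n_subsets contiguous groups (the paper also
    # allows overlapping subsets; omitted here for brevity).
    return np.array_split(np.arange(X.shape[1]), n_subsets)

class SubsetAutoencoder:
    # A linear "autoencoder": encodes one feature subset to a latent z,
    # and decodes z back to the FULL feature vector, which is the
    # SubTab reconstruction objective.
    def __init__(self, d_in, d_full, d_latent):
        self.We = rng.normal(0.0, 0.1, (d_in, d_latent))
        self.Wd = rng.normal(0.0, 0.1, (d_latent, d_full))

    def encode(self, x_sub):
        return x_sub @ self.We

    def decode(self, z):
        return z @ self.Wd

def collaborative_inference(X, subsets, autoencoders):
    # Joint representation at test time = aggregate (here, mean)
    # of the per-subset latent variables.
    zs = [ae.encode(X[:, idx]) for idx, ae in zip(subsets, autoencoders)]
    return np.mean(zs, axis=0)

X = rng.normal(size=(8, 12))  # 8 rows of 12 tabular features
subsets = split_features(X, n_subsets=4)
aes = [SubsetAutoencoder(len(idx), X.shape[1], d_latent=5) for idx in subsets]
Z = collaborative_inference(X, subsets, aes)
print(Z.shape)  # one 5-dim joint latent per row
```

Training (reconstruction loss from each subset back to the full input) is omitted; the sketch only shows the subsetting and the test-time aggregation the abstract describes.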
Related papers
- Knowledge-Aware Reasoning over Multimodal Semi-structured Tables [85.24395216111462]
This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data.
We introduce MMTabQA, a new dataset designed for this purpose.
Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs.
arXiv Detail & Related papers (2024-08-25T15:17:43Z)
- LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
- Making Pre-trained Language Models Great on Tabular Prediction [50.70574370855663]
Transfer learning with deep neural networks (DNNs) has made significant progress in image and language processing.
We present TP-BERTa, a specifically pre-trained LM for tabular data prediction.
A novel relative magnitude tokenization converts scalar numerical feature values to finely discrete, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names.
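The relative magnitude tokenization described above maps scalar feature values to discrete tokens a language model can consume. A minimal equal-width-binning sketch of that idea follows; it is an assumption-laden stand-in, not TP-BERTa's actual tokenizer (which uses a learned, finer-grained scheme), and the function name and parameters are hypothetical.

```python
import numpy as np

def magnitude_tokenize(values, n_bins=16, lo=None, hi=None):
    # Map scalar feature values to discrete token ids by binning their
    # relative magnitude within [lo, hi]. Equal-width bins are a
    # simplification of TP-BERTa's "finely discrete" tokens.
    values = np.asarray(values, dtype=float)
    lo = values.min() if lo is None else lo
    hi = values.max() if hi is None else hi
    scaled = (values - lo) / max(hi - lo, 1e-12)  # relative magnitude in [0, 1]
    return np.clip((scaled * n_bins).astype(int), 0, n_bins - 1)

ages = [18, 35, 52, 90]
tokens = magnitude_tokenize(ages, n_bins=16)
print(tokens.tolist())  # small token ids for small values, large for large
```

Each resulting token id could then be embedded and attended to alongside the corresponding feature-name tokens, which is the role the intra-feature attention plays in the summary above.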
arXiv Detail & Related papers (2024-03-04T08:38:56Z)
- SwitchTab: Switched Autoencoders Are Effective Tabular Learners [16.316153704284936]
We introduce SwitchTab, a novel self-supervised representation method for tabular data.
SwitchTab captures latent dependencies by decoupling mutual and salient features among data pairs.
Results show superior performance in end-to-end prediction tasks with fine-tuning.
We highlight the capability of SwitchTab to create explainable representations through visualization of decoupled mutual and salient features in the latent space.
arXiv Detail & Related papers (2024-01-04T01:05:45Z)
- Tabular Few-Shot Generalization Across Heterogeneous Feature Spaces [43.67453625260335]
We propose a novel approach to few-shot learning involving knowledge sharing between datasets with heterogeneous feature spaces.
FLAT learns low-dimensional embeddings of datasets and their individual columns, which facilitate knowledge transfer and generalization to previously unseen datasets.
A decoder network parametrizes the predictive target network, implemented as a Graph Attention Network, to accommodate the heterogeneous nature of tabular datasets.
arXiv Detail & Related papers (2023-11-16T17:45:59Z)
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM)
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- Learning Representations without Compositional Assumptions [79.12273403390311]
We propose a data-driven approach that learns feature set dependencies by representing feature sets as graph nodes and their relationships as learnable edges.
We also introduce LEGATO, a novel hierarchical graph autoencoder that learns a smaller, latent graph to aggregate information from multiple views dynamically.
arXiv Detail & Related papers (2023-05-31T10:36:10Z)
- SEMv2: Table Separation Line Detection Based on Instance Segmentation [96.36188168694781]
We propose an accurate table structure recognizer, termed SEMv2 (SEM: Split, Embed and Merge)
We address the table separation line instance-level discrimination problem and introduce a table separation line detection strategy based on conditional convolution.
To comprehensively evaluate the SEMv2, we also present a more challenging dataset for table structure recognition, dubbed iFLYTAB.
arXiv Detail & Related papers (2023-03-08T05:15:01Z)
- PTab: Using the Pre-trained Language Model for Modeling Tabular Data [5.791972449406902]
Recent studies show that neural-based models are effective in learning contextual representation for Tabular data.
We propose a novel framework PTab, using the Pre-trained language model to model Tabular data.
Our method has achieved a better average AUC score in supervised settings compared to the state-of-the-art baselines.
arXiv Detail & Related papers (2022-09-15T08:58:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.