SubTab: Subsetting Features of Tabular Data for Self-Supervised
Representation Learning
- URL: http://arxiv.org/abs/2110.04361v1
- Date: Fri, 8 Oct 2021 20:11:09 GMT
- Title: SubTab: Subsetting Features of Tabular Data for Self-Supervised
Representation Learning
- Authors: Talip Ucar, Ehsan Hajiramezanali, Lindsay Edwards
- Abstract summary: We introduce a new framework, Subsetting features of Tabular data (SubTab)
In this paper, we introduce a new framework, Subsetting features of Tabular data (SubTab)
We argue that reconstructing the data from the subset of its features rather than its corrupted version in an autoencoder setting can better capture its underlying representation.
- Score: 5.5616364225463055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning has been shown to be very effective in learning
useful representations, and yet much of the success is achieved in data types
such as images, audio, and text. The success is mainly enabled by taking
advantage of spatial, temporal, or semantic structure in the data through
augmentation. However, such structure may not exist in tabular datasets
commonly used in fields such as healthcare, making it difficult to design an
effective augmentation method, and hindering a similar progress in tabular data
setting. In this paper, we introduce a new framework, Subsetting features of
Tabular data (SubTab), that turns the task of learning from tabular data into a
multi-view representation learning problem by dividing the input features to
multiple subsets. We argue that reconstructing the data from the subset of its
features rather than its corrupted version in an autoencoder setting can better
capture its underlying latent representation. In this framework, the joint
representation can be expressed as the aggregate of latent variables of the
subsets at test time, which we refer to as collaborative inference. Our
experiments show that the SubTab achieves the state of the art (SOTA)
performance of 98.31% on MNIST in tabular setting, on par with CNN-based SOTA
models, and surpasses existing baselines on three other real-world datasets by
a significant margin.
Related papers
- Representation Learning for Tabular Data: A Comprehensive Survey [23.606506938919605]
Tabular data, structured as rows and columns, is among the most prevalent data types in machine learning classification and regression applications.
Deep Neural Networks (DNNs) have recently demonstrated promising results through their capability of representation learning.
We organize existing methods into three main categories according to their generalization capabilities.
arXiv Detail & Related papers (2025-04-17T17:58:23Z) - TabGLM: Tabular Graph Language Model for Learning Transferable Representations Through Multi-Modal Consistency Minimization [2.1067477213933503]
TabGLM (Tabular Graph Language Model) is a novel multi-modal architecture designed to model both structural and semantic information from a table.
It transforms each row of a table into a fully connected graph and serialized text, which are encoded using a graph neural network (GNN) and a text encoder, respectively.
Evaluations across 25 benchmark datasets demonstrate substantial performance gains.
arXiv Detail & Related papers (2025-02-26T05:32:45Z) - A Closer Look at TabPFN v2: Strength, Limitation, and Extension [51.08999772842298]
Tabular Prior-data Fitted Network v2 (TabPFN v2) achieves unprecedented in-context learning accuracy across multiple datasets.
In this paper, we evaluate TabPFN v2 on over 300 datasets, confirming its exceptional generalization capabilities on small- to medium-scale tasks.
arXiv Detail & Related papers (2025-02-24T17:38:42Z) - Knowledge-Aware Reasoning over Multimodal Semi-structured Tables [85.24395216111462]
This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data.
We introduce MMTabQA, a new dataset designed for this purpose.
Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs.
arXiv Detail & Related papers (2024-08-25T15:17:43Z) - LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z) - Making Pre-trained Language Models Great on Tabular Prediction [50.70574370855663]
The transferability of deep neural networks (DNNs) has made significant progress in image and language processing.
We present TP-BERTa, a specifically pre-trained LM for tabular data prediction.
A novel relative magnitude tokenization converts scalar numerical feature values to finely discrete, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names.
arXiv Detail & Related papers (2024-03-04T08:38:56Z) - SwitchTab: Switched Autoencoders Are Effective Tabular Learners [16.316153704284936]
We introduce SwitchTab, a novel self-supervised representation method for tabular data.
SwitchTab captures latent dependencies by decouples mutual and salient features among data pairs.
Results show superior performance in end-to-end prediction tasks with fine-tuning.
We highlight the capability of SwitchTab to create explainable representations through visualization of decoupled mutual and salient features in the latent space.
arXiv Detail & Related papers (2024-01-04T01:05:45Z) - Tabular Few-Shot Generalization Across Heterogeneous Feature Spaces [43.67453625260335]
We propose a novel approach to few-shot learning involving knowledge sharing between datasets with heterogeneous feature spaces.
FLAT learns low-dimensional embeddings of datasets and their individual columns, which facilitate knowledge transfer and generalization to previously unseen datasets.
A decoder network parametrizes the predictive target network, implemented as a Graph Attention Network, to accommodate the heterogeneous nature of tabular datasets.
arXiv Detail & Related papers (2023-11-16T17:45:59Z) - Training-Free Generalization on Heterogeneous Tabular Data via
Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM)
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z) - Learning Representations without Compositional Assumptions [79.12273403390311]
We propose a data-driven approach that learns feature set dependencies by representing feature sets as graph nodes and their relationships as learnable edges.
We also introduce LEGATO, a novel hierarchical graph autoencoder that learns a smaller, latent graph to aggregate information from multiple views dynamically.
arXiv Detail & Related papers (2023-05-31T10:36:10Z) - SEMv2: Table Separation Line Detection Based on Instance Segmentation [96.36188168694781]
We propose an accurate table structure recognizer, termed SEMv2 (SEM: Split, Embed and Merge)
We address the table separation line instance-level discrimination problem and introduce a table separation line detection strategy based on conditional convolution.
To comprehensively evaluate the SEMv2, we also present a more challenging dataset for table structure recognition, dubbed iFLYTAB.
arXiv Detail & Related papers (2023-03-08T05:15:01Z) - PTab: Using the Pre-trained Language Model for Modeling Tabular Data [5.791972449406902]
Recent studies show that neural-based models are effective in learning contextual representation for Tabular data.
We propose a novel framework PTab, using the Pre-trained language model to model Tabular data.
Our method has achieved a better average AUC score in supervised settings compared to the state-of-the-art baselines.
arXiv Detail & Related papers (2022-09-15T08:58:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.