Why In-Context Learning Transformers are Tabular Data Classifiers
- URL: http://arxiv.org/abs/2405.13396v1
- Date: Wed, 22 May 2024 07:13:55 GMT
- Title: Why In-Context Learning Transformers are Tabular Data Classifiers
- Authors: Felix den Breejen, Sangmin Bae, Stephen Cha, Se-Young Yun
- Abstract summary: We show that ICL-transformers acquire the ability to create complex decision boundaries during pretraining.
We create TabForestPFN, the ICL-transformer pretrained on both the original TabPFN synthetic dataset generator and our forest dataset generator.
- Score: 22.33649426762373
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recently introduced TabPFN pretrains an In-Context Learning (ICL) transformer on synthetic data to perform tabular data classification. As synthetic data does not share features or labels with real-world data, the underlying mechanism that contributes to the success of this method remains unclear. This study provides an explanation by demonstrating that ICL-transformers acquire the ability to create complex decision boundaries during pretraining. To validate our claim, we develop a novel forest dataset generator which creates datasets that are unrealistic, but have complex decision boundaries. Our experiments confirm the effectiveness of ICL-transformers pretrained on this data. Furthermore, we create TabForestPFN, the ICL-transformer pretrained on both the original TabPFN synthetic dataset generator and our forest dataset generator. By fine-tuning this model, we reach the current state-of-the-art on tabular data classification. Code is available at https://github.com/FelixdenBreejen/TabForestPFN.
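The abstract describes the forest dataset generator only at a high level, so the following is a minimal, hypothetical sketch of the idea: purely random, unrealistic features are labeled by a randomly grown decision tree, which produces the kind of complex, piecewise-constant decision boundaries the paper pretrains on. The function name, parameters, and the use of scikit-learn here are illustrative assumptions; the actual generator is available in the linked repository.

```python
# Hypothetical sketch of a "forest"-style synthetic dataset generator
# (illustrative only; the real TabForestPFN generator is in the authors' repo).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def sample_forest_dataset(n_samples=1024, n_features=10, n_classes=4,
                          tree_depth=8, seed=0):
    rng = np.random.default_rng(seed)
    # Fit a decision tree on a small batch of random points with random labels;
    # memorising this noise turns the tree into a complex, axis-aligned,
    # piecewise-constant labeling function over feature space.
    X_seed = rng.normal(size=(256, n_features))
    y_seed = rng.integers(0, n_classes, 256)
    labeler = DecisionTreeClassifier(max_depth=tree_depth, random_state=seed)
    labeler.fit(X_seed, y_seed)
    # Draw a fresh, unrealistic dataset and label it with the tree,
    # so the labels follow a complex but well-defined decision boundary.
    X = rng.normal(size=(n_samples, n_features))
    y = labeler.predict(X)
    return X, y

X, y = sample_forest_dataset()
print(X.shape, np.bincount(y))
```

During pretraining, each such generated dataset can be split into a labeled in-context part and a query part, so the ICL-transformer is rewarded for reproducing complex decision boundaries purely in context.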
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- In-Context In-Context Learning with Transformer Neural Processes [50.57807892496024]
We develop the in-context in-context learning pseudo-token TNP (ICICL-TNP).
The ICICL-TNP is capable of conditioning on both sets of datapoints and sets of datasets, enabling it to perform in-context in-context learning.
We demonstrate the importance of in-context in-context learning and the effectiveness of the ICICL-TNP in a number of experiments.
arXiv Detail & Related papers (2024-06-19T12:26:36Z)
- TabMT: Generating tabular data with masked transformers [0.0]
Masked Transformers are incredibly effective as generative models and classifiers.
This work contributes to the exploration of transformer-based models in synthetic data generation for diverse application domains.
arXiv Detail & Related papers (2023-12-11T03:28:11Z)
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt to leverage table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low resource regime, missing value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
arXiv Detail & Related papers (2023-05-16T06:37:38Z)
- REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers [0.0]
We introduce REaLTabFormer (Realistic Relational and Tabular Transformer), a synthetic data generation model.
It first creates a parent table using an autoregressive GPT-2 model, then generates the relational dataset conditioned on the parent table using a sequence-to-sequence model.
Experiments using real-world datasets show that REaLTabFormer captures the relational structure better than a baseline model.
arXiv Detail & Related papers (2023-02-04T00:32:50Z)
- TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second [48.87527918630822]
We present TabPFN, a trained Transformer that can do supervised classification for small datasets in less than a second.
TabPFN performs in-context learning (ICL): it learns to make predictions from sequences of labeled examples (a usage-level sketch of this interface follows after this list).
We show that our method clearly outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with up to 230$\times$ speedup.
arXiv Detail & Related papers (2022-07-05T07:17:43Z)
- Deep Transformer Networks for Time Series Classification: The NPP Safety Case [59.20947681019466]
An advanced temporal neural network, referred to as the Transformer, is used in a supervised learning fashion to model the time-dependent NPP simulation data.
The Transformer learns the characteristics of the sequential data and yields promising performance, reaching approximately 99% classification accuracy on the test dataset.
arXiv Detail & Related papers (2021-04-09T14:26:25Z)
- Tabular Transformers for Modeling Multivariate Time Series [30.717890753132824]
Tabular datasets are ubiquitous in data science applications. Given their importance, it seems natural to apply state-of-the-art deep learning algorithms in order to fully unlock their potential.
Here we propose neural network models for tabular time series that leverage their hierarchical structure.
We demonstrate our models on two datasets: a synthetic credit card transaction dataset, where the learned representations are used for fraud detection and synthetic data generation, and on a real pollution dataset, where the learned encodings are used to predict atmospheric pollutant concentrations.
arXiv Detail & Related papers (2020-11-03T16:58:08Z)
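As a usage-level illustration of the in-context learning interface described in the TabPFN entry above, the sketch below shows the typical call pattern: the entire labeled training set is passed as context together with the query rows, and classification happens in a single forward pass with no gradient updates at prediction time. The class name and arguments follow the publicly released `tabpfn` package's scikit-learn-style wrapper, but the exact signature may differ between versions.

```python
# Hypothetical usage sketch of ICL-based tabular classification with TabPFN.
# The labeled training rows act as in-context examples; no training loop runs.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier  # pip install tabpfn

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier(device="cpu")
clf.fit(X_train, y_train)          # stores the context set; no gradient updates
proba = clf.predict_proba(X_test)  # one forward pass over context + queries
print(proba.shape)
```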