TabRet: Pre-training Transformer-based Tabular Models for Unseen Columns
- URL: http://arxiv.org/abs/2303.15747v4
- Date: Sun, 16 Apr 2023 03:42:52 GMT
- Title: TabRet: Pre-training Transformer-based Tabular Models for Unseen Columns
- Authors: Soma Onishi, Kenta Oono, and Kohei Hayashi
- Abstract summary: TabRet is designed to work on a downstream task that contains columns not seen in pre-training.
In experiments, we pre-trained TabRet with a large collection of public health surveys and fine-tuned it on classification tasks in healthcare.
In addition, an ablation study shows retokenizing and random shuffle augmentation of columns during pre-training contributed to performance gains.
- Score: 12.139158398361866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present \emph{TabRet}, a pre-trainable Transformer-based model for tabular
data. TabRet is designed to work on a downstream task that contains columns not
seen in pre-training. Unlike other methods, TabRet has an extra learning step
before fine-tuning called \emph{retokenizing}, which calibrates feature
embeddings based on the masked autoencoding loss. In experiments, we
pre-trained TabRet with a large collection of public health surveys and
fine-tuned it on classification tasks in healthcare, and TabRet achieved the
best AUC performance on four datasets. In addition, an ablation study shows
retokenizing and random shuffle augmentation of columns during pre-training
contributed to performance gains. The code is available at
https://github.com/pfnet-research/tabret .
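As a concrete illustration of the procedure the abstract describes, here is a minimal, hedged PyTorch sketch: per-column tokens feed a Transformer backbone trained with a masked-autoencoding loss and column-shuffle augmentation, and the retokenizing phase trains only the tokenizer for new downstream columns while the backbone stays frozen. All class and function names (ColumnTokenizer, TabRetSketch, retokenize) and hyperparameters are illustrative assumptions, not the authors' implementation (see the linked repository for that).
```python
# Hedged sketch of a TabRet-style recipe: masked autoencoding over column tokens,
# column-shuffle augmentation for pre-training, and "retokenizing" for new columns.
import torch
import torch.nn as nn


class ColumnTokenizer(nn.Module):
    """One learnable affine map per column: scalar value -> d_model-dim token."""

    def __init__(self, n_cols: int, d_model: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_cols, d_model) * 0.02)
        self.bias = nn.Parameter(torch.zeros(n_cols, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, n_cols)
        return x.unsqueeze(-1) * self.weight + self.bias  # (batch, n_cols, d_model)


class TabRetSketch(nn.Module):
    def __init__(self, n_cols: int, d_model: int = 64, mask_ratio: float = 0.5):
        super().__init__()
        self.tokenizer = ColumnTokenizer(n_cols, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(d_model, 1)           # reconstructs masked values
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        self.mask_ratio = mask_ratio

    def mae_loss(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.tokenizer(x)
        mask = torch.rand(x.shape) < self.mask_ratio   # True = this cell is masked
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        recon = self.decoder(self.backbone(tokens)).squeeze(-1)
        return ((recon - x) ** 2)[mask].mean()


def shuffle_columns(x: torch.Tensor) -> torch.Tensor:
    """Random column-shuffle augmentation applied during pre-training."""
    return x[:, torch.randperm(x.shape[1])]


def retokenize(model: TabRetSketch, x_downstream: torch.Tensor, steps: int = 200) -> None:
    """Calibrate only the tokenizer on the downstream columns; the backbone stays frozen."""
    for p in model.backbone.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(model.tokenizer.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        model.mae_loss(x_downstream).backward()
        opt.step()
```
In this sketch, fine-tuning would follow retokenizing by attaching a task head on top of the backbone; the actual architecture and training details are in the repository linked above.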
Related papers
- TabDPT: Scaling Tabular Foundation Models [20.00390825519329]
We show how to harness the power of real data to improve performance and generalization.
Our model achieves state-of-the-art performance on the CC18 (classification) and CTR23 (regression) benchmarks.
TabDPT also demonstrates strong scaling as both model size and amount of available data increase.
arXiv Detail & Related papers (2024-10-23T18:00:00Z) - LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that a finetuned LaTable generates out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z) - Tokenize features, enhancing tables: the FT-TABPFN model for tabular classification [13.481699494376809]
FT-TabPFN is an enhanced version of TabPFN that includes a novel Feature Tokenization layer to better handle classification features.
Our full source code is available for community use and development.
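A feature-tokenization layer of the kind this entry describes can be pictured with a brief sketch: each categorical column gets its own embedding table and each numerical column a per-column affine map, producing one token per feature. The class name, dimensions, and layout below are illustrative assumptions, not the FT-TabPFN implementation.
```python
# Hedged sketch of a feature-tokenization layer for mixed categorical/numerical columns.
import torch
import torch.nn as nn


class FeatureTokenizer(nn.Module):
    """One embedding table per categorical column, one affine map per numerical column."""

    def __init__(self, cat_cardinalities, n_num_cols, d_model=32):
        super().__init__()
        self.cat_embeds = nn.ModuleList(
            [nn.Embedding(card, d_model) for card in cat_cardinalities]
        )
        self.num_weight = nn.Parameter(torch.randn(n_num_cols, d_model) * 0.02)
        self.num_bias = nn.Parameter(torch.zeros(n_num_cols, d_model))

    def forward(self, x_cat, x_num):
        # x_cat: (batch, n_cat) integer category codes; x_num: (batch, n_num) floats
        cat_tokens = torch.stack(
            [emb(x_cat[:, i]) for i, emb in enumerate(self.cat_embeds)], dim=1
        )
        num_tokens = x_num.unsqueeze(-1) * self.num_weight + self.num_bias
        return torch.cat([cat_tokens, num_tokens], dim=1)  # (batch, n_cat + n_num, d_model)


# Usage: two categorical columns (cardinalities 4 and 7) and three numerical columns.
tok = FeatureTokenizer(cat_cardinalities=[4, 7], n_num_cols=3)
tokens = tok(torch.randint(0, 4, (8, 2)), torch.randn(8, 3))
print(tokens.shape)  # torch.Size([8, 5, 32])
```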
arXiv Detail & Related papers (2024-06-11T02:13:46Z) - TabMDA: Tabular Manifold Data Augmentation for Any Classifier using Transformers with In-context Subsetting [23.461204546005387]
TabMDA is a novel method for manifold data augmentation on tabular data.
It exploits a pre-trained in-context model, such as TabPFN, to map the data into an embedding space.
We evaluate TabMDA on five standard classifiers and observe significant performance improvements across various datasets.
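The augmentation idea in this entry can be sketched in a few lines: embed each training row several times, each time conditioning a frozen pre-trained in-context encoder on a different random subset of the labelled data, and train an ordinary classifier on the enlarged embedding set. The `frozen_encoder` below is a hypothetical stand-in (distances to per-class context means), not TabPFN's embedding API, and all names are assumptions rather than the TabMDA code.
```python
# Hedged sketch of manifold-style augmentation via in-context subsetting.
import numpy as np
from sklearn.linear_model import LogisticRegression


def frozen_encoder(x_context, y_context, x_query):
    """Hypothetical stand-in for a pre-trained in-context encoder: embeds each query
    row as its distances to the per-class means of the labelled context."""
    classes = np.unique(y_context)
    means = np.stack([x_context[y_context == c].mean(axis=0) for c in classes])
    return np.stack([np.linalg.norm(x_query - m, axis=1) for m in means], axis=1)


def tabmda_style_augment(X, y, n_subsets=5, subset_frac=0.5, seed=0):
    """Embed every row once per random context subset; each subset yields another
    'view' of the data in embedding space, enlarging the training set."""
    rng = np.random.default_rng(seed)
    embeddings, labels = [], []
    for _ in range(n_subsets):
        idx = rng.choice(len(X), size=int(subset_frac * len(X)), replace=False)
        embeddings.append(frozen_encoder(X[idx], y[idx], X))
        labels.append(y)
    return np.concatenate(embeddings), np.concatenate(labels)


# Usage: train an ordinary classifier on the augmented embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)
X_aug, y_aug = tabmda_style_augment(X, y)
clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
```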
arXiv Detail & Related papers (2024-06-03T21:51:13Z) - Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z) - Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z) - UniTabE: A Universal Pretraining Protocol for Tabular Foundation Model in Data Science [16.384705926693073]
This study seeks to extend the power of pretraining methodologies to prediction over tables in data science.
We introduce UniTabE, a method designed to process tables in a uniform manner, devoid of constraints imposed by specific table structures.
For the pretraining phase, we curated an expansive dataset of approximately 13B samples gathered from the Kaggle platform.
arXiv Detail & Related papers (2023-07-18T13:28:31Z) - Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low resource regime, missing value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
arXiv Detail & Related papers (2023-05-16T06:37:38Z) - XTab: Cross-table Pretraining for Tabular Transformers [29.419276738753968]
XTab is a framework for cross-table pretraining of tabular transformers on datasets from various domains.
We show that XTab consistently boosts the generalizability, learning speed, and performance of multiple tabular transformers.
We achieve superior performance to other state-of-the-art tabular deep learning models on tasks such as regression, binary classification, and multiclass classification.
arXiv Detail & Related papers (2023-05-10T12:17:52Z) - A Memory Transformer Network for Incremental Learning [64.0410375349852]
We study class-incremental learning, a training setup in which new classes of data are observed over time for the model to learn from.
Despite the straightforward problem formulation, the naive application of classification models to class-incremental learning results in the "catastrophic forgetting" of previously seen classes.
One of the most successful existing methods has been the use of a memory of exemplars, which overcomes catastrophic forgetting by saving a subset of past data into a memory bank and replaying it when training on future tasks.
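The exemplar-memory idea this entry builds on can be illustrated with a small rehearsal-buffer sketch: keep a few examples per previously seen class and mix them into later tasks' training batches. The class name, per-class budget, and random selection policy below are illustrative assumptions, not the paper's memory transformer.
```python
# Hedged sketch of an exemplar memory (rehearsal buffer) for class-incremental learning.
import random
from collections import defaultdict


class ExemplarMemory:
    """Rehearsal buffer: keeps at most `per_class_budget` examples for each class
    seen so far, so they can be replayed while training on later tasks."""

    def __init__(self, per_class_budget: int = 20):
        self.per_class_budget = per_class_budget
        self.store = defaultdict(list)  # class label -> retained examples

    def add_task(self, examples, labels):
        """Call after finishing a task to retain a random subset of its data."""
        by_class = defaultdict(list)
        for x, y in zip(examples, labels):
            by_class[y].append(x)
        for y, xs in by_class.items():
            self.store[y] = random.sample(xs, min(self.per_class_budget, len(xs)))

    def replay_batch(self, batch_size: int):
        """Sample a mixed batch of stored exemplars from all previously seen classes."""
        pool = [(x, y) for y, xs in self.store.items() for x in xs]
        return random.sample(pool, min(batch_size, len(pool))) if pool else []


# Usage: after task 1, store exemplars; during task 2, mix them into each batch.
memory = ExemplarMemory(per_class_budget=10)
memory.add_task(examples=[[0.1 * i, 0.2 * i] for i in range(50)], labels=[0] * 50)
replayed = memory.replay_batch(8)  # old-class examples to interleave with new data
```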
arXiv Detail & Related papers (2022-10-10T08:27:28Z) - TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second [48.87527918630822]
We present TabPFN, a trained Transformer that can do supervised classification for small datasets in less than a second.
TabPFN performs in-context learning (ICL): it learns to make predictions from sequences of labeled examples.
We show that our method clearly outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with up to a 230$\times$ speedup.
arXiv Detail & Related papers (2022-07-05T07:17:43Z)
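For the TabPFN entry above, a minimal usage sketch, assuming the `tabpfn` package's scikit-learn-style interface; the constructor defaults, dataset, and version-specific options here are assumptions and not taken from the paper.
```python
# Hedged usage sketch of in-context tabular classification with TabPFN
# (assumes `pip install tabpfn` and its scikit-learn-style interface).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()           # no gradient-based training on this dataset
clf.fit(X_train, y_train)          # stores the labeled examples as ICL context
proba = clf.predict_proba(X_test)  # predictions conditioned on that context
print((clf.predict(X_test) == y_test).mean())
```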
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.