Cross-Table Pretraining towards a Universal Function Space for Heterogeneous Tabular Data
- URL: http://arxiv.org/abs/2406.00281v1
- Date: Sat, 1 Jun 2024 03:24:31 GMT
- Title: Cross-Table Pretraining towards a Universal Function Space for Heterogeneous Tabular Data
- Authors: Jintai Chen, Zhen Lin, Qiyuan Chen, Jimeng Sun
- Abstract summary: Cross-dataset pretraining has shown notable success in various fields.
In this study, we introduce a cross-table pretrained Transformer, XTFormer, for versatile downstream tabular prediction tasks.
Our approach pretrains XTFormer to establish a "meta-function" space that encompasses all potential feature-target mappings.
- Score: 35.61663559675556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tabular data from different tables exhibit significant diversity due to varied definitions and types of features, as well as complex inter-feature and feature-target relationships. Cross-dataset pretraining, which learns reusable patterns from upstream data to support downstream tasks, has shown notable success in various fields. Yet, when applied to tabular data prediction, this paradigm faces challenges due to the limited reusable patterns among diverse tabular datasets (tables) and the general scarcity of tabular data available for fine-tuning. In this study, we fill this gap by introducing a cross-table pretrained Transformer, XTFormer, for versatile downstream tabular prediction tasks. Our key methodological insight is to pretrain XTFormer to establish a "meta-function" space that encompasses all potential feature-target mappings. During pretraining, a variety of potential mappings are extracted from the pretraining tabular datasets and embedded into the "meta-function" space; for downstream tasks, suitable mappings are then extracted from this space by a specified coordinate positioning approach. Experiments show that, in 190 downstream tabular prediction tasks, our cross-table pretrained XTFormer outperforms both XGBoost and CatBoost on 137 (72%) tasks, and surpasses the representative deep learning model FT-Transformer and the tabular pretraining approach XTab on 144 (76%) and 162 (85%) tasks, respectively.
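The "meta-function" space and coordinate positioning described in the abstract can be illustrated with a toy sketch. This is not the XTFormer architecture or the authors' code: it assumes, purely for illustration, that the space is spanned by a small set of frozen basis functions (random tanh units standing in for pretrained components) and that a downstream task is "positioned" by fitting only a coordinate vector over those functions. All names (`make_basis`, `fit_coordinates`, `predict`) and the ridge least-squares fit are hypothetical choices.

```python
import numpy as np

def make_basis(num_features, num_basis, seed=0):
    """Hypothetical stand-in for pretrained basis functions: random tanh units."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(num_basis, num_features))
    biases = rng.normal(size=num_basis)
    return weights, biases

def apply_basis(X, weights, biases):
    """Evaluate every frozen basis function on every row; shape (n_rows, num_basis)."""
    return np.tanh(X @ weights.T + biases)

def fit_coordinates(X, y, weights, biases, ridge=1e-6):
    """Fit only the task coordinates (mixing weights) by ridge least squares."""
    phi = apply_basis(X, weights, biases)
    gram = phi.T @ phi + ridge * np.eye(phi.shape[1])
    return np.linalg.solve(gram, phi.T @ y)

def predict(X, weights, biases, coords):
    """Downstream prediction: a coordinate-weighted mix of the basis outputs."""
    return apply_basis(X, weights, biases) @ coords

# Toy downstream task whose target lies inside the span of the basis functions.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
weights, biases = make_basis(num_features=5, num_basis=8)
true_coords = rng.normal(size=8)
y = apply_basis(X, weights, biases) @ true_coords

coords = fit_coordinates(X, y, weights, biases)
mse = float(np.mean((predict(X, weights, biases, coords) - y) ** 2))
print(f"coordinates shape: {coords.shape}, training MSE: {mse:.2e}")
```

In this sketch the basis functions stay fixed and only eight numbers are learned per task, which mirrors the general idea of reusing a shared function space when downstream data is scarce. How XTFormer actually builds and indexes its space is specified in the paper, not here.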
Related papers
- TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model.
Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data.
TabDiff achieves superior average performance over existing competitive baselines, with up to 22.5% improvement over the state-of-the-art model on pair-wise column correlation estimations.
arXiv Detail & Related papers (2024-10-27T22:58:47Z)
- LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
- Making Pre-trained Language Models Great on Tabular Prediction [50.70574370855663]
The transferability of deep neural networks (DNNs) has made significant progress in image and language processing.
We present TP-BERTa, a specifically pre-trained LM for tabular data prediction.
A novel relative magnitude tokenization converts scalar numerical feature values to finely discrete, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names.
arXiv Detail & Related papers (2024-03-04T08:38:56Z)
- Tabular Few-Shot Generalization Across Heterogeneous Feature Spaces [43.67453625260335]
We propose a novel approach to few-shot learning involving knowledge sharing between datasets with heterogeneous feature spaces.
FLAT learns low-dimensional embeddings of datasets and their individual columns, which facilitate knowledge transfer and generalization to previously unseen datasets.
A decoder network parametrizes the predictive target network, implemented as a Graph Attention Network, to accommodate the heterogeneous nature of tabular datasets.
arXiv Detail & Related papers (2023-11-16T17:45:59Z)
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement [44.693325083735424]
Tabular data prediction has been employed in medical applications such as patient health risk prediction.
Previous predictors are often trained on manually curated small datasets.
arXiv Detail & Related papers (2023-05-20T03:37:09Z)
- Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low resource regime, missing value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
arXiv Detail & Related papers (2023-05-16T06:37:38Z)
- XTab: Cross-table Pretraining for Tabular Transformers [29.419276738753968]
XTab is a framework for cross-table pretraining of tabular transformers on datasets from various domains.
We show that XTab consistently boosts the generalizability, learning speed, and performance of multiple tabular transformers.
We achieve performance superior to other state-of-the-art tabular deep learning models on various tasks such as regression, binary, and multiclass classification.
arXiv Detail & Related papers (2023-05-10T12:17:52Z)
- Learning Enhanced Representations for Tabular Data via Neighborhood Propagation [24.485479610138498]
We construct a hypergraph to model the cross-row and cross-column patterns of data instances.
We then perform message propagation to enhance the target data instance representation.
Experiments on two important data prediction tasks validate the superiority of the proposed PET model.
arXiv Detail & Related papers (2022-06-14T04:24:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.