UniTabE: A Universal Pretraining Protocol for Tabular Foundation Model
in Data Science
- URL: http://arxiv.org/abs/2307.09249v2
- Date: Wed, 13 Mar 2024 08:20:34 GMT
- Title: UniTabE: A Universal Pretraining Protocol for Tabular Foundation Model
in Data Science
- Authors: Yazheng Yang, Yuqi Wang, Guang Liu, Ledell Wu, Qi Liu
- Abstract summary: This study seeks to extend the power of pretraining methodologies to facilitate the prediction over tables in data science.
We introduce UniTabE, a method designed to process tables in a uniform manner, devoid of constraints imposed by specific table structures.
In order to implement the pretraining phase, we curated an expansive dataset comprising approximately 13B samples, meticulously gathered from the Kaggle platform.
- Score: 16.384705926693073
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in NLP have witnessed the groundbreaking impact of
pretrained models, yielding impressive outcomes across various tasks. This
study seeks to extend the power of pretraining methodologies to facilitating
the prediction over tables in data science, a domain traditionally overlooked,
yet inherently challenging due to the plethora of table schemas intrinsic to
different tasks. The primary research questions underpinning this work revolve
around the establishment of a universal pretraining protocol for tables with
varied structures, the generalizability and transferability of learned
knowledge across tasks, the adaptation to diverse downstream applications, and
the incorporation of incremental columns over time. In response to these
challenges, we introduce UniTabE, a straightforward yet effective method
designed to process tables in a uniform manner, devoid of constraints imposed
by specific table structures. UniTabE's core concept relies on representing
each basic table element with a module, termed TabUnit. This is subsequently
followed by a Transformer encoder to refine the representation. Moreover, our
model is designed to facilitate pretraining and finetuning through the
utilization of free-form prompts. In order to implement the pretraining phase,
we curated an expansive tabular dataset comprising approximately 13B samples,
meticulously gathered from the Kaggle platform. This research primarily centers
on classification and regression tasks involving tabular data, and conducts
rigorous experimental testing and analyses to validate the effectiveness of our
methodology. The experimental results demonstrate UniTabE's superior
performance against several baselines across massive benchmarks. This,
therefore, underscores UniTabE's potential to significantly enhance the
semantic representation of tabular data, thereby marking a significant stride
for tabular data analysis.
Related papers
- Towards Better Understanding Table Instruction Tuning: Decoupling the Effects from Data versus Models [62.47618742274461]
We fine-tune base models from the Mistral, OLMo, and Phi families on existing public training datasets.
Our replication achieves performance on par with or surpassing existing table LLMs.
We decouple the contributions of training data and the base model, providing insight into their individual impacts.
arXiv Detail & Related papers (2025-01-24T18:50:26Z) - LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z) - Making Pre-trained Language Models Great on Tabular Prediction [50.70574370855663]
The transferability of deep neural networks (DNNs) has made significant progress in image and language processing.
We present TP-BERTa, a specifically pre-trained LM for tabular data prediction.
A novel relative magnitude tokenization converts scalar numerical feature values to finely discrete, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names.
arXiv Detail & Related papers (2024-03-04T08:38:56Z) - Rethinking Pre-Training in Tabular Data: A Neighborhood Embedding Perspective [71.45945607871715]
We propose Tabular data Pre-Training via Meta-representation (TabPTM)
The core idea is to embed data instances into a shared feature space, where each instance is represented by its distance to a fixed number of nearest neighbors and their labels.
Extensive experiments on 101 datasets confirm TabPTM's effectiveness in both classification and regression tasks, with and without fine-tuning.
arXiv Detail & Related papers (2023-10-31T18:03:54Z) - Towards Cross-Table Masked Pretraining for Web Data Mining [22.952238405240188]
We propose an innovative, generic, and efficient cross-table pretraining framework, dubbed as CM2.
Our experiments demonstrate CM2's state-of-the-art performance and validate that cross-table pretraining can enhance various downstream tasks.
arXiv Detail & Related papers (2023-07-10T02:27:38Z) - Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low resource regime, missing value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
arXiv Detail & Related papers (2023-05-16T06:37:38Z) - XTab: Cross-table Pretraining for Tabular Transformers [29.419276738753968]
XTab is a framework for cross-table pretraining of tabular transformers on datasets from various domains.
We show that XTab consistently boosts the generalizability, learning speed, and performance of multiple tabular transformers.
We achieve superior performance than other state-of-the-art tabular deep learning models on various tasks such as regression, binary, and multiclass classification.
arXiv Detail & Related papers (2023-05-10T12:17:52Z) - STUNT: Few-shot Tabular Learning with Self-generated Tasks from
Unlabeled Tables [64.0903766169603]
We propose a framework for few-shot semi-supervised learning, coined Self-generated Tasks from UNlabeled Tables (STUNT)
Our key idea is to self-generate diverse few-shot tasks by treating randomly chosen columns as a target label.
We then employ a meta-learning scheme to learn generalizable knowledge with the constructed tasks.
arXiv Detail & Related papers (2023-03-02T02:37:54Z) - Table Pre-training: A Survey on Model Architectures, Pretraining
Objectives, and Downstream Tasks [37.35651138851127]
A flurry of table pre-training frameworks have been proposed following the success of text and images.
Table pre-training usually takes the form of table-text joint pre-training.
This survey aims to provide a comprehensive review of different model designs, pre-training objectives, and downstream tasks for table pre-training.
arXiv Detail & Related papers (2022-01-24T15:22:24Z) - SubTab: Subsetting Features of Tabular Data for Self-Supervised
Representation Learning [5.5616364225463055]
We introduce a new framework, Subsetting features of Tabular data (SubTab)
In this paper, we introduce a new framework, Subsetting features of Tabular data (SubTab)
We argue that reconstructing the data from the subset of its features rather than its corrupted version in an autoencoder setting can better capture its underlying representation.
arXiv Detail & Related papers (2021-10-08T20:11:09Z) - TABBIE: Pretrained Representations of Tabular Data [22.444607481407633]
We devise a simple pretraining objective that learns exclusively from tabular data.
Unlike competing approaches, our model (TABBIE) provides embeddings of all table substructures.
A qualitative analysis of our model's learned cell, column, and row representations shows that it understands complex table semantics and numerical trends.
arXiv Detail & Related papers (2021-05-06T11:15:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.