TransTab: Learning Transferable Tabular Transformers Across Tables
- URL: http://arxiv.org/abs/2205.09328v1
- Date: Thu, 19 May 2022 05:34:46 GMT
- Title: TransTab: Learning Transferable Tabular Transformers Across Tables
- Authors: Zifeng Wang, Jimeng Sun
- Abstract summary: Tabular data (or tables) are the most widely used data format in machine learning (ML).
Before ML modeling, heavy data cleaning is required to merge disparate tables with different columns.
TransTab converts each sample (a row in the table) to a generalizable embedding vector.
- Score: 42.859662256134584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tabular data (or tables) are the most widely used data format in machine
learning (ML). However, ML models often assume that the table structure remains fixed
across training and testing. Before ML modeling, heavy data cleaning is required to
merge disparate tables with different columns. This preprocessing often incurs
significant data waste (e.g., removing unmatched columns and samples). How to
learn ML models from multiple tables with partially overlapping columns? How to
incrementally update ML models as more columns become available over time? Can
we leverage model pretraining on multiple distinct tables? How to train an ML
model which can predict on an unseen table?
To answer all those questions, we propose to relax fixed table structures by
introducing a Transferable Tabular Transformer (TransTab) for tables. The goal
of TransTab is to convert each sample (a row in the table) to a generalizable
embedding vector, and then apply stacked transformers for feature encoding. One
methodological insight is to combine column descriptions and table cells as the raw
input to a gated transformer model. The other insight is to introduce
supervised and self-supervised pretraining to improve model performance. We
compare TransTab with multiple baseline methods on diverse benchmark datasets
and five oncology clinical trial datasets. Overall, TransTab achieves average ranks of 1.00, 1.00, and
1.78 among 12 methods in supervised learning, feature incremental learning,
and transfer learning scenarios, respectively; and the proposed pretraining
leads to a 2.3% AUC lift on average over supervised learning.
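To make the row-to-embedding idea concrete, the sketch below serializes each cell as "column description + value" tokens, embeds them, and pools them with a small Transformer encoder. It is a minimal illustration rather than the authors' implementation: the toy vocabulary, whitespace tokenizer, mean pooling, and layer sizes are placeholders, and the gating mechanism and pretraining objectives are omitted.

```python
# Illustrative sketch of the row-to-embedding idea (not the TransTab authors' code):
# serialize each cell as "column name + value" tokens, embed them, and pool them
# with a small Transformer encoder. Vocabulary and sizes are toy choices.
import torch
import torch.nn as nn

def serialize_row(row: dict) -> list[str]:
    """Turn {"gender": "male", "age": 62} into word tokens that pair
    column descriptions with cell values, so columns generalize across tables."""
    tokens = []
    for col, val in row.items():
        tokens += col.lower().split() + [str(val).lower()]
    return tokens

class RowEncoder(nn.Module):
    def __init__(self, vocab: dict, dim: int = 64, heads: int = 4, layers: int = 2):
        super().__init__()
        self.vocab = vocab
        self.embed = nn.Embedding(len(vocab), dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, rows: list[dict]) -> torch.Tensor:
        ids = [torch.tensor([self.vocab.get(t, 0) for t in serialize_row(r)]) for r in rows]
        padded = nn.utils.rnn.pad_sequence(ids, batch_first=True)  # (batch, tokens)
        hidden = self.encoder(self.embed(padded))                  # (batch, tokens, dim)
        return hidden.mean(dim=1)  # one embedding per row (padding included for simplicity)

rows = [{"gender": "male", "age": 62}, {"gender": "female", "smoker": "yes"}]
vocab = {"<unk>": 0, "gender": 1, "male": 2, "female": 3, "age": 4, "62": 5, "smoker": 6, "yes": 7}
print(RowEncoder(vocab)(rows).shape)  # torch.Size([2, 64])
```

Because the tokens carry the column names themselves, rows from tables with different or partially overlapping columns map into the same embedding space, which is what enables transfer across tables.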
Related papers
- Tabular Transfer Learning via Prompting LLMs [52.96022335067357]
We propose a novel framework, Prompt to Transfer (P2T), that utilizes unlabeled (or heterogeneous) source data with large language models (LLMs).
P2T identifies a column feature in the source dataset that is strongly correlated with a target task feature and uses it to create examples relevant to the target task, which serve as pseudo-demonstrations for prompts.
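As a rough illustration of that step (the correlation criterion and prompt format below are assumptions for the sketch, not P2T's actual procedure): pick the source column most correlated with a proxy of the target feature, then render a few rows as text demonstrations for an LLM prompt.

```python
# Hedged sketch of correlation-based pseudo-demonstration building
# (illustrative only; not the P2T authors' code or prompt format).
import pandas as pd

def build_pseudo_demos(source: pd.DataFrame, target_proxy: pd.Series, k: int = 3) -> str:
    """Pick the numeric source column most correlated with the target proxy and
    turn a few of its rows into text demonstrations for an LLM prompt."""
    numeric = source.select_dtypes("number")
    best_col = numeric.corrwith(target_proxy).abs().idxmax()
    demos = []
    for _, row in source.nlargest(k, best_col).iterrows():
        cells = ", ".join(f"{c} = {v}" for c, v in row.items())
        demos.append(f"Example: {cells} -> {best_col} is {row[best_col]}")
    return "\n".join(demos)

source = pd.DataFrame({"age": [70, 45, 60], "bmi": [31.0, 22.5, 27.8], "site": ["A", "B", "A"]})
target_proxy = pd.Series([1, 0, 1])  # hypothetical proxy of the target task feature
print(build_pseudo_demos(source, target_proxy, k=2))
```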
arXiv Detail & Related papers (2024-08-09T11:30:52Z) - Multimodal Table Understanding [26.652797853893233]
How to understand tables directly from intuitive visual information is a crucial and urgent challenge for developing more practical applications.
We propose a new problem, multimodal table understanding, where the model needs to generate correct responses to various table-related requests.
We develop Table-LLaVA, a generalist multimodal large language model (MLLM), which significantly outperforms recent open-source MLLM baselines on 23 benchmarks.
arXiv Detail & Related papers (2024-06-12T11:27:03Z) - Deep Learning with Tabular Data: A Self-supervised Approach [0.0]
We use a self-supervised learning approach in this study.
The aim is to find the most effective TabTransformer representation of categorical and numerical features.
The research presents a novel approach by creating several variants of the TabTransformer model.
arXiv Detail & Related papers (2024-01-26T23:12:41Z) - TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning [55.33939289989238]
We propose TAP4LLM as a versatile pre-processor suite for leveraging large language models (LLMs) in table-based tasks effectively.
It covers several distinct components: (1) table sampling to decompose large tables into manageable sub-tables based on query semantics, (2) table augmentation to enhance tables with additional knowledge from external sources or models, and (3) table packing & serialization to convert tables into various formats suitable for LLMs' understanding.
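A minimal sketch of the sampling and packing/serialization steps in a pipeline of this kind follows; the term-overlap heuristic and markdown rendering are assumptions chosen for illustration, not TAP4LLM's actual components.

```python
# Illustrative sketch of table sampling + serialization for an LLM prompt
# (heuristics and formats are assumptions, not TAP4LLM's implementation).
import pandas as pd

def sample_rows(table: pd.DataFrame, query: str, max_rows: int = 3) -> pd.DataFrame:
    """Keep the rows that share the most whitespace-separated terms with the query."""
    terms = set(query.lower().split())
    overlap = table.astype(str).apply(
        lambda row: len(terms & set(" ".join(row).lower().split())), axis=1
    )
    return table.loc[overlap.nlargest(max_rows).index]

def serialize(table: pd.DataFrame) -> str:
    """Render a sub-table as a markdown block that an LLM can read."""
    header = "| " + " | ".join(table.columns) + " |"
    sep = "|" + "---|" * len(table.columns)
    rows = ["| " + " | ".join(map(str, r)) + " |" for r in table.itertuples(index=False)]
    return "\n".join([header, sep, *rows])

table = pd.DataFrame({"drug": ["A", "B", "C"], "phase": [2, 3, 1], "status": ["active", "closed", "active"]})
sub = sample_rows(table, "which active drugs are in phase 2", max_rows=2)
print(serialize(sub) + "\nQuestion: which active drugs are in phase 2?")
```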
arXiv Detail & Related papers (2023-12-14T15:37:04Z) - Training-Free Generalization on Heterogeneous Tabular Data via
Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z) - Retrieval-Based Transformer for Table Augmentation [14.460363647772745]
We introduce a novel approach toward automatic data wrangling.
We aim to address table augmentation tasks, including row/column population and data imputation.
Our model consistently and substantially outperforms both supervised statistical methods and the current state-of-the-art transformer-based models.
arXiv Detail & Related papers (2023-06-20T18:51:21Z) - XTab: Cross-table Pretraining for Tabular Transformers [29.419276738753968]
XTab is a framework for cross-table pretraining of tabular transformers on datasets from various domains.
We show that XTab consistently boosts the generalizability, learning speed, and performance of multiple tabular transformers.
We achieve performance superior to other state-of-the-art tabular deep learning models on various tasks such as regression, binary, and multiclass classification.
arXiv Detail & Related papers (2023-05-10T12:17:52Z) - TabRet: Pre-training Transformer-based Tabular Models for Unseen Columns [12.139158398361866]
TabRet is designed to work on a downstream task that contains columns not seen in pre-training.
In experiments, we pre-trained TabRet with a large collection of public health surveys and fine-tuned it on classification tasks in healthcare.
In addition, an ablation study shows that retokenization and random column-shuffle augmentation during pre-training contributed to the performance gains.
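The column-shuffle augmentation is easy to illustrate. The sketch below is a generic version under assumed inputs (a pandas batch), not TabRet's code: it randomly permutes column order at each pretraining step so the model cannot rely on the positional identity of columns.

```python
# Generic sketch of random column-shuffle augmentation for tabular pretraining
# (not TabRet's implementation; the batch format here is an assumption).
import random
import pandas as pd

def shuffle_columns(batch: pd.DataFrame, rng: random.Random) -> pd.DataFrame:
    """Return the batch with its columns in a random order, values unchanged."""
    cols = list(batch.columns)
    rng.shuffle(cols)
    return batch[cols]

rng = random.Random(0)
batch = pd.DataFrame({"age": [34, 51], "bmi": [22.1, 30.4], "smoker": ["no", "yes"]})
for step in range(2):  # each pretraining step sees a different column order
    print(f"step {step}: {list(shuffle_columns(batch, rng).columns)}")
```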
arXiv Detail & Related papers (2023-03-28T06:03:41Z) - OmniTab: Pretraining with Natural and Synthetic Data for Few-shot
Table-based Question Answering [106.73213656603453]
We develop a simple table-based QA model with minimal annotation effort.
We propose an omnivorous pretraining approach that consumes both natural and synthetic data.
arXiv Detail & Related papers (2022-07-08T01:23:45Z) - Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program written in IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)