UniPredict: Large Language Models are Universal Tabular Classifiers
- URL: http://arxiv.org/abs/2310.03266v2
- Date: Tue, 16 Jan 2024 20:15:18 GMT
- Title: UniPredict: Large Language Models are Universal Tabular Classifiers
- Authors: Ruiyu Wang, Zifeng Wang, Jimeng Sun
- Abstract summary: This paper exploits the idea of building universal tabular data predictors based on generative modeling, namely UniPredict.
We train a single LLM on an aggregation of 169 datasets with diverse targets and compare its performance against baselines that are trained on each dataset separately.
We observe that this versatile UniPredict model outperforms other models, with an advantage of 5.4% over the best tree-boosting baseline and 13.4% over the best neural network baseline.
- Score: 33.811778526930745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tabular data prediction is a fundamental machine learning task for many
applications. Existing methods predominantly employ discriminative modeling and
operate under the assumption of a fixed target column, necessitating
re-training for every new predictive task. Inspired by the generative power of
large language models (LLMs), this paper exploits the idea of building
universal tabular data predictors based on generative modeling, namely
UniPredict. Here, we demonstrate the scalability of an LLM to extensive tabular
datasets, enabling it to comprehend diverse tabular inputs and predict target
variables following the provided instructions. Specifically, we train a single
LLM on an aggregation of 169 tabular datasets with diverse targets and compare
its performance against baselines that are trained on each dataset separately.
We observe that this versatile UniPredict model outperforms other models, with
an advantage of 5.4% over the best tree-boosting baseline and 13.4% over the
best neural network baseline. We further test UniPredict in few-shot learning
settings on another 62 tabular datasets. Our method achieves strong performance
in quickly adapting to new tasks. In the low-resource few-shot setup, we
observe a performance advantage of more than 100% over XGBoost and a
significant margin over all baselines. We envision that
UniPredict sheds light on developing a universal tabular data prediction system
that learns from data at scale and serves a wide range of prediction tasks.
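The core mechanism is straightforward to sketch: each row is serialized into natural language together with the schema and an instruction naming the target, and a single fine-tuned LLM generates the label as text. The template and field names below are illustrative assumptions, not UniPredict's exact prompt format.

```python
# Illustrative sketch of generative tabular prediction; the prompt template and
# field names are assumptions, not UniPredict's exact format.
from typing import Dict, List

def serialize_row(features: Dict[str, object], target_name: str, classes: List[str]) -> str:
    """Turn one tabular row plus task metadata into an instruction-style prompt."""
    feature_text = "; ".join(f"{name} is {value}" for name, value in features.items())
    return (
        "Below is a sample from a tabular dataset.\n"
        f"Features: {feature_text}.\n"
        f"Predict '{target_name}'. Answer with one of: {', '.join(classes)}.\n"
        "Answer:"
    )

prompt = serialize_row(
    {"age": 52, "chest_pain_type": "typical angina", "cholesterol": 212},
    target_name="heart_disease",
    classes=["yes", "no"],
)
# A single LLM fine-tuned on many such prompts generates the label as text;
# parsing the generated tokens yields the prediction.
print(prompt)
```

Because the prompt carries both the feature values and the target description, one model can serve many datasets without per-dataset retraining.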
Related papers
- LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science.
Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
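As a rough illustration of the setup (the prompt wording and the commented-out `ask_llm` helper are hypothetical stand-ins, not the paper's interface), feature selection reduces to asking the model to rank candidate columns for a described prediction task:

```python
# Hypothetical sketch of LLM-driven feature selection; `ask_llm` stands in for
# any chat-completion call and is left commented out.
from typing import List

def feature_selection_prompt(task: str, features: List[str], k: int) -> str:
    listed = "\n".join(f"- {name}" for name in features)
    return (
        f"Task: predict {task}.\n"
        f"Candidate features:\n{listed}\n"
        f"List the {k} features most predictive of the target, one per line."
    )

prompt = feature_selection_prompt(
    task="30-day hospital readmission",
    features=["age", "num_prior_admissions", "hemoglobin", "zip_code", "favorite_color"],
    k=3,
)
# selected = ask_llm(prompt).splitlines()  # parse the returned feature names
```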
arXiv Detail & Related papers (2024-07-02T22:23:40Z)
- LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that a fine-tuned LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
- Large Scale Transfer Learning for Tabular Data via Language Modeling [30.44823668480631]
We present TabuLa-8B, a language model for tabular prediction.
We use a dataset of over 2.1B rows from over 4M unique tables.
We find that TabuLa-8B has zero-shot accuracy on unseen tables that is over 15 percentage points (pp) higher than random guessing.
arXiv Detail & Related papers (2024-06-17T18:58:20Z)
- Making Pre-trained Language Models Great on Tabular Prediction [50.70574370855663]
The transferability of deep neural networks (DNNs) has enabled significant progress in image and language processing.
We present TP-BERTa, a specifically pre-trained LM for tabular data prediction.
A novel relative magnitude tokenization converts scalar numerical feature values to finely discrete, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names.
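The tokenization idea can be pictured as quantile-style binning of each numeric column into a small vocabulary of magnitude tokens; the bin count and token naming below are assumptions for illustration, not TP-BERTa's actual scheme.

```python
# Illustrative binning sketch of the "numeric value -> discrete magnitude token" idea
# (bin count and token names are assumptions, not TP-BERTa's actual tokenizer).
import numpy as np

def fit_bins(values: np.ndarray, n_bins: int = 16) -> np.ndarray:
    """Estimate quantile bin edges for one numerical feature on the training split."""
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(values, quantiles)

def to_magnitude_token(value: float, edges: np.ndarray, feature_name: str) -> str:
    """Map a scalar to a discrete token that encodes its relative magnitude."""
    bin_id = int(np.searchsorted(edges, value))
    return f"[{feature_name}|MAG_{bin_id}]"

ages = np.array([23, 31, 35, 42, 47, 53, 58, 61, 66, 74], dtype=float)
edges = fit_bins(ages, n_bins=4)
print(to_magnitude_token(29.0, edges, "age"))  # -> [age|MAG_0]
print(to_magnitude_token(63.0, edges, "age"))  # -> [age|MAG_3]
```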
arXiv Detail & Related papers (2024-03-04T08:38:56Z)
- SMUTF: Schema Matching Using Generative Tags and Hybrid Features [6.471515752693932]
SMUTF assumes that supervised learning does not affect performance in open-domain tasks.
In an innovative adaptation inspired by the Humanitarian Exchange Language, we deploy 'generative tags' for each data column.
SMUTF exhibits extensive versatility, working seamlessly with any pre-existing pre-trained embeddings, classification methods, and generative models.
arXiv Detail & Related papers (2024-01-22T08:47:50Z)
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
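One plausible reading of such a meta-representation, sketched below under the assumption that it is distance-based, describes every instance by its distances to the nearest training examples of each class, so the descriptor has the same size regardless of the dataset's feature count.

```python
# A plausible, distance-based instantiation of a meta-representation (an assumption,
# not necessarily TabPTM's exact construction): each instance is summarized by its
# k smallest distances to training examples of every class.
import numpy as np

def meta_representation(x, X_train, y_train, classes, k=3):
    """Fixed-size descriptor (assuming >= k examples per class): k nearest distances per class."""
    parts = []
    for c in classes:
        d = np.linalg.norm(X_train[y_train == c] - x, axis=1)
        parts.append(np.sort(d)[:k])
    return np.concatenate(parts)  # shape: (k * len(classes),)

# A shared neural network trained across many datasets can then map these
# descriptors to classification confidences, enabling transfer to new tabular
# tasks without dataset-specific retraining.
```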
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- From Supervised to Generative: A Novel Paradigm for Tabular Deep Learning with Large Language Models [18.219485459836285]
Generative Tabular Learning (GTL) is a novel framework that integrates the advanced functionalities of large language models (LLMs).
Our empirical study spans 384 public datasets, rigorously analyzing GTL's scaling behaviors.
The GTL-LLaMA-2 model demonstrates superior zero-shot and in-context learning capabilities across numerous classification and regression tasks.
arXiv Detail & Related papers (2023-10-11T09:37:38Z)
- Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
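A loose sketch of that pipeline role: any predictor that beats chance, including an LLM queried through prompts, can be slotted into a boosting loop. The skeleton below is textbook AdaBoost around a generic `weak_fit` callable rather than the paper's summary-based procedure; the LLM-backed learner is only described in the closing comment.

```python
# Textbook AdaBoost skeleton around a generic weak learner; an LLM-backed predictor
# could be supplied as `weak_fit` (a hypothetical interface, not the paper's code).
import numpy as np

def adaboost(X, y, weak_fit, n_rounds=10):
    """y must be in {-1, +1}; weak_fit(X, y, sample_weight) returns a predict(X) callable."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    learners, alphas = [], []
    for _ in range(n_rounds):
        predict = weak_fit(X, y, sample_weight=w)
        pred = predict(X)
        err = float(np.sum(w * (pred != y)))          # weighted error, w sums to 1
        err = min(max(err, 1e-10), 1.0 - 1e-10)       # keep the log-ratio finite
        alpha = 0.5 * np.log((1.0 - err) / err)
        w = w * np.exp(-alpha * y * pred)             # up-weight misclassified rows
        w = w / w.sum()
        learners.append(predict)
        alphas.append(alpha)

    def ensemble(X_new):
        score = sum(a * p(X_new) for a, p in zip(alphas, learners))
        return np.sign(score)

    return ensemble

# With a prompted LLM, `weak_fit` could resample rows according to `sample_weight`,
# place them in a few-shot prompt, and return a function that parses the model's
# answers into {-1, +1}.
```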
arXiv Detail & Related papers (2023-06-25T02:39:19Z)
- Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt to leverage table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low resource regime, missing value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
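The combination with a standard backbone is simple to sketch: synthetic rows are concatenated with the real training split before fitting, for example, a LightGBM classifier. The generator output below is a placeholder DataFrame argument, not TapTap's actual API.

```python
# Sketch of the synthetic-augmentation pattern; `synthetic_df` stands in for
# generator output (not TapTap's actual API), and LightGBM is the backbone.
import pandas as pd
import lightgbm as lgb

def train_with_synthetic(real_df: pd.DataFrame, synthetic_df: pd.DataFrame, target: str):
    """Concatenate real and generated rows, then fit an ordinary backbone model."""
    augmented = pd.concat([real_df, synthetic_df], ignore_index=True)
    X, y = augmented.drop(columns=[target]), augmented[target]
    model = lgb.LGBMClassifier(n_estimators=200)
    model.fit(X, y)
    return model
```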
arXiv Detail & Related papers (2023-05-16T06:37:38Z)
- Ensemble Machine Learning Model Trained on a New Synthesized Dataset Generalizes Well for Stress Prediction Using Wearable Devices [3.006016887654771]
We investigate the generalization ability of models built on datasets containing a small number of subjects, recorded in single study protocols.
We propose and evaluate the use of ensemble techniques by combining gradient boosting with an artificial neural network to measure predictive power on new, unseen data.
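A minimal sketch of that kind of combination uses scikit-learn's soft-voting ensemble over a gradient-boosting model and a small neural network; the specific estimators and hyperparameters are assumptions rather than the paper's exact configuration.

```python
# Soft-voting combination of gradient boosting and a small neural network;
# estimators and hyperparameters are illustrative, not the paper's configuration.
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier

ensemble = VotingClassifier(
    estimators=[
        ("gbm", GradientBoostingClassifier(n_estimators=300)),
        ("ann", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)),
    ],
    voting="soft",  # average predicted class probabilities across the two models
)
# ensemble.fit(X_train, y_train); probs = ensemble.predict_proba(X_test)
```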
arXiv Detail & Related papers (2022-09-30T00:20:57Z)
- Why do tree-based models still outperform deep learning on tabular data? [0.0]
We show that tree-based models remain state-of-the-art on medium-sized data.
We conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs).
arXiv Detail & Related papers (2022-07-18T08:36:08Z)