UniPredict: Large Language Models are Universal Tabular Classifiers
- URL: http://arxiv.org/abs/2310.03266v2
- Date: Tue, 16 Jan 2024 20:15:18 GMT
- Title: UniPredict: Large Language Models are Universal Tabular Classifiers
- Authors: Ruiyu Wang, Zifeng Wang, Jimeng Sun
- Abstract summary: This paper exploits the idea of building universal tabular data predictors based on generative modeling, namely UniPredict.
We train a single LLM on an aggregation of 169 datasets with diverse targets and compare its performance against baselines that are trained on each dataset separately.
We observe that this versatile UniPredict model outperforms other models, with an advantage of 5.4% over the best tree-boosting baseline and 13.4% over the best neural network baseline.
- Score: 33.811778526930745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tabular data prediction is a fundamental machine learning task for many
applications. Existing methods predominantly employ discriminative modeling and
operate under the assumption of a fixed target column, necessitating
re-training for every new predictive task. Inspired by the generative power of
large language models (LLMs), this paper exploits the idea of building
universal tabular data predictors based on generative modeling, namely
UniPredict. Here, we demonstrate the scalability of an LLM to extensive tabular
datasets, enabling it to comprehend diverse tabular inputs and predict target
variables following the provided instructions. Specifically, we train a single
LLM on an aggregation of 169 tabular datasets with diverse targets and compare
its performance against baselines that are trained on each dataset separately.
We observe that this versatile UniPredict model outperforms other models, with
an advantage of 5.4% over the best tree-boosting baseline and 13.4% over the
best neural network baseline. We further test UniPredict in few-shot learning
settings on another 62 tabular datasets. Our method achieves strong performance
in quickly adapting to new tasks. In the low-resource few-shot setup, we
observe a performance advantage of more than 100% over XGBoost and a
significant margin over all baselines. We envision that
UniPredict sheds light on developing a universal tabular data prediction system
that learns from data at scale and serves a wide range of prediction tasks.
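The core mechanism is straightforward to sketch: each row is serialized into natural language together with the schema and an instruction naming the target, and a single fine-tuned LLM generates the label as text. The template and field names below are illustrative assumptions, not UniPredict's exact prompt format.

```python
# Illustrative sketch of generative tabular prediction; the prompt template and
# field names are assumptions, not UniPredict's exact format.
from typing import Dict, List

def serialize_row(features: Dict[str, object], target_name: str, classes: List[str]) -> str:
    """Turn one tabular row plus task metadata into an instruction-style prompt."""
    feature_text = "; ".join(f"{name} is {value}" for name, value in features.items())
    return (
        "Below is a sample from a tabular dataset.\n"
        f"Features: {feature_text}.\n"
        f"Predict '{target_name}'. Answer with one of: {', '.join(classes)}.\n"
        "Answer:"
    )

prompt = serialize_row(
    {"age": 52, "chest_pain_type": "typical angina", "cholesterol": 212},
    target_name="heart_disease",
    classes=["yes", "no"],
)
# A single LLM fine-tuned on many such prompts generates the label as text;
# parsing the generated tokens yields the prediction.
print(prompt)
```

Because the prompt carries both the feature values and the target description, one model can serve many datasets without per-dataset retraining.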
Related papers
- LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science.
Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
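As a rough illustration of the setup (the prompt wording and the commented-out `ask_llm` helper are hypothetical stand-ins, not the paper's interface), feature selection reduces to asking the model to rank candidate columns for a described prediction task:

```python
# Hypothetical sketch of LLM-driven feature selection; `ask_llm` stands in for
# any chat-completion call and is left commented out.
from typing import List

def feature_selection_prompt(task: str, features: List[str], k: int) -> str:
    listed = "\n".join(f"- {name}" for name in features)
    return (
        f"Task: predict {task}.\n"
        f"Candidate features:\n{listed}\n"
        f"List the {k} features most predictive of the target, one per line."
    )

prompt = feature_selection_prompt(
    task="30-day hospital readmission",
    features=["age", "num_prior_admissions", "hemoglobin", "zip_code", "favorite_color"],
    k=3,
)
# selected = ask_llm(prompt).splitlines()  # parse the returned feature names
```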
arXiv Detail & Related papers (2024-07-02T22:23:40Z)
- LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that a fine-tuned LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
- Large Scale Transfer Learning for Tabular Data via Language Modeling [30.44823668480631]
We present TabuLa-8B, a language model for tabular prediction.
We use a dataset of over 2.1B rows from over 4M unique tables.
We find that TabuLa-8B has zero-shot accuracy on unseen tables that is over 15 percentage points (pp) higher than random guessing.
arXiv Detail & Related papers (2024-06-17T18:58:20Z)
- Making Pre-trained Language Models Great on Tabular Prediction [50.70574370855663]
The transferability of deep neural networks (DNNs) has enabled significant progress in image and language processing.
We present TP-BERTa, a specifically pre-trained LM for tabular data prediction.
A novel relative magnitude tokenization converts scalar numerical feature values to finely discrete, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names.
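The tokenization idea can be pictured as quantile-style binning of each numeric column into a small vocabulary of magnitude tokens; the bin count and token naming below are assumptions for illustration, not TP-BERTa's actual scheme.

```python
# Illustrative binning sketch of the "numeric value -> discrete magnitude token" idea
# (bin count and token names are assumptions, not TP-BERTa's actual tokenizer).
import numpy as np

def fit_bins(values: np.ndarray, n_bins: int = 16) -> np.ndarray:
    """Estimate quantile bin edges for one numerical feature on the training split."""
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(values, quantiles)

def to_magnitude_token(value: float, edges: np.ndarray, feature_name: str) -> str:
    """Map a scalar to a discrete token that encodes its relative magnitude."""
    bin_id = int(np.searchsorted(edges, value))
    return f"[{feature_name}|MAG_{bin_id}]"

ages = np.array([23, 31, 35, 42, 47, 53, 58, 61, 66, 74], dtype=float)
edges = fit_bins(ages, n_bins=4)
print(to_magnitude_token(29.0, edges, "age"))  # -> [age|MAG_0]
print(to_magnitude_token(63.0, edges, "age"))  # -> [age|MAG_3]
```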
arXiv Detail & Related papers (2024-03-04T08:38:56Z)
- SMUTF: Schema Matching Using Generative Tags and Hybrid Features [6.471515752693932]
SMUTF assumes that supervised learning does not affect performance in open-domain tasks.
In an innovative adaptation inspired by the Humanitarian Exchange Language, we deploy 'generative tags' for each data column.
SMUTF exhibits extensive versatility, working seamlessly with any pre-existing pre-trained embeddings, classification methods, and generative models.
arXiv Detail & Related papers (2024-01-22T08:47:50Z)
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
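One plausible reading of such a meta-representation, sketched below under the assumption that it is distance-based, describes every instance by its distances to the nearest training examples of each class, so the descriptor has the same size regardless of the dataset's feature count.

```python
# A plausible, distance-based instantiation of a meta-representation (an assumption,
# not necessarily TabPTM's exact construction): each instance is summarized by its
# k smallest distances to training examples of every class.
import numpy as np

def meta_representation(x, X_train, y_train, classes, k=3):
    """Fixed-size descriptor (assuming >= k examples per class): k nearest distances per class."""
    parts = []
    for c in classes:
        d = np.linalg.norm(X_train[y_train == c] - x, axis=1)
        parts.append(np.sort(d)[:k])
    return np.concatenate(parts)  # shape: (k * len(classes),)

# A shared neural network trained across many datasets can then map these
# descriptors to classification confidences, enabling transfer to new tabular
# tasks without dataset-specific retraining.
```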
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- From Supervised to Generative: A Novel Paradigm for Tabular Deep Learning with Large Language Models [18.219485459836285]
Generative Tabular Learning (GTL) is a novel framework that integrates the advanced functionalities of large language models (LLMs).
Our empirical study spans 384 public datasets, rigorously analyzing GTL's scaling behaviors.
The GTL-LLaMA-2 model demonstrates superior zero-shot and in-context learning capabilities across numerous classification and regression tasks.
arXiv Detail & Related papers (2023-10-11T09:37:38Z)
- Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
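A loose sketch of that pipeline role: any predictor that beats chance, including an LLM queried through prompts, can be slotted into a boosting loop. The skeleton below is textbook AdaBoost around a generic `weak_fit` callable rather than the paper's summary-based procedure; the LLM-backed learner is only described in the closing comment.

```python
# Textbook AdaBoost skeleton around a generic weak learner; an LLM-backed predictor
# could be supplied as `weak_fit` (a hypothetical interface, not the paper's code).
import numpy as np

def adaboost(X, y, weak_fit, n_rounds=10):
    """y must be in {-1, +1}; weak_fit(X, y, sample_weight) returns a predict(X) callable."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    learners, alphas = [], []
    for _ in range(n_rounds):
        predict = weak_fit(X, y, sample_weight=w)
        pred = predict(X)
        err = float(np.sum(w * (pred != y)))          # weighted error, w sums to 1
        err = min(max(err, 1e-10), 1.0 - 1e-10)       # keep the log-ratio finite
        alpha = 0.5 * np.log((1.0 - err) / err)
        w = w * np.exp(-alpha * y * pred)             # up-weight misclassified rows
        w = w / w.sum()
        learners.append(predict)
        alphas.append(alpha)

    def ensemble(X_new):
        score = sum(a * p(X_new) for a, p in zip(alphas, learners))
        return np.sign(score)

    return ensemble

# With a prompted LLM, `weak_fit` could resample rows according to `sample_weight`,
# place them in a few-shot prompt, and return a function that parses the model's
# answers into {-1, +1}.
```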
arXiv Detail & Related papers (2023-06-25T02:39:19Z)
- Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt to leverage table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low resource regime, missing value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
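The combination with a standard backbone is simple to sketch: synthetic rows are concatenated with the real training split before fitting, for example, a LightGBM classifier. The generator output below is a placeholder DataFrame argument, not TapTap's actual API.

```python
# Sketch of the synthetic-augmentation pattern; `synthetic_df` stands in for
# generator output (not TapTap's actual API), and LightGBM is the backbone.
import pandas as pd
import lightgbm as lgb

def train_with_synthetic(real_df: pd.DataFrame, synthetic_df: pd.DataFrame, target: str):
    """Concatenate real and generated rows, then fit an ordinary backbone model."""
    augmented = pd.concat([real_df, synthetic_df], ignore_index=True)
    X, y = augmented.drop(columns=[target]), augmented[target]
    model = lgb.LGBMClassifier(n_estimators=200)
    model.fit(X, y)
    return model
```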
arXiv Detail & Related papers (2023-05-16T06:37:38Z)
- Ensemble Machine Learning Model Trained on a New Synthesized Dataset Generalizes Well for Stress Prediction Using Wearable Devices [3.006016887654771]
We investigate the generalization ability of models built on datasets containing a small number of subjects, recorded in single study protocols.
We propose and evaluate the use of ensemble techniques by combining gradient boosting with an artificial neural network to measure predictive power on new, unseen data.
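A minimal sketch of that kind of combination uses scikit-learn's soft-voting ensemble over a gradient-boosting model and a small neural network; the specific estimators and hyperparameters are assumptions rather than the paper's exact configuration.

```python
# Soft-voting combination of gradient boosting and a small neural network;
# estimators and hyperparameters are illustrative, not the paper's configuration.
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier

ensemble = VotingClassifier(
    estimators=[
        ("gbm", GradientBoostingClassifier(n_estimators=300)),
        ("ann", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)),
    ],
    voting="soft",  # average predicted class probabilities across the two models
)
# ensemble.fit(X_train, y_train); probs = ensemble.predict_proba(X_test)
```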
arXiv Detail & Related papers (2022-09-30T00:20:57Z)
- Why do tree-based models still outperform deep learning on tabular data? [0.0]
We show that tree-based models remain state-of-the-art on medium-sized data.
We conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs).
arXiv Detail & Related papers (2022-07-18T08:36:08Z)