Scaling TabPFN: Sketching and Feature Selection for Tabular Prior-Data Fitted Networks
- URL: http://arxiv.org/abs/2311.10609v1
- Date: Fri, 17 Nov 2023 16:04:27 GMT
- Title: Scaling TabPFN: Sketching and Feature Selection for Tabular Prior-Data Fitted Networks
- Authors: Benjamin Feuer, Chinmay Hegde, Niv Cohen
- Abstract summary: Tabular classification has traditionally relied on supervised algorithms, which estimate the parameters of a prediction model using its training data.
Recently, Prior-Data Fitted Networks (PFNs) such as TabPFN have successfully learned to classify tabular data in-context.
While such models show great promise, their applicability to real-world data remains limited due to the computational scale needed.
- Score: 31.82225213006849
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tabular classification has traditionally relied on supervised algorithms,
which estimate the parameters of a prediction model using its training data.
Recently, Prior-Data Fitted Networks (PFNs) such as TabPFN have successfully
learned to classify tabular data in-context: the model parameters are designed
to classify new samples based on labelled training samples given after the
model training. While such models show great promise, their applicability to
real-world data remains limited due to the computational scale needed. Here we
study the following question: given a pre-trained PFN for tabular data, what is
the best way to summarize the labelled training samples before feeding them to
the model? We conduct an initial investigation of sketching and
feature-selection methods for TabPFN, and note certain key differences between
it and conventionally fitted tabular models.
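To make the question concrete, here is a minimal sketch, under assumptions, of the kind of pipeline the paper studies: compress the labelled training set by row sketching and feature selection before handing it to a pre-trained TabPFN. Plain random subsampling and mutual-information feature selection are stand-ins for the candidate methods the paper compares, the covertype dataset is an arbitrary example, and the interface follows the public `tabpfn` package; the caps of 1,000 rows and 100 features are the context limits the original TabPFN was trained with.

```python
# Minimal sketch: compress the labelled training set before in-context
# classification with a pre-trained TabPFN. Row sketching is plain random
# subsampling; feature selection is mutual information -- both stand-ins
# for the methods compared in the paper.
import numpy as np
from sklearn.datasets import fetch_covtype
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier  # public `tabpfn` package assumed

MAX_ROWS, MAX_FEATURES = 1000, 100   # context limits of the original TabPFN

def summarize(X, y, rng):
    """Sketch rows, then select features, to fit TabPFN's context budget."""
    if len(X) > MAX_ROWS:            # row sketch: uniform random subsample
        idx = rng.choice(len(X), size=MAX_ROWS, replace=False)
        X, y = X[idx], y[idx]
    selector = SelectKBest(mutual_info_classif,
                           k=min(MAX_FEATURES, X.shape[1]))
    return selector.fit_transform(X, y), y, selector

rng = np.random.default_rng(0)
data = fetch_covtype()               # 581k rows, 54 features, 7 classes
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target,
                                           test_size=0.2, random_state=0)
X_ctx, y_ctx, selector = summarize(X_tr, y_tr, rng)

clf = TabPFNClassifier(device="cpu")
clf.fit(X_ctx, y_ctx)                # "fit" only stores the in-context set
# small eval slice keeps the CPU demo quick
print("accuracy:", clf.score(selector.transform(X_te[:500]), y_te[:500]))
```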
Related papers
- A Data-Centric Perspective on Evaluating Machine Learning Models for Tabular Data [9.57464542357693]
This paper demonstrates that model-centric evaluations are biased, as real-world modeling pipelines often require dataset-specific preprocessing and feature engineering.
We select 10 relevant datasets from Kaggle competitions and implement expert-level preprocessing pipelines for each dataset.
After dataset-specific feature engineering, model rankings change considerably, performance differences shrink, and model selection becomes less important.
arXiv Detail & Related papers (2024-07-02T09:54:39Z)
- TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes [25.169832192255956]
We present TabSketchFM, a neural tabular model for data discovery over data lakes.
We finetune the pretrained model for identifying unionable, joinable, and subset table pairs.
Our results demonstrate significant improvements in F1 scores for search compared to state-of-the-art techniques.
arXiv Detail & Related papers (2024-06-28T17:28:53Z)
- Tokenize features, enhancing tables: the FT-TABPFN model for tabular classification [13.481699494376809]
FT-TabPFN is an enhanced version of TabPFN that includes a novel Feature Tokenization layer to better handle categorical features.
Our full source code is available for community use and development.
arXiv Detail & Related papers (2024-06-11T02:13:46Z)
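A minimal sketch of the feature-tokenization idea in the FT-TabPFN entry above, assuming a generic FT-Transformer-style layer rather than FT-TabPFN's actual implementation: categorical values get learned embeddings and numeric values get per-feature affine projections, so every feature enters the network as a token of equal width. All names and dimensions are illustrative.

```python
# Illustrative feature-tokenization layer: categorical features become
# learned embeddings, numeric features become per-feature affine
# projections, and both emerge as d_model-wide tokens.
import torch
import torch.nn as nn

class FeatureTokenizer(nn.Module):
    def __init__(self, cat_cardinalities, num_numeric, d_model=96):
        super().__init__()
        # One embedding table per categorical column.
        self.cat_embeds = nn.ModuleList(
            nn.Embedding(card, d_model) for card in cat_cardinalities
        )
        # Each numeric feature gets its own weight/bias vector.
        self.num_weight = nn.Parameter(torch.randn(num_numeric, d_model))
        self.num_bias = nn.Parameter(torch.zeros(num_numeric, d_model))

    def forward(self, x_cat, x_num):
        # x_cat: (batch, n_cat) integer codes; x_num: (batch, n_num) floats
        cat_tokens = torch.stack(
            [emb(x_cat[:, i]) for i, emb in enumerate(self.cat_embeds)], dim=1
        )                                             # (batch, n_cat, d_model)
        num_tokens = x_num.unsqueeze(-1) * self.num_weight + self.num_bias
        return torch.cat([cat_tokens, num_tokens], dim=1)

tok = FeatureTokenizer(cat_cardinalities=[4, 12], num_numeric=3)
tokens = tok(torch.randint(0, 4, (8, 2)), torch.randn(8, 3))
print(tokens.shape)  # torch.Size([8, 5, 96])
```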
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt to leverage table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low-resource regimes, missing-value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
arXiv Detail & Related papers (2023-05-16T06:37:38Z)
- Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models [107.05966685291067]
We propose test-time prompt tuning (TPT) to learn adaptive prompts on the fly with a single test sample.
TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average.
In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data.
arXiv Detail & Related papers (2022-09-15T17:55:11Z)
- PTab: Using the Pre-trained Language Model for Modeling Tabular Data [5.791972449406902]
Recent studies show that neural-based models are effective at learning contextual representations of tabular data.
We propose PTab, a novel framework that uses a pre-trained language model to model tabular data.
Our method has achieved a better average AUC score in supervised settings compared to the state-of-the-art baselines.
arXiv Detail & Related papers (2022-09-15T08:58:42Z)
- Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments [5.625056584412003]
We develop a principled algorithm for estimating the contribution of training data points to a deep learning model.
Our algorithm estimates the AME, a quantity that measures the expected (average) marginal effect of adding a data point to a subset of the training data.
arXiv Detail & Related papers (2022-06-20T21:27:18Z)
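A brute-force toy estimator for the AME described above: sample random training subsets and average the change in test utility when one point is added. This conveys only the definition; the paper's actual algorithm is far more sample-efficient. The logistic-regression proxy model, dataset, and subset rate are all illustrative assumptions.

```python
# Brute-force Monte Carlo estimate of the Average Marginal Effect (AME)
# of one training point: average of utility(S + {i}) - utility(S) over
# random subsets S. Illustrative only -- retraining per draw does not scale.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(
    *load_breast_cancer(return_X_y=True), test_size=0.3, random_state=0)

def utility(idx):
    if len(np.unique(y_tr[idx])) < 2:          # degenerate subset
        return 0.5
    model = LogisticRegression(max_iter=2000).fit(X_tr[idx], y_tr[idx])
    return model.score(X_te, y_te)             # test accuracy as utility

def ame(i, n_draws=200, p=0.3, seed=0):
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_draws):
        mask = rng.random(len(X_tr)) < p       # random subset without point i
        mask[i] = False
        subset = np.flatnonzero(mask)
        diffs.append(utility(np.append(subset, i)) - utility(subset))
    return float(np.mean(diffs))

print(f"estimated AME of training point 0: {ame(0):+.4f}")
```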
- Datamodels: Predicting Predictions from Training Data [86.66720175866415]
We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data.
We show that even simple linear datamodels can successfully predict model outputs.
arXiv Detail & Related papers (2022-02-01T18:15:24Z)
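A toy linear datamodel, under assumptions: train a model per random training subset, record its output on one fixed test example, then fit a linear map from the binary subset-membership vector to that output. The logistic-regression learner and unregularized least squares are illustrative stand-ins, not the paper's setup.

```python
# Toy linear datamodel: regress a model's output on one target example
# against the binary mask of which training points were included.
# Logistic-regression learner and plain least squares are stand-ins.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(
    *load_breast_cancer(return_X_y=True), test_size=0.3, random_state=1)
target = X_te[:1]                      # fixed example whose output we model

rng = np.random.default_rng(1)
masks, outputs = [], []
for _ in range(300):                   # (subset, output) pairs for regression
    mask = rng.random(len(X_tr)) < 0.5
    model = LogisticRegression(max_iter=2000).fit(X_tr[mask], y_tr[mask])
    masks.append(mask.astype(float))
    outputs.append(model.predict_proba(target)[0, 1])

datamodel = LinearRegression().fit(np.array(masks), np.array(outputs))

# Sanity check on a fresh subset: surrogate vs. actually retrained model.
mask = rng.random(len(X_tr)) < 0.5
model = LogisticRegression(max_iter=2000).fit(X_tr[mask], y_tr[mask])
print(f"true {model.predict_proba(target)[0, 1]:.3f} "
      f"vs datamodel {datamodel.predict(mask.astype(float)[None])[0]:.3f}")
```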
- Explanation-Guided Training for Cross-Domain Few-Shot Classification [96.12873073444091]
The cross-domain few-shot classification (CD-FSC) task combines few-shot classification with the requirement to generalize across domains represented by datasets.
We introduce a novel training approach for existing FSC models.
We show that explanation-guided training effectively improves the model generalization.
arXiv Detail & Related papers (2020-07-17T07:28:08Z)
- Pre-Trained Models for Heterogeneous Information Networks [57.78194356302626]
We propose a self-supervised pre-training and fine-tuning framework, PF-HIN, to capture the features of a heterogeneous information network.
PF-HIN consistently and significantly outperforms state-of-the-art alternatives on each evaluated task across four datasets.
arXiv Detail & Related papers (2020-07-07T03:36:28Z)