Squeezing Lemons with Hammers: An Evaluation of AutoML and Tabular Deep Learning for Data-Scarce Classification Applications
- URL: http://arxiv.org/abs/2405.07662v1
- Date: Mon, 13 May 2024 11:43:38 GMT
- Title: Squeezing Lemons with Hammers: An Evaluation of AutoML and Tabular Deep Learning for Data-Scarce Classification Applications
- Authors: Ricardo Knauer, Erik Rodner,
- Abstract summary: We find that L2-regularized logistic regression performs similar to state-of-the-art automated machine learning (AutoML) frameworks.
We recommend to consider logistic regression as the first choice for data-scarce applications.
- Score: 2.663744975320783
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many industry verticals are confronted with small-sized tabular data. In this low-data regime, it is currently unclear whether the best performance can be expected from simple baselines, or more complex machine learning approaches that leverage meta-learning and ensembling. On 44 tabular classification datasets with sample sizes $\leq$ 500, we find that L2-regularized logistic regression performs similar to state-of-the-art automated machine learning (AutoML) frameworks (AutoPrognosis, AutoGluon) and off-the-shelf deep neural networks (TabPFN, HyperFast) on the majority of the benchmark datasets. We therefore recommend to consider logistic regression as the first choice for data-scarce applications with tabular data and provide practitioners with best practices for further method selection.
Related papers
- forester: A Tree-Based AutoML Tool in R [0.0]
The forester is an open-source AutoML package implemented in R for training high-quality tree-based models.
It fully supports binary and multiclass classification, regression, and partially survival analysis tasks.
With just a few functions, the user is capable of detecting issues regarding the data quality, preparing the preprocessing pipeline, training and tuning tree-based models, evaluating the results, and creating the report for further analysis.
arXiv Detail & Related papers (2024-09-07T10:39:10Z) - PMLBmini: A Tabular Classification Benchmark Suite for Data-Scarce Applications [2.3700911865675187]
PMLBmini is a benchmark suite of 44 binary classification datasets with sample sizes $leq$ 500.
We use our suite to thoroughly evaluate current automated machine learning (AutoML) frameworks.
Our analysis reveals that state-of-the-art AutoML and deep learning approaches often fail to appreciably outperform even a simple logistic regression baseline.
arXiv Detail & Related papers (2024-09-03T06:13:03Z) - Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs)
Our proposed method first trains SOMs on unlabeled data and then a minimal number of available labeled data points are assigned to key best matching units (BMU)
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z) - TabRepo: A Large Scale Repository of Tabular Model Evaluations and its AutoML Applications [9.457938949410583]
TabRepo is a new dataset of model evaluations and predictions.
It contains the predictions and metrics of 1310 models evaluated on 200 datasets.
arXiv Detail & Related papers (2023-11-06T09:17:18Z) - Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval [129.25914272977542]
RetoMaton is a weighted finite automaton built on top of the datastore.
Traversing this automaton at inference time, in parallel to the LM inference, reduces its perplexity.
arXiv Detail & Related papers (2022-01-28T21:38:56Z) - Benchmarking Multimodal AutoML for Tabular Data with Text Fields [83.43249184357053]
We assemble 18 multimodal data tables that each contain some text fields.
Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z) - AutoRec: An Automated Recommender System [44.11798716678736]
We present AutoRec, an open-source automated machine learning (AutoML) platform extended from the ecosystem.
AutoRec supports a highly flexible pipeline that accommodates both sparse and dense inputs.
Experiments conducted on the benchmark datasets reveal AutoRec is reliable and can identify models which resemble the best model without prior knowledge.
arXiv Detail & Related papers (2020-06-26T17:04:53Z) - Fast, Accurate, and Simple Models for Tabular Data via Augmented
Distillation [97.42894942391575]
We propose FAST-DAD to distill arbitrarily complex ensemble predictors into individual models like boosted trees, random forests, and deep networks.
Our individual distilled models are over 10x faster and more accurate than ensemble predictors produced by AutoML tools like H2O/AutoSklearn.
arXiv Detail & Related papers (2020-06-25T09:57:47Z) - Auto-PyTorch Tabular: Multi-Fidelity MetaLearning for Efficient and
Robust AutoDL [53.40030379661183]
Auto-PyTorch is a framework to enable fully automated deep learning (AutoDL)
It combines multi-fidelity optimization with portfolio construction for warmstarting and ensembling of deep neural networks (DNNs)
We show that Auto-PyTorch performs better than several state-of-the-art competitors on average.
arXiv Detail & Related papers (2020-06-24T15:15:17Z) - AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data [120.2298620652828]
We introduce AutoGluon-Tabular, an open-source AutoML framework that requires only a single line of Python to train highly accurate machine learning models.
Tests on a suite of 50 classification and regression tasks from Kaggle and the OpenML AutoML Benchmark reveal that AutoGluon is faster, more robust, and much more accurate.
arXiv Detail & Related papers (2020-03-13T23:10:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.