Pre-Hoc Predictions in AutoML: Leveraging LLMs to Enhance Model Selection and Benchmarking for Tabular datasets
- URL: http://arxiv.org/abs/2510.01842v1
- Date: Thu, 02 Oct 2025 09:37:12 GMT
- Title: Pre-Hoc Predictions in AutoML: Leveraging LLMs to Enhance Model Selection and Benchmarking for Tabular datasets
- Authors: Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, Sachin Sharma, John D. Kelleher,
- Abstract summary: This paper explores the intersection of AutoML and pre-hoc model selection.<n>By leveraging traditional models and Large Language Model (LLM) agents, we reduce the AutoML search space.<n>The proposed approach offers a shift in AutoML, significantly reducing computational overhead, while still selecting the best model for the given dataset.
- Score: 4.960830501953278
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The field of AutoML has made remarkable progress in post-hoc model selection, with libraries capable of automatically identifying the most performing models for a given dataset. Nevertheless, these methods often rely on exhaustive hyperparameter searches, where methods automatically train and test different types of models on the target dataset. Contrastingly, pre-hoc prediction emerges as a promising alternative, capable of bypassing exhaustive search through intelligent pre-selection of models. Despite its potential, pre-hoc prediction remains under-explored in the literature. This paper explores the intersection of AutoML and pre-hoc model selection by leveraging traditional models and Large Language Model (LLM) agents to reduce the search space of AutoML libraries. By relying on dataset descriptions and statistical information, we reduce the AutoML search space. Our methodology is applied to the AWS AutoGluon portfolio dataset, a state-of-the-art AutoML benchmark containing 175 tabular classification datasets available on OpenML. The proposed approach offers a shift in AutoML workflows, significantly reducing computational overhead, while still selecting the best model for the given dataset.
Related papers
- Robustness of AutoML on Dirty Categorical Data [10.798536038901903]
The goal of automated machine learning (AutoML) is to reduce trial and error when doing machine learning (ML)<n>Recent research has shown that ML models can benefit from morphological encoders for dirty categorical data, leading to superior predictive performance.<n>We propose a pipeline that transforms categorical data into numerical data so that an AutoML can handle categorical data transformed by more advanced encoding schemes.
arXiv Detail & Related papers (2026-01-31T00:05:59Z) - Empowering Time Series Forecasting with LLM-Agents [23.937463131291974]
We propose DCATS, a Data-Centric Agent for Time Series.<n>We evaluate DCATS using four time series forecasting models on a large-scale traffic volume forecasting dataset.
arXiv Detail & Related papers (2025-08-06T09:14:08Z) - Efficient Model Selection for Time Series Forecasting via LLMs [52.31535714387368]
We propose to leverage Large Language Models (LLMs) as a lightweight alternative for model selection.<n>Our method eliminates the need for explicit performance matrices by utilizing the inherent knowledge and reasoning capabilities of LLMs.
arXiv Detail & Related papers (2025-04-02T20:33:27Z) - AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML [56.565200973244146]
Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline.<n>Recent works have started exploiting large language models (LLM) to lessen such burden.<n>This paper proposes AutoML-Agent, a novel multi-agent framework tailored for full-pipeline AutoML.
arXiv Detail & Related papers (2024-10-03T20:01:09Z) - LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science.<n>Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
arXiv Detail & Related papers (2024-07-02T22:23:40Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z) - Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval [129.25914272977542]
RetoMaton is a weighted finite automaton built on top of the datastore.
Traversing this automaton at inference time, in parallel to the LM inference, reduces its perplexity.
arXiv Detail & Related papers (2022-01-28T21:38:56Z) - Automatic Componentwise Boosting: An Interpretable AutoML System [1.1709030738577393]
We propose an AutoML system that constructs an interpretable additive model that can be fitted using a highly scalable componentwise boosting algorithm.
Our system provides tools for easy model interpretation such as visualizing partial effects and pairwise interactions.
Despite its restriction to an interpretable model space, our system is competitive in terms of predictive performance on most data sets.
arXiv Detail & Related papers (2021-09-12T18:34:33Z) - Model LineUpper: Supporting Interactive Model Comparison at Multiple
Levels for AutoML [29.04776652873194]
In current AutoML systems, selection is supported only by performance metrics.
We develop tool to support interactive model comparison for AutoML by integrating multiple Explainable AI (XAI) and visualization techniques.
arXiv Detail & Related papers (2021-04-09T14:06:13Z) - Interpret-able feedback for AutoML systems [5.5524559605452595]
Automated machine learning (AutoML) systems aim to enable training machine learning (ML) models for non-ML experts.
A shortcoming of these systems is that when they fail to produce a model with high accuracy, the user has no path to improve the model.
We introduce an interpretable data feedback solution for AutoML.
arXiv Detail & Related papers (2021-02-22T18:54:26Z) - AutoRec: An Automated Recommender System [44.11798716678736]
We present AutoRec, an open-source automated machine learning (AutoML) platform extended from the ecosystem.
AutoRec supports a highly flexible pipeline that accommodates both sparse and dense inputs.
Experiments conducted on the benchmark datasets reveal AutoRec is reliable and can identify models which resemble the best model without prior knowledge.
arXiv Detail & Related papers (2020-06-26T17:04:53Z) - AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data [120.2298620652828]
We introduce AutoGluon-Tabular, an open-source AutoML framework that requires only a single line of Python to train highly accurate machine learning models.
Tests on a suite of 50 classification and regression tasks from Kaggle and the OpenML AutoML Benchmark reveal that AutoGluon is faster, more robust, and much more accurate.
arXiv Detail & Related papers (2020-03-13T23:10:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.