Robustness of AutoML on Dirty Categorical Data
- URL: http://arxiv.org/abs/2602.00412v1
- Date: Sat, 31 Jan 2026 00:05:59 GMT
- Title: Robustness of AutoML on Dirty Categorical Data
- Authors: Marcos L. P. Bueno, Joaquin Vanschoren,
- Abstract summary: The goal of automated machine learning (AutoML) is to reduce trial and error when doing machine learning (ML)<n>Recent research has shown that ML models can benefit from morphological encoders for dirty categorical data, leading to superior predictive performance.<n>We propose a pipeline that transforms categorical data into numerical data so that an AutoML can handle categorical data transformed by more advanced encoding schemes.
- Score: 10.798536038901903
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of automated machine learning (AutoML) is to reduce trial and error when doing machine learning (ML). Although AutoML methods for classification are able to deal with data imperfections, such as outliers, multiple scales and missing data, their behavior is less known on dirty categorical datasets. These datasets often have several categorical features with high cardinality arising from issues such as lack of curation and automated collection. Recent research has shown that ML models can benefit from morphological encoders for dirty categorical data, leading to significantly superior predictive performance. However the effects of using such encoders in AutoML methods are not known at the moment. In this paper, we propose a pipeline that transforms categorical data into numerical data so that an AutoML can handle categorical data transformed by more advanced encoding schemes. We benchmark the current robustness of AutoML methods on a set of dirty datasets and compare it with the proposed pipeline. This allows us to get insight on differences in predictive performance. We also look at the ML pipelines built by AutoMLs in order to gain insight beyond the best model as typically returned by these methods.
Related papers
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z) - The Devil is in the Errors: Leveraging Large Language Models for
Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z) - Large Language Models for Automated Data Science: Introducing CAAFE for
Context-Aware Automated Feature Engineering [52.09178018466104]
We introduce Context-Aware Automated Feature Engineering (CAAFE) to generate semantically meaningful features for datasets.
Despite being methodologically simple, CAAFE improves performance on 11 out of 14 datasets.
We highlight the significance of context-aware solutions that can extend the scope of AutoML systems to semantic AutoML.
arXiv Detail & Related papers (2023-05-05T09:58:40Z) - SubStrat: A Subset-Based Strategy for Faster AutoML [5.833272638548153]
SubStrat is an AutoML optimization strategy that tackles the data size, rather than configuration space.
It wraps existing AutoML tools, and instead of executing them directly on the entire dataset, SubStrat uses a genetic-based algorithm to find a small subset.
It then employs the AutoML tool on the small subset, and finally, it refines the resulted pipeline by executing a restricted, much shorter, AutoML process on the large dataset.
arXiv Detail & Related papers (2022-06-07T07:44:06Z) - Towards Green Automated Machine Learning: Status Quo and Future
Directions [71.86820260846369]
AutoML is being criticised for its high resource consumption.
This paper proposes Green AutoML, a paradigm to make the whole AutoML process more environmentally friendly.
arXiv Detail & Related papers (2021-11-10T18:57:27Z) - Automatic Componentwise Boosting: An Interpretable AutoML System [1.1709030738577393]
We propose an AutoML system that constructs an interpretable additive model that can be fitted using a highly scalable componentwise boosting algorithm.
Our system provides tools for easy model interpretation such as visualizing partial effects and pairwise interactions.
Despite its restriction to an interpretable model space, our system is competitive in terms of predictive performance on most data sets.
arXiv Detail & Related papers (2021-09-12T18:34:33Z) - Man versus Machine: AutoML and Human Experts' Role in Phishing Detection [4.124446337711138]
This paper compares the performances of six well-known, state-of-the-art AutoML frameworks on ten different phishing datasets.
Our results indicate that AutoML-based models are able to outperform manually developed machine learning models in complex classification tasks.
arXiv Detail & Related papers (2021-08-27T09:26:20Z) - VolcanoML: Speeding up End-to-End AutoML via Scalable Search Space
Decomposition [57.06900573003609]
VolcanoML is a framework that decomposes a large AutoML search space into smaller ones.
It supports a Volcano-style execution model, akin to the one supported by modern database systems.
Our evaluation demonstrates that, not only does VolcanoML raise the level of expressiveness for search space decomposition in AutoML, it also leads to actual findings of decomposition strategies.
arXiv Detail & Related papers (2021-07-19T13:23:57Z) - Interpret-able feedback for AutoML systems [5.5524559605452595]
Automated machine learning (AutoML) systems aim to enable training machine learning (ML) models for non-ML experts.
A shortcoming of these systems is that when they fail to produce a model with high accuracy, the user has no path to improve the model.
We introduce an interpretable data feedback solution for AutoML.
arXiv Detail & Related papers (2021-02-22T18:54:26Z) - Fast, Accurate, and Simple Models for Tabular Data via Augmented
Distillation [97.42894942391575]
We propose FAST-DAD to distill arbitrarily complex ensemble predictors into individual models like boosted trees, random forests, and deep networks.
Our individual distilled models are over 10x faster and more accurate than ensemble predictors produced by AutoML tools like H2O/AutoSklearn.
arXiv Detail & Related papers (2020-06-25T09:57:47Z) - Evolution of Scikit-Learn Pipelines with Dynamic Structured Grammatical
Evolution [1.5224436211478214]
This paper describes a novel grammar-based framework that adapts Dynamic Structured Grammatical Evolution (DSGE) to the evolution of Scikit-Learn classification pipelines.
The experimental results include comparing AutoML-DSGE to another grammar-based AutoML framework, Resilient ClassificationPipeline Evolution (RECIPE)
arXiv Detail & Related papers (2020-04-01T09:31:34Z) - AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data [120.2298620652828]
We introduce AutoGluon-Tabular, an open-source AutoML framework that requires only a single line of Python to train highly accurate machine learning models.
Tests on a suite of 50 classification and regression tasks from Kaggle and the OpenML AutoML Benchmark reveal that AutoGluon is faster, more robust, and much more accurate.
arXiv Detail & Related papers (2020-03-13T23:10:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.