Auto-FP: An Experimental Study of Automated Feature Preprocessing for
Tabular Data
- URL: http://arxiv.org/abs/2310.02540v1
- Date: Wed, 4 Oct 2023 02:46:44 GMT
- Title: Auto-FP: An Experimental Study of Automated Feature Preprocessing for
Tabular Data
- Authors: Danrui Qi and Jinglin Peng and Yongjun He and Jiannan Wang
- Abstract summary: Feature preprocessing is a crucial step to ensure good model quality.
Due to the large search space, a brute-force solution is prohibitively expensive.
We extend a variety of HPO and NAS algorithms to solve the Auto-FP problem.
- Score: 10.740391800262685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Classical machine learning models, such as linear models and tree-based
models, are widely used in industry. These models are sensitive to data
distribution, thus feature preprocessing, which transforms features from one
distribution to another, is a crucial step to ensure good model quality.
Manually constructing a feature preprocessing pipeline is challenging because
data scientists need to make difficult decisions about which preprocessors to
select and in which order to compose them. In this paper, we study how to
automate feature preprocessing (Auto-FP) for tabular data. Due to the large
search space, a brute-force solution is prohibitively expensive. To address
this challenge, we interestingly observe that Auto-FP can be modelled as either
a hyperparameter optimization (HPO) or a neural architecture search (NAS)
problem. This observation enables us to extend a variety of HPO and NAS
algorithms to solve the Auto-FP problem. We conduct a comprehensive evaluation
and analysis of 15 algorithms on 45 public ML datasets. Overall,
evolution-based algorithms show the leading average ranking. Surprisingly, the
random search turns out to be a strong baseline. Many surrogate-model-based and
bandit-based search algorithms, which achieve good performance for HPO and NAS,
do not outperform random search for Auto-FP. We analyze the reasons for our
findings and conduct a bottleneck analysis to identify the opportunities to
improve these algorithms. Furthermore, we explore how to extend Auto-FP to
support parameter search and compare two ways to achieve this goal. In the end,
we evaluate Auto-FP in an AutoML context and discuss the limitations of popular
AutoML tools. To the best of our knowledge, this is the first study on
automated feature preprocessing. We hope our work can inspire researchers to
develop new algorithms tailored for Auto-FP.
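The abstract's two central points, that the search space of ordered preprocessor pipelines is too large for brute force and that random search is a strong baseline, can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration and not taken from the paper: the preprocessor names, the maximum pipeline length, the sampling scheme, and the `toy_score` objective, which stands in for training and validating a downstream model on the transformed data.

```python
import random
from typing import Callable, Sequence, Tuple

Pipeline = Tuple[str, ...]


def search_space_size(n_preprocessors: int, max_len: int) -> int:
    """Count ordered pipelines of length 1..max_len, with repetition allowed."""
    return sum(n_preprocessors ** k for k in range(1, max_len + 1))


def random_search(
    preprocessors: Sequence[str],
    evaluate: Callable[[Pipeline], float],
    budget: int,
    max_len: int,
    seed: int = 0,
) -> Tuple[Pipeline, float]:
    """Sample `budget` random pipelines and return the best-scoring one."""
    rng = random.Random(seed)
    best_pipeline: Pipeline = ()
    best_score = float("-inf")
    for _ in range(budget):
        # Pick a length first, then fill each slot independently.
        # (This is not uniform over the whole space; it is one simple choice.)
        length = rng.randint(1, max_len)
        candidate = tuple(rng.choice(preprocessors) for _ in range(length))
        score = evaluate(candidate)
        if score > best_score:
            best_pipeline, best_score = candidate, score
    return best_pipeline, best_score


# Hypothetical preprocessor names, used only as labels in this sketch.
PREPROCESSORS = ["standardize", "min_max", "binarize", "normalize"]


def toy_score(pipeline: Pipeline) -> float:
    """Placeholder objective: favors short pipelines that start by standardizing.

    A real objective would apply the pipeline to the training data, fit a
    downstream model, and return its validation accuracy.
    """
    bonus = 1.0 if pipeline[0] == "standardize" else 0.0
    return bonus - 0.1 * len(pipeline)


if __name__ == "__main__":
    print("pipelines to enumerate:", search_space_size(len(PREPROCESSORS), 3))
    print("random search result:", random_search(PREPROCESSORS, toy_score, 200, 3))
```

The count grows geometrically with the maximum pipeline length, which is why exhaustive enumeration stops being practical once each evaluation involves training a model. Swapping `toy_score` for a real train-and-validate function is the only change needed to try the same loop on actual data.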
Related papers
- AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving [68.73885845181242]
We propose an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios.
We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
arXiv Detail & Related papers (2024-03-26T04:27:56Z)
- AutoFT: Learning an Objective for Robust Fine-Tuning [60.641186718253735]
Foundation models encode rich representations that can be adapted to downstream tasks by fine-tuning.
Current approaches to robust fine-tuning use hand-crafted regularization techniques.
We propose AutoFT, a data-driven approach for robust fine-tuning.
arXiv Detail & Related papers (2024-01-18T18:58:49Z)
- AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning [54.47116888545878]
AutoAct is an automatic agent learning framework for QA.
It does not rely on large-scale annotated data and synthetic planning trajectories from closed-source models.
arXiv Detail & Related papers (2024-01-10T16:57:24Z)
- OutRank: Speeding up AutoML-based Model Search for Large Sparse Data
sets with Cardinality-aware Feature Ranking [0.0]
We introduce OutRank, a system for versatile feature ranking and data quality-related anomaly detection.
The proposed approach enables exploration of up to 300% larger feature spaces compared to AutoML-only approaches.
arXiv Detail & Related papers (2023-09-04T12:07:20Z)
- AutoEn: An AutoML method based on ensembles of predefined Machine
Learning pipelines for supervised Traffic Forecasting [1.6242924916178283]
Traffic Forecasting (TF) is gaining relevance due to its ability to mitigate traffic congestion by forecasting future traffic states.
TF poses one big challenge to the Machine Learning paradigm, known as the Model Selection Problem (MSP).
We introduce AutoEn, which is a simple and efficient method for automatically generating multi-classifier ensembles from a predefined set of ML pipelines.
arXiv Detail & Related papers (2023-03-19T18:37:18Z)
- AutoSlicer: Scalable Automated Data Slicing for ML Model Analysis [3.3446830960153555]
We present AutoSlicer, a scalable system that searches for problematic slices through distributed metric computation and hypothesis testing.
In the experiments, we show that our search strategy finds most of the anomalous slices by inspecting a small portion of the search space.
arXiv Detail & Related papers (2022-12-18T07:49:17Z)
- A new Sparse Auto-encoder based Framework using Grey Wolf Optimizer for
Data Classification Problem [0.0]
Gray wolf optimization (GWO) is applied to train sparse auto-encoders.
The model is validated on several popular gene expression databases.
Results reveal that the model trained using GWO outperforms both conventional models and models trained with the most popular metaheuristic algorithms.
arXiv Detail & Related papers (2022-01-29T04:28:30Z)
- Resource-Aware Pareto-Optimal Automated Machine Learning Platform [1.6746303554275583]
We introduce a novel platform, Resource-Aware AutoML (RA-AutoML).
RA-AutoML enables flexible and generalized algorithms to build machine learning models subjected to multiple objectives.
arXiv Detail & Related papers (2020-10-30T19:37:48Z)
- Auto-PyTorch Tabular: Multi-Fidelity MetaLearning for Efficient and
Robust AutoDL [53.40030379661183]
Auto-PyTorch is a framework to enable fully automated deep learning (AutoDL).
It combines multi-fidelity optimization with portfolio construction for warmstarting and ensembling of deep neural networks (DNNs).
We show that Auto-PyTorch performs better than several state-of-the-art competitors on average.
arXiv Detail & Related papers (2020-06-24T15:15:17Z)
- AutoFIS: Automatic Feature Interaction Selection in Factorization Models
for Click-Through Rate Prediction [75.16836697734995]
We propose a two-stage algorithm called Automatic Feature Interaction Selection (AutoFIS).
AutoFIS can automatically identify important feature interactions for factorization models with computational cost just equivalent to training the target model to convergence.
AutoFIS has been deployed onto the training platform of Huawei App Store recommendation service.
arXiv Detail & Related papers (2020-03-25T06:53:54Z)
- AutoML-Zero: Evolving Machine Learning Algorithms From Scratch [76.83052807776276]
We show that it is possible to automatically discover complete machine learning algorithms just using basic mathematical operations as building blocks.
We demonstrate this by introducing a novel framework that significantly reduces human bias through a generic search space.
We believe these preliminary successes in discovering machine learning algorithms from scratch indicate a promising new direction in the field.
arXiv Detail & Related papers (2020-03-06T19:00:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.