AutoSlicer: Scalable Automated Data Slicing for ML Model Analysis
- URL: http://arxiv.org/abs/2212.09032v1
- Date: Sun, 18 Dec 2022 07:49:17 GMT
- Title: AutoSlicer: Scalable Automated Data Slicing for ML Model Analysis
- Authors: Zifan Liu, Evan Rosen, and Paul Suganthan G. C.
- Abstract summary: We present AutoSlicer, a scalable system that searches for problematic slices through distributed metric computation and hypothesis testing.
In the experiments, we show that our search strategy finds most of the anomalous slices by inspecting a small portion of the search space.
- Score: 3.3446830960153555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated slicing aims to identify subsets of evaluation data where a trained
model performs anomalously. This is an important problem for machine learning
pipelines in production since it plays a key role in model debugging and
comparison, as well as the diagnosis of fairness issues. Scalability has become
a critical requirement for any automated slicing system due to the large search
space of possible slices and the growing scale of data. We present AutoSlicer,
a scalable system that searches for problematic slices through distributed
metric computation and hypothesis testing. We develop an efficient strategy
that reduces the search space through pruning and prioritization. In the
experiments, we show that our search strategy finds most of the anomalous
slices by inspecting a small portion of the search space.
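
The idea in the abstract (compute a metric per slice, test whether the slice differs significantly from the rest of the data, prune and prioritize candidates) can be illustrated with a small single-machine sketch. This is not the AutoSlicer implementation: it is not distributed, it only considers single-feature slices, and it substitutes a Welch t-statistic with a fixed threshold for the hypothesis test. The names `find_problematic_slices`, `min_size`, and `t_threshold` are hypothetical and chosen only for illustration.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic; positive when mean(a) exceeds mean(b)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def find_problematic_slices(rows, losses, features, min_size=5, t_threshold=2.0):
    """Return (feature, value, t) for slices whose mean loss is notably higher
    than the loss on the remaining rows. Illustrative sketch only."""
    results = []
    for f in features:
        for v in sorted({r[f] for r in rows}):
            inside = [l for r, l in zip(rows, losses) if r[f] == v]
            outside = [l for r, l in zip(rows, losses) if r[f] != v]
            # Pruning (assumed rule): skip slices too small to test reliably.
            if len(inside) < min_size or len(outside) < min_size:
                continue
            t = welch_t(inside, outside)
            if t > t_threshold:
                results.append((f, v, t))
    # Prioritization (assumed rule): report the strongest deviations first.
    results.sort(key=lambda item: -item[2])
    return results


# Synthetic evaluation data: rows with country == "DE" get a much higher loss.
rows = [
    {"country": "DE" if i % 4 == 0 else "US",
     "device": "mobile" if i % 2 == 0 else "desktop"}
    for i in range(40)
]
losses = [(1.0 if i % 4 == 0 else 0.1) + 0.01 * (i % 3) for i in range(40)]

results = find_problematic_slices(rows, losses, ["country", "device"])
for feature, value, t in results:
    print(f"{feature}={value}: t={t:.1f}")
```

In this toy data the `device=mobile` slice is also flagged because it contains every `country=DE` row, which shows why ranking the candidates matters when slices overlap.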
Related papers
- GEqO: ML-Accelerated Semantic Equivalence Detection [3.5521901508676774]
Common computation is crucial for efficient cluster resource utilization and reducing job execution time.
Detecting equivalence on large-scale analytics engines requires efficient and scalable solutions that are fully automated.
We propose GEqO, a portable and lightweight machine-learning-based framework for efficiently identifying semantically equivalent computations at scale.
arXiv Detail & Related papers (2024-01-02T16:37:42Z)
- Auto-FP: An Experimental Study of Automated Feature Preprocessing for Tabular Data [10.740391800262685]
Feature preprocessing is a crucial step to ensure good model quality.
Due to the large search space, a brute-force solution is prohibitively expensive.
We extend a variety of HPO and NAS algorithms to solve the Auto-FP problem.
arXiv Detail & Related papers (2023-10-04T02:46:44Z) - OutRank: Speeding up AutoML-based Model Search for Large Sparse Data
sets with Cardinality-aware Feature Ranking [0.0]
We introduce OutRank, a system for versatile feature ranking and data quality-related anomaly detection.
The proposed approach enables exploration of up to 300% larger feature spaces compared to AutoML-only approaches.
arXiv Detail & Related papers (2023-09-04T12:07:20Z)
- DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data [12.416345241511781]
We propose DiffPrep to automatically and efficiently search for a data preprocessing pipeline for a given dataset.
Our experiments show that DiffPrep achieves the best test accuracy on 15 out of the 18 real-world datasets evaluated.
arXiv Detail & Related papers (2023-08-20T23:40:26Z)
- Towards Personalized Preprocessing Pipeline Search [52.59156206880384]
ClusterP3S is a novel framework for Personalized Preprocessing Pipeline Search via Clustering.
We propose a hierarchical search strategy to jointly learn the clusters and search for the optimal pipelines.
Experiments on benchmark classification datasets demonstrate the effectiveness of enabling feature-wise preprocessing pipeline search.
arXiv Detail & Related papers (2023-02-28T05:45:05Z)
- Unified Functional Hashing in Automatic Machine Learning [58.77232199682271]
We show that large efficiency gains can be obtained by employing a fast unified functional hash.
Our hash is "functional" in that it identifies equivalent candidates even if they were represented or coded differently.
We show dramatic improvements on multiple AutoML domains, including neural architecture search and algorithm discovery.
arXiv Detail & Related papers (2023-02-10T18:50:37Z)
- Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method, by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
arXiv Detail & Related papers (2020-12-29T04:08:38Z)
- Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
- AutoOD: Automated Outlier Detection via Curiosity-guided Search and Self-imitation Learning [72.99415402575886]
Outlier detection is an important data mining task with numerous practical applications.
We propose AutoOD, an automated outlier detection framework, which aims to search for an optimal neural network model.
Experimental results on various real-world benchmark datasets demonstrate that the deep model identified by AutoOD achieves the best performance.
arXiv Detail & Related papers (2020-06-19T18:57:51Z)
- PyODDS: An End-to-end Outlier Detection System with Automated Machine Learning [55.32009000204512]
We present PyODDS, an automated end-to-end Python system for Outlier Detection with Database Support.
Specifically, we define the search space in the outlier detection pipeline, and produce a search strategy within the given search space.
It also provides unified interfaces and visualizations for users with or without data science or machine learning background.
arXiv Detail & Related papers (2020-03-12T03:30:30Z)