Related papers: AutoSlicer: Scalable Automated Data Slicing for ML Model Analysis

AutoSlicer: Scalable Automated Data Slicing for ML Model Analysis

URL: http://arxiv.org/abs/2212.09032v1
Date: Sun, 18 Dec 2022 07:49:17 GMT
Title: AutoSlicer: Scalable Automated Data Slicing for ML Model Analysis
Authors: Zifan Liu and Evan Rosen and Paul Suganthan G. C
Abstract summary: We present Autoslicer, a scalable system that searches for problematic slices through distributed metric computation and hypothesis testing. In the experiments, we show that our search strategy finds most of the anomalous slices by inspecting a small portion of the search space.
Score: 3.3446830960153555
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automated slicing aims to identify subsets of evaluation data where a trained model performs anomalously. This is an important problem for machine learning pipelines in production since it plays a key role in model debugging and comparison, as well as the diagnosis of fairness issues. Scalability has become a critical requirement for any automated slicing system due to the large search space of possible slices and the growing scale of data. We present Autoslicer, a scalable system that searches for problematic slices through distributed metric computation and hypothesis testing. We develop an efficient strategy that reduces the search space through pruning and prioritization. In the experiments, we show that our search strategy finds most of the anomalous slices by inspecting a small portion of the search space.

Related papers

GEqO: ML-Accelerated Semantic Equivalence Detection [3.5521901508676774]
Common computation is crucial for efficient cluster resource utilization and reducing job execution time. detecting equivalence on large-scale analytics engines requires efficient and scalable solutions that are fully automated. We propose GEqO, a portable and lightweight machine-learning-based framework for efficiently identifying semantically equivalent computations at scale.
arXiv Detail & Related papers (2024-01-02T16:37:42Z)
Auto-FP: An Experimental Study of Automated Feature Preprocessing for Tabular Data [10.740391800262685]
Feature preprocessing is a crucial step to ensure good model quality. Due to the large search space, a brute-force solution is prohibitively expensive. We extend a variety of HPO and NAS algorithms to solve the Auto-FP problem.
arXiv Detail & Related papers (2023-10-04T02:46:44Z)
OutRank: Speeding up AutoML-based Model Search for Large Sparse Data sets with Cardinality-aware Feature Ranking [0.0]
We introduce OutRank, a system for versatile feature ranking and data quality-related anomaly detection. The proposed approach enables exploration of up to 300% larger feature spaces compared to AutoML-only approaches.
arXiv Detail & Related papers (2023-09-04T12:07:20Z)
DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data [12.416345241511781]
We propose DiffPrep to automatically and efficiently search for a data preprocessing pipeline for a given dataset. Our experiments show that DiffPrep achieves the best test accuracy on 15 out of the 18 real-world datasets evaluated.
arXiv Detail & Related papers (2023-08-20T23:40:26Z)
Towards Personalized Preprocessing Pipeline Search [52.59156206880384]
ClusterP3S is a novel framework for Personalized Preprocessing Pipeline Search via Clustering. We propose a hierarchical search strategy to jointly learn the clusters and search for the optimal pipelines. Experiments on benchmark classification datasets demonstrate the effectiveness of enabling feature-wise preprocessing pipeline search.
arXiv Detail & Related papers (2023-02-28T05:45:05Z)
Unified Functional Hashing in Automatic Machine Learning [58.77232199682271]
We show that large efficiency gains can be obtained by employing a fast unified functional hash. Our hash is "functional" in that it identifies equivalent candidates even if they were represented or coded differently. We show dramatic improvements on multiple AutoML domains, including neural architecture search and algorithm discovery.
arXiv Detail & Related papers (2023-02-10T18:50:37Z)
Scale-aware Automatic Augmentation for Object Detection [63.087930708444695]
We propose Scale-aware AutoAug to learn data augmentation policies for object detection. In experiments, Scale-aware AutoAug yields significant and consistent improvement on various object detectors.
arXiv Detail & Related papers (2021-03-31T17:11:14Z)
Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method, by combining reconstruction error with $l_2,p$-norm regularization. We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
arXiv Detail & Related papers (2020-12-29T04:08:38Z)
Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles. Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center. We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes. A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
AutoOD: Automated Outlier Detection via Curiosity-guided Search and Self-imitation Learning [72.99415402575886]
Outlier detection is an important data mining task with numerous practical applications. We propose AutoOD, an automated outlier detection framework, which aims to search for an optimal neural network model. Experimental results on various real-world benchmark datasets demonstrate that the deep model identified by AutoOD achieves the best performance.
arXiv Detail & Related papers (2020-06-19T18:57:51Z)
PyODDS: An End-to-end Outlier Detection System with Automated Machine Learning [55.32009000204512]
We present PyODDS, an automated end-to-end Python system for Outlier Detection with Database Support. Specifically, we define the search space in the outlier detection pipeline, and produce a search strategy within the given search space. It also provides unified interfaces and visualizations for users with or without data science or machine learning background.
arXiv Detail & Related papers (2020-03-12T03:30:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.