Data Debugging with Shapley Importance over End-to-End Machine Learning
Pipelines
- URL: http://arxiv.org/abs/2204.11131v2
- Date: Tue, 26 Apr 2022 19:32:40 GMT
- Title: Data Debugging with Shapley Importance over End-to-End Machine Learning
Pipelines
- Authors: Bojan Karlaš, David Dao, Matteo Interlandi, Bo Li, Sebastian
Schelter, Wentao Wu, Ce Zhang
- Abstract summary: DataScope is the first system that efficiently computes Shapley values of training examples over an end-to-end machine learning pipeline.
Our results show that DataScope is up to four orders of magnitude faster than state-of-the-art Monte Carlo-based methods.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Developing modern machine learning (ML) applications is data-centric;
one fundamental challenge is to understand the influence of data quality
on ML training -- "Which training examples are 'guilty' in making the trained
ML model's predictions inaccurate or unfair?" Modeling data influence for ML
training has attracted intensive interest over the last decade, and one popular
framework is to compute the Shapley value of each training example with respect
to utilities such as validation accuracy and fairness of the trained ML model.
Unfortunately, despite recent intensive interest and research, existing methods
only consider a single ML model "in isolation" and do not consider an
end-to-end ML pipeline that consists of data transformations, feature
extractors, and ML training.
We present DataScope (ease.ml/datascope), the first system that efficiently
computes Shapley values of training examples over an end-to-end ML pipeline,
and illustrate its applications in data debugging for ML training. To this end,
we first develop a novel algorithmic framework that computes Shapley value over
a specific family of ML pipelines that we call canonical pipelines: a positive
relational algebra query followed by a K-nearest-neighbor (KNN) classifier. We
show that, for many subfamilies of canonical pipelines, computing Shapley values
is in PTIME, in contrast to the exponential complexity of computing Shapley
values in general. We then put this into practice -- given a scikit-learn
(sklearn) pipeline, we approximate it with a canonical pipeline to use as a
proxy. We conduct
extensive experiments illustrating different use cases and utilities. Our
results show that DataScope is up to four orders of magnitude faster than
state-of-the-art Monte Carlo-based methods, while being comparably, and often
even more, effective in data debugging.
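The PTIME result above can be illustrated for the simplest canonical pipeline: a bare 1-NN classifier with validation accuracy as the utility. The sketch below is not DataScope's implementation; it is a minimal, self-contained Python illustration of the known closed-form KNN-Shapley recursion (the K=1 case), on hypothetical toy data, showing how the sum over all 2^N subsets collapses to a single sweep over training points sorted by distance:

```python
def exact_1nn_shapley(train, x_val, y_val):
    """Exact Shapley values for a 1-NN utility, computed in O(N log N).

    The utility of a subset S is 1 if the nearest neighbour of x_val in S
    carries label y_val, else 0 (and 0 for the empty set). The recursion
    walks from the farthest point to the nearest; this is the K=1 special
    case of the KNN-Shapley closed form.
    """
    n = len(train)
    # Rank training indices by distance to the validation point.
    order = sorted(range(n), key=lambda i: abs(train[i][0] - x_val))
    match = [1.0 if train[i][1] == y_val else 0.0 for i in order]
    s = [0.0] * n
    s[n - 1] = match[n - 1] / n            # farthest point
    for j in range(n - 2, -1, -1):         # sweep from far to near
        s[j] = s[j + 1] + (match[j] - match[j + 1]) / (j + 1)
    # Undo the sort so values line up with the original indices.
    phi = [0.0] * n
    for rank, i in enumerate(order):
        phi[i] = s[rank]
    return phi

# Toy 1-D training set of (feature, label) pairs -- purely illustrative.
train = [(0.0, 0), (0.25, 0), (0.4, 1), (0.9, 1), (1.0, 1)]
phi = exact_1nn_shapley(train, x_val=0.1, y_val=0)
# phi == [0.5, 0.5, 0.0, 0.0, 0.0]: the two nearby label-0 points share
# all the credit, and the values sum to the full-set utility of 1.0.
```

In a real pipeline the utility would be averaged over many validation points, and DataScope's contribution is to push this kind of polynomial-time computation through the relational data transformations that precede the classifier, rather than treating the model in isolation.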
Related papers
- Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more efficient metric for performance estimation.
We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources.
We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific losses and downstream performance.
arXiv Detail & Related papers (2024-10-11T04:57:48Z)
- Physics Informed Machine Learning (PIML) methods for estimating the remaining useful lifetime (RUL) of aircraft engines [0.0]
This paper aims to use the emerging field of physics informed machine learning (PIML) to develop models for predicting the remaining useful lifetime (RUL) of aircraft engines.
We consider the well-known benchmark NASA Commercial Modular Aero-Propulsion System Simulation System (C-MAPSS) data as the main data for this paper.
C-MAPSS is a well-studied dataset with much existing work in the literature that addresses RUL prediction with classical and deep learning methods.
arXiv Detail & Related papers (2024-06-21T19:55:34Z)
- In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- ezDPS: An Efficient and Zero-Knowledge Machine Learning Inference Pipeline [2.0813318162800707]
We propose ezDPS, a new efficient and zero-knowledge Machine Learning inference scheme.
ezDPS is a zkML pipeline in which the data is processed in multiple stages for high accuracy.
We show that ezDPS is one to three orders of magnitude more efficient than the generic circuit-based approach in all metrics.
arXiv Detail & Related papers (2022-12-11T06:47:28Z)
- Pushing the Limits of Simple Pipelines for Few-Shot Learning: External Data and Fine-Tuning Make a Difference [74.80730361332711]
Few-shot learning is an important and topical problem in computer vision.
We show that a simple transformer-based pipeline yields surprisingly good performance on standard benchmarks.
arXiv Detail & Related papers (2022-04-15T02:55:58Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches, however, do not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Exploring Opportunistic Meta-knowledge to Reduce Search Spaces for Automated Machine Learning [8.325359814939517]
This paper investigates whether, based on previous experience, a pool of available classifiers/regressors can be preemptively culled ahead of initiating a pipeline composition/optimisation process.
arXiv Detail & Related papers (2021-05-01T15:25:30Z)
- AutoWeka4MCPS-AVATAR: Accelerating Automated Machine Learning Pipeline Composition and Optimisation [13.116806430326513]
We propose a novel method to evaluate the validity of ML pipelines, without their execution, using a surrogate model (AVATAR).
The AVATAR generates a knowledge base by automatically learning the capabilities and effects of ML algorithms on datasets' characteristics.
Instead of executing the original ML pipeline to evaluate its validity, the AVATAR evaluates its surrogate model constructed by capabilities and effects of the ML pipeline components.
arXiv Detail & Related papers (2020-11-21T14:05:49Z)
- A Rigorous Machine Learning Analysis Pipeline for Biomedical Binary Classification: Application in Pancreatic Cancer Nested Case-control Studies with Implications for Bias Assessments [2.9726886415710276]
We have laid out and assembled a complete, rigorous ML analysis pipeline focused on binary classification.
This 'automated' but customizable pipeline includes a) exploratory analysis, b) data cleaning and transformation, c) feature selection, d) model training with 9 established ML algorithms.
We apply this pipeline to an epidemiological investigation of established and newly identified risk factors for cancer to evaluate how different sources of bias might be handled by ML algorithms.
arXiv Detail & Related papers (2020-08-28T19:58:05Z)
- A Survey on Large-scale Machine Learning [67.6997613600942]
Machine learning can provide deep insights into data, allowing machines to make high-quality predictions.
Most sophisticated machine learning approaches suffer from huge time costs when operating on large-scale data.
Large-scale machine learning aims to learn patterns from big data efficiently while maintaining comparable performance.
arXiv Detail & Related papers (2020-08-10T06:07:52Z)
- Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program given in the IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.