Model Assertions for Monitoring and Improving ML Models
- URL: http://arxiv.org/abs/2003.01668v3
- Date: Wed, 11 Mar 2020 23:30:56 GMT
- Title: Model Assertions for Monitoring and Improving ML Models
- Authors: Daniel Kang, Deepti Raghavan, Peter Bailis, Matei Zaharia
- Abstract summary: We propose a new abstraction, model assertions, that adapts the classical use of program assertions as a way to monitor and improve ML models.
Model assertions are arbitrary functions over a model's input and output that indicate when errors may be occurring.
We propose methods of using model assertions at all stages of ML system deployment, including runtime monitoring, validating labels, and continuously improving ML models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: ML models are increasingly deployed in settings with real world interactions
such as vehicles, but unfortunately, these models can fail in systematic ways.
To prevent errors, ML engineering teams monitor and continuously improve these
models. We propose a new abstraction, model assertions, that adapts the
classical use of program assertions as a way to monitor and improve ML models.
Model assertions are arbitrary functions over a model's input and output that
indicate when errors may be occurring, e.g., a function that triggers if an
object rapidly changes its class in a video. We propose methods of using model
assertions at all stages of ML system deployment, including runtime monitoring,
validating labels, and continuously improving ML models. For runtime
monitoring, we show that model assertions can find high confidence errors,
where a model returns the wrong output with high confidence, which
uncertainty-based monitoring techniques would not detect. For training, we
propose two methods of using model assertions. First, we propose a bandit-based
active learning algorithm that can sample from data flagged by assertions and
show that it can reduce labeling costs by up to 40% over traditional
uncertainty-based methods. Second, we propose an API for generating
"consistency assertions" (e.g., the class change example) and weak labels for
inputs where the consistency assertions fail, and show that these weak labels
can improve relative model quality by up to 46%. We evaluate model assertions
on four real-world tasks with video, LIDAR, and ECG data.
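As a minimal sketch of the abstraction (not the authors' actual API; the detection schema here is hypothetical), the class-change example can be written as an ordinary function over a model's outputs that returns the indices it flags:

```python
from typing import Dict, List

def class_change_assertion(detections: List[dict]) -> List[int]:
    """Model assertion: flag frames where a tracked object's predicted
    class changes, which a correct detector should rarely produce.
    Each detection is a dict like {"track_id": int, "cls": str}, one
    per frame. Returns the indices of flagged detections."""
    flagged = []
    last_cls: Dict[int, str] = {}
    for i, det in enumerate(detections):
        tid, cls = det["track_id"], det["cls"]
        if tid in last_cls and last_cls[tid] != cls:
            flagged.append(i)  # possible high-confidence error
        last_cls[tid] = cls
    return flagged

# A track whose class flickers car -> truck -> car trips the assertion.
track = [
    {"track_id": 7, "cls": "car"},
    {"track_id": 7, "cls": "truck"},
    {"track_id": 7, "cls": "car"},
]
print(class_change_assertion(track))  # [1, 2]
```

In the paper's framing, flagged indices could then be routed to labelers by the bandit-based active learner, or resolved into weak labels (e.g., the majority class over the track) for retraining.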
Related papers
- DECIDER: Leveraging Foundation Model Priors for Improved Model Failure Detection and Explanation [18.77296551727931]
We propose DECIDER, a novel approach that leverages priors from large language models (LLMs) and vision-language models (VLMs) to detect failures in image models.
DECIDER consistently achieves state-of-the-art failure detection performance, significantly outperforming baselines in terms of the overall Matthews correlation coefficient.
arXiv Detail & Related papers (2024-08-01T07:08:11Z)
- Small Language Models Need Strong Verifiers to Self-Correct Reasoning [69.94251699982388]
Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs).
This work explores whether small (≤ 13B) language models (LMs) have the ability to self-correct on reasoning tasks with minimal input from stronger LMs.
arXiv Detail & Related papers (2024-04-26T03:41:28Z)
- AI Model Disgorgement: Methods and Choices [127.54319351058167]
We introduce a taxonomy of possible disgorgement methods that are applicable to modern machine learning systems.
We investigate the meaning of "removing the effects" of data in the trained model in a way that does not require retraining from scratch.
arXiv Detail & Related papers (2023-04-07T08:50:18Z)
- Monitoring Model Deterioration with Explainable Uncertainty Estimation via Non-parametric Bootstrap [0.0]
Monitoring machine learning models once they are deployed is challenging.
It is even more challenging to decide when to retrain models in real-world scenarios where labeled data is out of reach.
In this work, we use non-parametric bootstrapped uncertainty estimates and SHAP values to provide explainable uncertainty estimation.
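The non-parametric bootstrap behind label-free uncertainty monitoring can be sketched as follows (a toy illustration with plain least squares as the model; all names are illustrative, and the SHAP-based explanation step is omitted):

```python
import numpy as np

def bootstrap_prediction(X, y, x_new, n_boot=200, seed=0):
    """Non-parametric bootstrap: refit a least-squares model on
    resampled (X, y) pairs and return the mean and spread of the
    predictions at x_new. The spread acts as an uncertainty
    estimate that needs no labels at monitoring time."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Xb = np.column_stack([np.ones(n), X])   # add intercept column
    xq = np.concatenate([[1.0], np.atleast_1d(x_new)])
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)    # resample with replacement
        beta, *_ = np.linalg.lstsq(Xb[idx], y[idx], rcond=None)
        preds.append(float(xq @ beta))
    preds = np.asarray(preds)
    return preds.mean(), preds.std()

rng = np.random.default_rng(1)
X = np.linspace(0.0, 1.0, 200)
y = 2.0 * X + rng.normal(scale=0.1, size=X.size)
mean, std = bootstrap_prediction(X, y, 0.5)
```

A rising bootstrap spread on incoming production inputs would then signal drift and a possible need to retrain.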
arXiv Detail & Related papers (2022-01-27T17:23:04Z)
- Scanflow: A multi-graph framework for Machine Learning workflow management, supervision, and debugging [0.0]
We propose a novel containerized directed graph framework to support end-to-end Machine Learning workflow management.
The framework allows defining and deploying ML in containers, tracking their metadata, checking their behavior in production, and improving the models by using both learned and human-provided knowledge.
arXiv Detail & Related papers (2021-11-04T17:01:12Z)
- Mismatched No More: Joint Model-Policy Optimization for Model-Based RL [172.37829823752364]
We propose a single objective for jointly training the model and the policy, such that updates to either component increase a lower bound on expected return.
Our objective is a global lower bound on expected return, and this bound becomes tight under certain assumptions.
The resulting algorithm (MnM) is conceptually similar to a GAN.
arXiv Detail & Related papers (2021-10-06T13:43:27Z)
- Regression Bugs Are In Your Model! Measuring, Reducing and Analyzing Regressions In NLP Model Updates [68.09049111171862]
This work focuses on quantifying, reducing, and analyzing regression errors in NLP model updates.
We formulate regression-free model updates as a constrained optimization problem.
We empirically analyze how model ensemble reduces regression.
arXiv Detail & Related papers (2021-05-07T03:33:00Z)
- Defuse: Harnessing Unrestricted Adversarial Examples for Debugging Models Beyond Test Accuracy [11.265020351747916]
Defuse is a method to automatically discover and correct model errors beyond those available in test data.
We propose an algorithm inspired by adversarial machine learning techniques that uses a generative model to find naturally occurring instances misclassified by a model.
Defuse corrects the error after fine-tuning while maintaining generalization on the test set.
arXiv Detail & Related papers (2021-02-11T18:08:42Z)
- Probing Model Signal-Awareness via Prediction-Preserving Input Minimization [67.62847721118142]
We evaluate models' ability to capture the correct vulnerability signals to produce their predictions.
We measure the signal awareness of models using a new metric we propose, Signal-aware Recall (SAR).
The results show a sharp drop in the model's Recall from the high 90s to sub-60s with the new metric.
arXiv Detail & Related papers (2020-11-25T20:05:23Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
- Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods [24.190587751595455]
Weak supervision is a popular method for building machine learning models without relying on ground truth annotations.
Existing approaches use latent variable estimation to model the noisy sources.
We show that for a class of latent variable models highly applicable to weak supervision, we can find a closed-form solution to model parameters.
We use this insight to build FlyingSquid, a weak supervision framework that runs orders of magnitude faster than previous weak supervision approaches.
arXiv Detail & Related papers (2020-02-27T07:51:50Z)
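The closed-form idea can be illustrated in the simplest setting of three conditionally independent binary sources (a sketch of the triplet trick, not FlyingSquid's actual implementation):

```python
import numpy as np

def triplet_accuracies(L):
    """Closed-form source accuracies from agreement rates alone.
    For binary sources l_i in {-1, +1} that are conditionally
    independent given the true label y, E[l_i * l_j] = a_i * a_j,
    where a_i = E[l_i * y]. Each a_i is therefore recoverable
    without ever observing y:
        a_1 = sqrt(|E[l1 l2] * E[l1 l3] / E[l2 l3]|), etc.
    L has shape (n_examples, 3), one column per source."""
    m12 = np.mean(L[:, 0] * L[:, 1])
    m13 = np.mean(L[:, 0] * L[:, 2])
    m23 = np.mean(L[:, 1] * L[:, 2])
    a1 = np.sqrt(abs(m12 * m13 / m23))
    a2 = np.sqrt(abs(m12 * m23 / m13))
    a3 = np.sqrt(abs(m13 * m23 / m12))
    return a1, a2, a3

# Synthetic check: each source agrees with y with probability p_i,
# so the true accuracies are a_i = 2 * p_i - 1.
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=100_000)
probs = [0.9, 0.8, 0.7]
L = np.stack(
    [np.where(rng.random(y.size) < p, y, -y) for p in probs], axis=1
)
a1, a2, a3 = triplet_accuracies(L)  # approx (0.8, 0.6, 0.4)
```

Because the estimates come from simple sample moments rather than iterative latent-variable inference, this kind of solution is what allows the claimed orders-of-magnitude speedup.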