Reliably detecting model failures in deployment without labels
- URL: http://arxiv.org/abs/2506.05047v2
- Date: Mon, 09 Jun 2025 16:57:42 GMT
- Title: Reliably detecting model failures in deployment without labels
- Authors: Viet Nguyen, Changjian Shui, Vijay Giri, Siddarth Arya, Amol Verma, Fahad Razak, Rahul G. Krishnan
- Abstract summary: This paper formalizes and addresses the problem of post-deployment deterioration (PDD) monitoring. We propose D3M, a practical and efficient monitoring algorithm based on the disagreement of predictive models. Empirical results on both standard benchmarks and a real-world large-scale internal medicine dataset demonstrate the effectiveness of the framework.
- Score: 10.006585036887929
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The distribution of data changes over time; models operating in dynamic environments need retraining. But knowing when to retrain, without access to labels, is an open challenge, since some, but not all, shifts degrade model performance. This paper formalizes and addresses the problem of post-deployment deterioration (PDD) monitoring. We propose D3M, a practical and efficient monitoring algorithm based on the disagreement of predictive models, which achieves low false positive rates under non-deteriorating shifts and provides sample complexity bounds for high true positive rates under deteriorating shifts. Empirical results on both standard benchmarks and a real-world large-scale internal medicine dataset demonstrate the effectiveness of the framework and highlight its viability as an alert mechanism for high-stakes machine learning pipelines.
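The core idea — raise an alarm when an ensemble of predictors starts to disagree on unlabeled deployment data — can be illustrated with a short sketch. The code below is a minimal illustrative approximation, not the authors' D3M algorithm: the scikit-learn-style `predict` interface, the bootstrap calibration, and the `alpha` false-positive target are all assumptions made for this example, and the sketch does not reproduce the paper's sample-complexity-backed thresholds.

```python
# Minimal sketch of disagreement-based deterioration monitoring.
# Assumes `models` is a list of fitted classifiers exposing .predict(X).
import numpy as np

def disagreement_rate(models, X):
    """Fraction of inputs on which the models' predicted labels differ."""
    preds = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
    return float(np.mean(np.any(preds != preds[0], axis=0)))

def calibrate_threshold(models, X_ref, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap the disagreement rate on held-out in-distribution data
    to pick a threshold with an approximate alpha false-alarm rate."""
    rng = np.random.default_rng(seed)
    n = len(X_ref)
    rates = [disagreement_rate(models, X_ref[rng.integers(0, n, n)])
             for _ in range(n_boot)]
    return float(np.quantile(rates, 1 - alpha))

def monitor(models, X_deploy, threshold):
    """Return (alert, rate): alert fires when deployment-time
    disagreement exceeds the calibrated threshold."""
    rate = disagreement_rate(models, X_deploy)
    return rate > threshold, rate
```

In practice the ensemble members might be models trained on different folds or seeds of the source data; D3M's false-positive and true-positive guarantees depend on how the disagreeing models and the threshold are constructed, which this sketch only gestures at.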
Related papers
- CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts [67.48102304531734]
We introduce CNS-Bench, a Continuous Nuisance Shift Benchmark to quantify the robustness of image classifiers under continuous and realistic nuisance shifts. We propose a filtering mechanism that outperforms previous methods, thereby enabling reliable benchmarking with generative models.
arXiv Detail & Related papers (2025-07-23T16:15:48Z) - Stress-Testing ML Pipelines with Adversarial Data Corruption [11.91482648083998]
Regulators now demand evidence that high-stakes systems can withstand realistic, interdependent errors. We introduce SAVAGE, a framework that formally models data-quality issues through dependency graphs and flexible corruption templates. SAVAGE employs a bi-level optimization approach to efficiently identify vulnerable data subpopulations and fine-tune corruption severity.
arXiv Detail & Related papers (2025-06-02T00:41:24Z) - WATCH: Adaptive Monitoring for AI Deployments via Weighted-Conformal Martingales [13.807613678989664]
Methods for nonparametric sequential testing -- especially conformal test martingales (CTMs) and anytime-valid inference -- offer promising tools for this monitoring task. Existing approaches are restricted to monitoring limited hypothesis classes or alarm criteria.
arXiv Detail & Related papers (2025-05-07T17:53:47Z) - Strengthening Anomaly Awareness [0.0]
We present a refined version of the Anomaly Awareness framework for enhancing unsupervised anomaly detection. Our approach introduces minimal supervision into Variational Autoencoders (VAEs) through a two-stage training strategy.
arXiv Detail & Related papers (2025-04-15T16:52:22Z) - Robust Distribution Alignment for Industrial Anomaly Detection under Distribution Shift [51.24522135151649]
Anomaly detection plays a crucial role in quality control for industrial applications. Existing methods attempt to address domain shifts by training generalizable models. Our proposed method demonstrates superior results compared with state-of-the-art anomaly detection and domain adaptation methods.
arXiv Detail & Related papers (2025-03-19T05:25:52Z) - Does Unsupervised Domain Adaptation Improve the Robustness of Amortized Bayesian Inference? A Systematic Evaluation [3.4109073456116477]
Recent robust approaches employ unsupervised domain adaptation (UDA) to match the embedding spaces of simulated and observed data. We demonstrate that aligning summary spaces between domains effectively mitigates the impact of unmodeled phenomena or noise. Our results underscore the need for careful consideration of misspecification types when using UDA to increase the robustness of ABI.
arXiv Detail & Related papers (2025-02-07T14:13:51Z) - Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z) - An Outlier Exposure Approach to Improve Visual Anomaly Detection Performance for Mobile Robots [76.36017224414523]
We consider the problem of building visual anomaly detection systems for mobile robots.
Standard anomaly detection models are trained using large datasets composed only of non-anomalous data.
We tackle the problem of exploiting these data to improve the performance of a Real-NVP anomaly detection model.
arXiv Detail & Related papers (2022-09-20T15:18:13Z) - CausalAgents: A Robustness Benchmark for Motion Forecasting using Causal Relationships [8.679073301435265]
We construct a new benchmark for evaluating and improving model robustness by applying perturbations to existing data.
We use these labels to perturb the data by deleting non-causal agents from the scene.
Under non-causal perturbations, we observe a 25-38% relative change in minADE as compared to the original.
arXiv Detail & Related papers (2022-07-07T21:28:23Z) - Self-Supervised Training with Autoencoders for Visual Anomaly Detection [61.62861063776813]
We focus on a specific use case in anomaly detection where the distribution of normal samples is supported by a lower-dimensional manifold.
We adapt a self-supervised learning regime that exploits discriminative information during training but focuses on the submanifold of normal examples.
We achieve a new state-of-the-art result on the MVTec AD dataset -- a challenging benchmark for visual anomaly detection in the manufacturing domain.
arXiv Detail & Related papers (2022-06-23T14:16:30Z) - Tracking the risk of a deployed model and detecting harmful distribution shifts [105.27463615756733]
In practice, it may make sense to ignore benign shifts, under which the performance of a deployed model does not degrade substantially.
We argue that a sensible method for firing off a warning has to both (a) detect harmful shifts while ignoring benign ones, and (b) allow continuous monitoring of model performance without increasing the false alarm rate.
arXiv Detail & Related papers (2021-10-12T17:21:41Z) - SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection [63.253850875265115]
Outlier detection (OD) is a key machine learning (ML) task for identifying abnormal objects from general samples.
We propose a modular acceleration system, called SUOD, to address it.
arXiv Detail & Related papers (2020-03-11T00:22:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.