WATCH: Adaptive Monitoring for AI Deployments via Weighted-Conformal Martingales
- URL: http://arxiv.org/abs/2505.04608v3
- Date: Sun, 01 Jun 2025 20:18:25 GMT
- Title: WATCH: Adaptive Monitoring for AI Deployments via Weighted-Conformal Martingales
- Authors: Drew Prinster, Xing Han, Anqi Liu, Suchi Saria
- Abstract summary: Methods for nonparametric sequential testing -- especially conformal test martingales (CTMs) and anytime-valid inference -- offer promising tools for this monitoring task. Existing approaches are restricted to monitoring limited hypothesis classes or "alarm criteria".
- Score: 13.807613678989664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Responsibly deploying artificial intelligence (AI) / machine learning (ML) systems in high-stakes settings arguably requires not only proof of system reliability, but also continual, post-deployment monitoring to quickly detect and address any unsafe behavior. Methods for nonparametric sequential testing -- especially conformal test martingales (CTMs) and anytime-valid inference -- offer promising tools for this monitoring task. However, existing approaches are restricted to monitoring limited hypothesis classes or "alarm criteria" (e.g., detecting data shifts that violate certain exchangeability or IID assumptions), do not allow for online adaptation in response to shifts, and/or cannot diagnose the cause of degradation or alarm. In this paper, we address these limitations by proposing a weighted generalization of conformal test martingales (WCTMs), which lay a theoretical foundation for online monitoring for any unexpected changepoints in the data distribution while controlling false alarms. For practical applications, we propose specific WCTM algorithms that adapt online to mild covariate shifts (in the marginal input distribution), quickly detect harmful shifts, and diagnose those harmful shifts as concept shifts (in the conditional label distribution) or extreme (out-of-support) covariate shifts that cannot be easily adapted to. On real-world datasets, we demonstrate improved performance relative to state-of-the-art baselines.
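For intuition, the following is a minimal sketch of a plain (unweighted) conformal test martingale of the kind WCTMs generalize; the nonconformity scores are assumed given, the epsilon-betting function and alarm threshold are illustrative choices, and the paper's weighting scheme for adapting to covariate shift is not reproduced here.

```python
import numpy as np

def conformal_pvalue(past_scores, new_score, rng):
    """Smoothed conformal p-value: rank of the new nonconformity score
    among all scores seen so far, with randomized tie-breaking."""
    past = np.asarray(past_scores)
    greater = np.sum(past > new_score)
    equal = np.sum(past == new_score) + 1  # include the new score itself
    return (greater + rng.uniform() * equal) / (len(past) + 1)

def ctm_monitor(nonconformity_scores, epsilon=0.1, alarm_threshold=100.0, seed=0):
    """Plain conformal test martingale with the epsilon-betting function
    g(p) = epsilon * p**(epsilon - 1). Under exchangeability, the p-values
    are uniform and M_t is a nonnegative martingale, so by Ville's
    inequality alarming at M_t >= 100 keeps the false-alarm rate <= 1%."""
    rng = np.random.default_rng(seed)
    martingale, seen = 1.0, []
    for t, s in enumerate(nonconformity_scores):
        p = conformal_pvalue(seen, s, rng)
        martingale *= epsilon * p ** (epsilon - 1)
        seen.append(s)
        if martingale >= alarm_threshold:
            return t, martingale  # changepoint alarm at step t
    return None, martingale
```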
Related papers
- Statistically Valid Post-Deployment Monitoring Should Be Standard for AI-Based Digital Health [14.256683587576935]
Only 9% of FDA-registered AI-based healthcare tools include a post-deployment surveillance plan. Existing monitoring approaches are often manual, sporadic, and reactive. We propose that the detection of changes in the data and model performance degradation should be framed as distinct statistical hypothesis testing problems.
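As a toy illustration of that framing (not the paper's procedure), degradation detection can be cast as a one-sided test of the deployed error rate against a validation-time baseline:

```python
from scipy.stats import binomtest

def degradation_test(n_errors, n_samples, baseline_error_rate, alpha=0.05):
    """One-sided test of H0: deployed error rate <= baseline error rate.
    A small p-value is evidence of post-deployment degradation."""
    result = binomtest(n_errors, n_samples, p=baseline_error_rate,
                       alternative="greater")
    return result.pvalue, result.pvalue < alpha

# e.g., 46 errors in 300 monitored predictions vs. a 10% validation error rate
pval, degraded = degradation_test(46, 300, baseline_error_rate=0.10)
```

Note that naively repeating such a fixed-sample test over time inflates the false-alarm rate, which is what the sequential, anytime-valid methods above are designed to avoid.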
arXiv Detail & Related papers (2025-06-06T03:04:44Z) - Reliably detecting model failures in deployment without labels [10.006585036887929]
This paper formalizes and addresses the problem of post-deployment deterioration (PDD) monitoring. We propose D3M, a practical and efficient monitoring algorithm based on the disagreement of predictive models. Empirical results on both standard benchmarks and a real-world large-scale internal medicine dataset demonstrate the effectiveness of the framework.
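D3M's exact algorithm is not spelled out in this summary; the sketch below only illustrates the underlying disagreement signal, with an ensemble of fitted classifiers and a threshold assumed to be calibrated on held-out in-distribution data:

```python
import numpy as np

def disagreement_rate(models, X):
    """Fraction of inputs on which the ensemble's (numeric) label
    predictions are not unanimous."""
    preds = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
    return float(np.mean(preds.min(axis=0) != preds.max(axis=0)))

def monitor_batch(models, X_batch, threshold):
    """Label-free deterioration check: alarm when disagreement on a
    deployment batch exceeds a threshold calibrated on in-distribution
    data (e.g., its 99th percentile there)."""
    rate = disagreement_rate(models, X_batch)
    return rate > threshold, rate
```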
arXiv Detail & Related papers (2025-06-05T13:56:18Z) - Context-Aware Online Conformal Anomaly Detection with Prediction-Powered Data Acquisition [35.59201763567714]
We introduce context-aware prediction-powered conformal online anomaly detection (C-PP-COAD). Our framework strategically leverages synthetic calibration data to mitigate data scarcity, while adaptively integrating real data based on contextual cues. Experiments conducted on both synthetic and real-world datasets demonstrate that C-PP-COAD significantly reduces dependency on real calibration data without compromising guaranteed false discovery rate (FDR) control.
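A minimal sketch of the conformal core the method builds on; the adaptive synthetic/real mixing and FDR machinery of C-PP-COAD are more involved and omitted here:

```python
import numpy as np

def conformal_anomaly_pvalue(calibration_scores, test_score):
    """Conformal p-value: super-uniform when the test point is exchangeable
    with the calibration data, so thresholding at alpha bounds the
    per-test false-alarm probability."""
    cal = np.asarray(calibration_scores)
    return (1 + np.sum(cal >= test_score)) / (len(cal) + 1)

# Data-scarce regime: start from synthetic calibration scores and swap in
# real ones as they arrive (a stand-in for the paper's adaptive integration).
rng = np.random.default_rng(0)
synthetic_scores = rng.normal(size=200)  # hypothetical synthetic calibration
p = conformal_anomaly_pvalue(synthetic_scores, test_score=3.1)
is_anomaly = p < 0.05
```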
arXiv Detail & Related papers (2025-05-03T10:58:05Z) - Reshaping the Online Data Buffering and Organizing Mechanism for Continual Test-Time Adaptation [49.53202761595912]
Continual Test-Time Adaptation involves adapting a pre-trained source model to continually changing unsupervised target domains.
We analyze the challenges of this task: online environment, unsupervised nature, and the risks of error accumulation and catastrophic forgetting.
We propose an uncertainty-aware buffering approach to identify and aggregate significant samples with high certainty from the unsupervised, single-pass data stream.
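A simplified sketch of uncertainty-aware buffering, assuming predictive entropy as the certainty measure and a fixed-capacity buffer; the paper's aggregation and forgetting safeguards are omitted:

```python
import heapq
import numpy as np

def predictive_entropy(probs):
    """Entropy of a softmax output; low entropy = high certainty."""
    p = np.clip(np.asarray(probs), 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

class CertaintyBuffer:
    """Fixed-capacity buffer keeping the most certain (lowest-entropy)
    samples seen so far in a single pass over an unlabeled stream."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self._heap = []    # min-heap on -entropy: root = worst kept sample
        self._count = 0    # tie-breaker so samples are never compared

    def offer(self, sample, probs):
        item = (-predictive_entropy(probs), self._count, sample)
        self._count += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif item[0] > self._heap[0][0]:  # more certain than worst kept
            heapq.heapreplace(self._heap, item)

    def samples(self):
        return [s for _, _, s in self._heap]
```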
arXiv Detail & Related papers (2024-07-12T15:48:40Z) - Automatically Adaptive Conformal Risk Control [49.95190019041905]
We propose a methodology for achieving approximate conditional control of statistical risks by adapting to the difficulty of test samples. Our framework goes beyond traditional conditional risk control based on user-provided conditioning events to the algorithmic, data-driven determination of appropriate function classes for conditioning.
arXiv Detail & Related papers (2024-06-25T08:29:32Z) - Continuous Test-time Domain Adaptation for Efficient Fault Detection under Evolving Operating Conditions [10.627285023764086]
We propose a Test-time domain Adaptation Anomaly Detection (TAAD) framework that separates input variables into system parameters and measurements.
Our approach, tested on a real-world pump monitoring dataset, shows significant improvements over existing domain adaptation methods in fault detection.
arXiv Detail & Related papers (2024-06-06T15:53:14Z) - Incorporating Gradients to Rules: Towards Lightweight, Adaptive Provenance-based Intrusion Detection [11.14938737864796]
We propose CAPTAIN, a rule-based PIDS capable of automatically adapting to diverse environments.
We build a differentiable tag propagation framework and utilize the gradient descent algorithm to optimize these adaptive parameters.
The evaluation results demonstrate that CAPTAIN offers better detection accuracy, less detection latency, lower runtime overhead, and more interpretable detection alarms and knowledge.
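CAPTAIN's differentiable tag propagation is system-specific; the toy sketch below shows only the general trick of relaxing a hard alarm rule to a sigmoid so an adaptive parameter can be fit by gradient descent (the objective and the labeled attack scores are hypothetical, not the paper's setup):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_adaptive_threshold(benign_scores, attack_scores, lr=0.5, steps=500, k=10.0):
    """Relax `alarm if score > tau` to sigmoid(k * (score - tau)) and
    minimize soft false alarms on benign events plus soft misses on known
    attacks by gradient descent on tau (manual gradient, no autograd)."""
    b, a = np.asarray(benign_scores), np.asarray(attack_scores)
    tau = float((b.mean() + a.mean()) / 2.0)
    for _ in range(steps):
        zb = sigmoid(k * (b - tau))   # soft P(alarm | benign): push down
        za = sigmoid(k * (a - tau))   # soft P(alarm | attack): push up
        # gradient of [mean(zb) - mean(za)] with respect to tau
        grad = np.mean(-k * zb * (1 - zb)) + np.mean(k * za * (1 - za))
        tau -= lr * grad
    return tau
```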
arXiv Detail & Related papers (2024-04-23T03:50:57Z) - Condition Monitoring with Incomplete Data: An Integrated Variational Autoencoder and Distance Metric Framework [2.7898966850590625]
This paper introduces a new method for fault detection and condition monitoring for unseen data.
We use a variational autoencoder to capture the probabilistic distribution of previously seen and new unseen conditions.
Faults are detected by establishing a threshold for the health indexes, allowing the model to identify severe, unseen faults with high accuracy, even amidst noise.
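A minimal sketch of the thresholding step, assuming health indexes (e.g., reconstruction errors or latent-space distances from the trained VAE) have already been computed on known-healthy data:

```python
import numpy as np

def fit_health_threshold(healthy_indexes, quantile=0.995):
    """Alarm threshold set from the empirical distribution of health
    indexes on known-healthy data."""
    return float(np.quantile(np.asarray(healthy_indexes), quantile))

def detect_fault(health_index, threshold):
    """Flag a (possibly previously unseen) fault when a new observation's
    health index exceeds the healthy-data threshold."""
    return health_index > threshold
```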
arXiv Detail & Related papers (2024-04-08T22:20:23Z) - Enhancing Security in Federated Learning through Adaptive Consensus-Based Model Update Validation [2.28438857884398]
This paper introduces an advanced approach for fortifying Federated Learning (FL) systems against label-flipping attacks.
We propose a consensus-based verification process integrated with an adaptive thresholding mechanism.
Our results indicate a significant mitigation of label-flipping attacks, bolstering the FL system's resilience.
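The paper's exact consensus rule is not given in this summary; a plausible minimal sketch scores each client update against a robust consensus direction and rejects outliers below an adaptive threshold:

```python
import numpy as np

def validate_updates(client_updates, n_sigma=2.0):
    """Consensus check for one FL round: score each flattened client update
    by cosine similarity to the coordinate-wise median update, then reject
    clients whose score falls more than n_sigma standard deviations below
    the mean -- a simple proxy for filtering label-flipped updates."""
    U = np.stack([np.ravel(u) for u in client_updates])  # (n_clients, dim)
    consensus = np.median(U, axis=0)
    sims = U @ consensus / (np.linalg.norm(U, axis=1)
                            * np.linalg.norm(consensus) + 1e-12)
    threshold = sims.mean() - n_sigma * sims.std()       # adaptive threshold
    return sims >= threshold, sims                       # accepted mask, scores
```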
arXiv Detail & Related papers (2024-03-05T20:54:56Z) - Cal-DETR: Calibrated Detection Transformer [67.75361289429013]
We propose a mechanism for calibrated detection transformers (Cal-DETR), particularly for Deformable-DETR, UP-DETR and DINO.
We develop an uncertainty-guided logit modulation mechanism that leverages the uncertainty to modulate the class logits.
Results corroborate the effectiveness of Cal-DETR against the competing train-time methods in calibrating both in-domain and out-domain detections.
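Cal-DETR's modulation is derived from decoder-layer statistics; the sketch below shows one plausible, framework-agnostic form of the general idea of scaling class logits down where uncertainty is high:

```python
import numpy as np

def modulate_logits(logits, uncertainty):
    """Uncertainty-guided logit modulation (illustrative form): scale each
    query's class logits by (1 - u), so uncertain detections get flatter,
    better-calibrated class distributions. `uncertainty` is per-query in
    [0, 1]; Cal-DETR derives it from variance across decoder layers."""
    u = np.clip(np.asarray(uncertainty), 0.0, 1.0)[:, None]  # (n_queries, 1)
    return np.asarray(logits) * (1.0 - u)
```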
arXiv Detail & Related papers (2023-11-06T22:13:10Z) - DARTH: Holistic Test-time Adaptation for Multiple Object Tracking [87.72019733473562]
Multiple object tracking (MOT) is a fundamental component of perception systems for autonomous driving.
Despite the safety-critical nature of driving systems, no solution to the problem of adapting MOT to domain shift under test-time conditions has previously been proposed.
We introduce DARTH, a holistic test-time adaptation framework for MOT.
arXiv Detail & Related papers (2023-10-03T10:10:42Z) - Tracking the risk of a deployed model and detecting harmful distribution shifts [105.27463615756733]
In practice, it may make sense to ignore benign shifts, under which the performance of a deployed model does not degrade substantially.
We argue that a sensible method for firing off a warning has to both (a) detect harmful shifts while ignoring benign ones, and (b) allow continuous monitoring of model performance without increasing the false alarm rate.
arXiv Detail & Related papers (2021-10-12T17:21:41Z) - Detecting Rewards Deterioration in Episodic Reinforcement Learning [63.49923393311052]
In many RL applications, once training ends, it is vital to detect any deterioration in agent performance as soon as possible.
We consider an episodic framework, where the rewards within each episode are neither independent, nor identically distributed, nor Markov.
We define the mean-shift in a way corresponding to deterioration of a temporal signal (such as the rewards), and derive a test for this problem with optimal statistical power.
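The paper's test models within-episode reward correlation to achieve optimal power; the sketch below is a deliberately simplified stand-in that tests episode-level returns for a downward mean shift:

```python
from scipy.stats import ttest_ind

def rewards_deterioration_test(reference_returns, recent_returns, alpha=0.01):
    """Simplified degradation check: one-sided Welch t-test of whether the
    mean of recent episode returns has shifted below the reference mean.
    (Using one return per episode sidesteps within-episode correlation,
    at the cost of the statistical power the paper's test recovers.)"""
    res = ttest_ind(recent_returns, reference_returns,
                    equal_var=False, alternative="less")
    return res.pvalue, res.pvalue < alpha
```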
arXiv Detail & Related papers (2020-10-22T12:45:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.