Statistically Valid Post-Deployment Monitoring Should Be Standard for AI-Based Digital Health
- URL: http://arxiv.org/abs/2506.05701v1
- Date: Fri, 06 Jun 2025 03:04:44 GMT
- Title: Statistically Valid Post-Deployment Monitoring Should Be Standard for AI-Based Digital Health
- Authors: Pavel Dolin, Weizhi Li, Gautam Dasarathy, Visar Berisha
- Abstract summary: Only 9% of FDA-registered AI-based healthcare tools include a post-deployment surveillance plan. Existing monitoring approaches are often manual, sporadic, and reactive. We propose that the detection of changes in the data and model performance degradation should be framed as distinct statistical hypothesis testing problems.
- Score: 14.256683587576935
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This position paper argues that post-deployment monitoring in clinical AI is underdeveloped and proposes statistically valid and label-efficient testing frameworks as a principled foundation for ensuring reliability and safety in real-world deployment. A recent review found that only 9% of FDA-registered AI-based healthcare tools include a post-deployment surveillance plan. Existing monitoring approaches are often manual, sporadic, and reactive, making them ill-suited for the dynamic environments in which clinical models operate. We contend that post-deployment monitoring should be grounded in label-efficient and statistically valid testing frameworks, offering a principled alternative to current practices. We use the term "statistically valid" to refer to methods that provide explicit guarantees on error rates (e.g., Type I/II error), enable formal inference under pre-defined assumptions, and support reproducibility--features that align with regulatory requirements. Specifically, we propose that the detection of changes in the data and model performance degradation should be framed as distinct statistical hypothesis testing problems. Grounding monitoring in statistical rigor ensures a reproducible and scientifically sound basis for maintaining the reliability of clinical AI systems. Importantly, it also opens new research directions for the technical community--spanning theory, methods, and tools for statistically principled detection, attribution, and mitigation of post-deployment model failures in real-world settings.
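The paper's central proposal can be made concrete with a small sketch. The code below is a minimal illustration, not the authors' implementation; the choice of tests, window sizes, and significance level are assumptions. It frames data drift detection and performance degradation as two separate hypothesis tests with an explicit, pre-specified Type I error level:

```python
import numpy as np
from scipy import stats

def detect_data_drift(reference, window, alpha=0.05):
    """H0: the deployment window is drawn from the reference distribution.
    A two-sample Kolmogorov-Smirnov test caps the false-alarm (Type I)
    rate at alpha."""
    _, p_value = stats.ks_2samp(reference, window)
    return p_value < alpha, p_value

def detect_degradation(reference_errors, deployment_errors, alpha=0.05):
    """H0: deployment errors are not stochastically larger than reference
    errors. A one-sided Mann-Whitney U test flags genuine degradation."""
    _, p_value = stats.mannwhitneyu(deployment_errors, reference_errors,
                                    alternative="greater")
    return p_value < alpha, p_value

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)   # feature values at validation time
window = rng.normal(0.4, 1.0, size=500)       # recent post-deployment window
alarm, p = detect_data_drift(reference, window)
print(f"drift alarm: {alarm} (p = {p:.3g})")
```

Because monitoring repeats such tests over time, naive re-testing inflates the overall false-alarm rate; the sequential and anytime-valid methods among the related papers below address exactly that.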
Related papers
- Calibrated Prediction Set in Fault Detection with Risk Guarantees via Significance Tests [3.500936878570599]
This paper proposes a novel fault detection method that integrates significance testing with the conformal prediction framework to provide formal risk guarantees. The proposed method consistently achieves an empirical coverage rate at or above the nominal level ($1-\alpha$). The results reveal a controllable trade-off between the user-defined risk level ($\alpha$) and efficiency, where higher risk tolerance leads to smaller average prediction set sizes.
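For context, a generic split conformal construction shows the kind of coverage guarantee described here; this is a textbook sketch under an exchangeability assumption, not the paper's specific significance-test integration:

```python
import numpy as np

def conformal_prediction_set(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction for classification (generic sketch).
    Nonconformity score: 1 - predicted probability of the true class.
    Under exchangeability, the returned sets contain the true label
    with probability at least 1 - alpha on average."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]    # calibration scores
    q_level = np.ceil((n + 1) * (1 - alpha)) / n          # finite-sample correction
    qhat = np.quantile(scores, min(q_level, 1.0), method="higher")
    # include every label whose score falls below the calibrated quantile
    return [np.where(1.0 - p <= qhat)[0] for p in test_probs]
```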
arXiv Detail & Related papers (2025-08-02T05:49:02Z)
- WATCH: Adaptive Monitoring for AI Deployments via Weighted-Conformal Martingales [13.807613678989664]
Methods for nonparametric sequential testing -- especially conformal test martingales (CTMs) and anytime-valid inference -- offer promising tools for this monitoring task. Existing approaches are restricted to monitoring limited hypothesis classes or "alarm criteria".
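A bare-bones conformal test martingale conveys the anytime-valid mechanism (a generic sketch, not WATCH's weighted-conformal variant; the power betting function and its parameter are assumptions):

```python
import numpy as np

def conformal_test_martingale(stream, betting_eps=0.1, alpha=0.01):
    """Sequentially bet against exchangeability of the stream.
    Uses smoothed conformal p-values and the power betting function
    b(p) = eps * p**(eps - 1), which integrates to 1 on [0, 1].
    By Ville's inequality, under exchangeability
    P(sup_t M_t >= 1/alpha) <= alpha, so the alarm is anytime-valid."""
    threshold = 1.0 / alpha
    past, m = [], 1.0
    rng = np.random.default_rng(0)
    for t, x in enumerate(stream, 1):
        # smoothed conformal p-value: randomized rank of x among all t points
        greater = sum(s > x for s in past)
        equal = sum(s == x for s in past) + 1   # +1 counts x itself
        p = (greater + rng.uniform() * equal) / t
        m *= betting_eps * p ** (betting_eps - 1.0)
        past.append(x)
        if m >= threshold:
            return t, m    # alarm time and martingale value
    return None, m
```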
arXiv Detail & Related papers (2025-05-07T17:53:47Z)
- Context-Aware Online Conformal Anomaly Detection with Prediction-Powered Data Acquisition [35.59201763567714]
We introduce context-aware prediction-powered conformal online anomaly detection (C-PP-COAD). Our framework strategically leverages synthetic calibration data to mitigate data scarcity, while adaptively integrating real data based on contextual cues. Experiments conducted on both synthetic and real-world datasets demonstrate that C-PP-COAD significantly reduces dependency on real calibration data without compromising guaranteed false discovery rate (FDR) control.
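The FDR guarantee in this line of work typically comes from combining conformal p-values with a multiple-testing procedure; the sketch below shows that generic recipe (conformal p-values plus Benjamini-Hochberg), not C-PP-COAD's prediction-powered calibration:

```python
import numpy as np

def conformal_anomaly_pvalues(cal_scores, test_scores):
    """Conformal p-values for anomaly detection. Higher score = more
    anomalous; p-values are valid when nominal test points are
    exchangeable with the calibration set."""
    cal = np.asarray(cal_scores)
    n = len(cal)
    return np.array([(1 + np.sum(cal >= s)) / (n + 1) for s in test_scores])

def benjamini_hochberg(pvals, q=0.1):
    """Flag discoveries while controlling the FDR at level q."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject
```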
arXiv Detail & Related papers (2025-05-03T10:58:05Z)
- Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation [52.83870601473094]
Embodied agents exhibit immense potential across a multitude of domains. Existing research predominantly concentrates on the security of general large language models. This paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents.
arXiv Detail & Related papers (2025-04-22T08:34:35Z)
- New Statistical Framework for Extreme Error Probability in High-Stakes Domains for Reliable Machine Learning [4.14360329494344]
Extreme Value Theory (EVT) is a statistical framework that provides a rigorous approach to estimating worst-case failures. Applying EVT to synthetic and real-world datasets, this method is shown to enable robust estimation of catastrophic failure probabilities. This work establishes EVT as a fundamental tool for assessing model reliability, ensuring safer AI deployment in new technologies.
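A peaks-over-threshold fit illustrates how EVT extrapolates beyond the observed error range; this is a standard textbook sketch, not the paper's estimator, and the threshold choice is an assumption:

```python
import numpy as np
from scipy import stats

def tail_failure_probability(errors, threshold, target):
    """Estimate P(error > target) for a target beyond the observed data by
    fitting a generalized Pareto distribution (GPD) to exceedances over
    `threshold` (peaks-over-threshold). Assumes enough exceedances to fit."""
    errors = np.asarray(errors)
    exceedances = errors[errors > threshold] - threshold
    # loc fixed at 0: exceedances start at the threshold by construction
    shape, loc, scale = stats.genpareto.fit(exceedances, floc=0)
    p_over_threshold = np.mean(errors > threshold)
    return p_over_threshold * stats.genpareto.sf(target - threshold,
                                                 shape, loc=loc, scale=scale)
```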
arXiv Detail & Related papers (2025-03-31T16:08:11Z)
- Causal Lifting of Neural Representations: Zero-Shot Generalization for Causal Inferences [56.23412698865433]
We focus on Prediction-Powered Causal Inferences (PPCI). PPCI estimates the treatment effect in a target experiment with unlabeled factual outcomes, retrievable zero-shot from a pre-trained model. We validate our method on synthetic and real-world scientific data, offering solutions to instances not solvable by vanilla Empirical Risk Minimization.
arXiv Detail & Related papers (2025-02-10T10:52:17Z)
- Evaluating the Effectiveness of Index-Based Treatment Allocation [42.040099398176665]
When resources are scarce, an allocation policy is needed to decide who receives a resource.
This paper introduces methods to evaluate index-based allocation policies using data from a randomized control trial.
arXiv Detail & Related papers (2024-02-19T01:55:55Z)
- Designing monitoring strategies for deployed machine learning algorithms: navigating performativity through a causal lens [6.329470650220206]
The aim of this work is to highlight the relatively under-appreciated complexity of designing a monitoring strategy.
We consider an ML-based risk prediction algorithm for predicting unplanned readmissions.
Results from this case study emphasize the seemingly simple (and obvious) fact that not all monitoring systems are created equal.
arXiv Detail & Related papers (2023-11-20T00:15:16Z)
- Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Safe AI for health and beyond -- Monitoring to transform a health service [51.8524501805308]
We will assess the infrastructure required to monitor the outputs of a machine learning algorithm.
We will present two scenarios with examples of monitoring and updates of models.
arXiv Detail & Related papers (2023-03-02T17:27:45Z)
- Recursively Feasible Probabilistic Safe Online Learning with Control Barrier Functions [60.26921219698514]
We introduce a model-uncertainty-aware reformulation of CBF-based safety-critical controllers.
We then present the pointwise feasibility conditions of the resulting safety controller.
We use these conditions to devise an event-triggered online data collection strategy.
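For readers unfamiliar with CBFs, a toy one-dimensional safety filter shows the basic mechanism the paper builds on; this is a minimal sketch of a standard CBF filter, not the paper's model-uncertainty-aware reformulation, and the dynamics and constants are assumptions:

```python
# Toy CBF safety filter for a 1D single integrator x' = u.
# Safe set: h(x) = x_max - x >= 0.
# CBF condition: dh/dt + alpha * h >= 0  =>  u <= alpha * (x_max - x).
def cbf_safety_filter(x: float, u_desired: float, x_max: float = 1.0,
                      alpha: float = 2.0) -> float:
    u_bound = alpha * (x_max - x)   # largest input keeping the safe set invariant
    return min(u_desired, u_bound)  # minimally modify the desired input
```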
arXiv Detail & Related papers (2022-08-23T05:02:09Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
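The ATC recipe is compact enough to sketch directly from this description; using max-softmax as the confidence score is one natural choice and an assumption here:

```python
import numpy as np

def atc_predict_accuracy(src_probs, src_labels, tgt_probs):
    """Average Thresholded Confidence (ATC) sketch. Learn a confidence
    threshold on labeled source data such that the fraction of source
    points above it matches source accuracy, then predict target accuracy
    as the fraction of unlabeled target points above that threshold."""
    src_conf = src_probs.max(axis=1)                        # max-softmax confidence
    src_acc = np.mean(src_probs.argmax(axis=1) == src_labels)
    t = np.quantile(src_conf, 1.0 - src_acc)                # mean(src_conf >= t) ~= src_acc
    return np.mean(tgt_probs.max(axis=1) >= t)              # predicted target accuracy
```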
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Differential privacy and robust statistics in high dimensions [49.50869296871643]
High-dimensional Propose-Test-Release (HPTR) builds upon three crucial components: the exponential mechanism, robust statistics, and the Propose-Test-Release mechanism.
We show that HPTR nearly achieves the optimal sample complexity under several scenarios studied in the literature.
arXiv Detail & Related papers (2021-11-12T06:36:40Z)
- Trust but Verify: Assigning Prediction Credibility by Counterfactual Constrained Learning [123.3472310767721]
Prediction credibility measures are fundamental in statistics and machine learning.
These measures should account for the wide variety of models used in practice.
The framework developed in this work expresses the credibility as a risk-fit trade-off.
arXiv Detail & Related papers (2020-11-24T19:52:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.