All models are local: time to replace external validation with recurrent
local validation
- URL: http://arxiv.org/abs/2305.03219v2
- Date: Sat, 13 May 2023 04:20:16 GMT
- Title: All models are local: time to replace external validation with recurrent
local validation
- Authors: Alex Youssef, Michael Pencina, Anshul Thakur, Tingting Zhu, David
Clifton, Nigam H. Shah
- Abstract summary: External validation is often recommended to ensure the generalizability of ML models.
It neither guarantees generalizability nor equates to a model's clinical usefulness.
We submit that external validation is insufficient to establish ML models' safety or utility.
- Score: 10.043347396280009
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: External validation is often recommended to ensure the generalizability of ML
models. However, it neither guarantees generalizability nor equates to a
model's clinical usefulness (the ultimate goal of any clinical decision-support
tool). External validation is misaligned with current healthcare ML needs.
First, patient data changes across time, geography, and facilities. These
changes create significant volatility in the performance of a single fixed
model (especially for deep learning models, which dominate clinical ML).
Second, newer ML techniques, current market forces, and updated regulatory
frameworks are enabling frequent updating and monitoring of individual deployed
model instances. We submit that external validation is insufficient to
establish ML models' safety or utility. Proposals to fix the external
validation paradigm do not go far enough. Continued reliance on it as the
ultimate test is likely to lead us astray. We propose the MLOps-inspired
paradigm of recurring local validation as an alternative that ensures the
validity of models while protecting against performance-disruptive data
variability. This paradigm relies on site-specific reliability tests before
every deployment, followed by regular and recurrent checks throughout the life
cycle of the deployed algorithm. Initial and recurrent reliability tests
protect against performance-disruptive distribution shifts and concept drifts
that jeopardize patient safety.
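The cadence this paradigm proposes, a site-specific reliability test before deployment plus recurrent checks afterwards, can be made concrete. The Python sketch below is a minimal, hypothetical illustration, not the authors' implementation; the AUROC bar min_auroc, the per-feature drift_check helper, and the choice of tests are all assumptions made for illustration.

from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score

def reliability_test(model, X_local, y_local, min_auroc=0.75):
    # Site-specific reliability test, run before every deployment:
    # the model must clear a locally chosen performance bar on the
    # deployment site's own labeled data.
    auroc = roc_auc_score(y_local, model.predict_proba(X_local)[:, 1])
    return auroc >= min_auroc, auroc

def drift_check(X_reference, X_recent, alpha=0.01):
    # Recurrent check for performance-disruptive distribution shift:
    # a two-sample Kolmogorov-Smirnov test per feature, comparing recent
    # inputs against the data the model was last validated on.
    return [j for j in range(X_reference.shape[1])
            if ks_2samp(X_reference[:, j], X_recent[:, j]).pvalue < alpha]

In practice the per-feature tests would need multiple-comparison control, and outcome labels often arrive with delay; the sketch only marks the two checkpoints of the paradigm, initial validation and recurrent revalidation.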
Related papers
- From Internal Diagnosis to External Auditing: A VLM-Driven Paradigm for Online Test-Time Backdoor Defense [6.783000267839024]
  PRISM achieves state-of-the-art performance, suppressing Attack Success Rate to 1% on CIFAR-10 while improving clean accuracy, establishing a new standard for model-agnostic, externalized security.
  arXiv Detail & Related papers (2026-01-27T10:34:06Z)
- Provably Safe Model Updates [6.7544474785403885]
  We introduce a framework for provably safe model updates.
  We show that relaxing the problem to parameterized abstract domains (orthotopes, zonotopes) yields a tractable primal-dual formulation.
  This enables efficient certification of updates, independent of the data or algorithm used, by projecting them onto the safe domain.
  arXiv Detail & Related papers (2025-12-01T17:19:53Z)
- Filtering instances and rejecting predictions to obtain reliable models in healthcare [0.2524526956420465]
  We propose a two-step data-centric approach to enhance the performance of machine learning models.
  The first step leverages Instance Hardness (IH) to filter problematic instances during training.
  The second step introduces a confidence-based rejection mechanism during inference, ensuring that only reliable predictions are retained.
  arXiv Detail & Related papers (2025-10-28T12:45:20Z)
- Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
  Large language models (LLMs) are used in AI applications in healthcare.
  A red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses in four safety-critical domains.
  A suite of adversarial agents autonomously mutates test cases, identifies and evolves unsafe-triggering strategies, and evaluates responses.
  Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
  arXiv Detail & Related papers (2025-07-30T08:44:22Z)
- Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations [67.35596444651037]
  Vision-language models (VLMs) exhibit remarkable zero-shot capabilities but struggle with distribution shifts in downstream tasks when labeled data is unavailable.
  We propose a Reliable Test-time Adaptation (ReTA) method that enhances reliability from two perspectives.
  arXiv Detail & Related papers (2025-07-13T05:37:33Z)
- Statistically Valid Post-Deployment Monitoring Should Be Standard for AI-Based Digital Health [14.256683587576935]
  Only 9% of FDA-registered AI-based healthcare tools include a post-deployment surveillance plan.
  Existing monitoring approaches are often manual, sporadic, and reactive.
  We propose that detecting changes in the data and degradation in model performance should be framed as distinct statistical hypothesis testing problems (a minimal sketch of this framing appears after this list).
  arXiv Detail & Related papers (2025-06-06T03:04:44Z)
- Aurora: Are Android Malware Classifiers Reliable and Stable under Distribution Shift? [51.12297424766236]
  AURORA is a framework to evaluate malware classifiers based on their confidence quality and operational resilience.
  AURORA is complemented by a set of metrics designed to go beyond point-in-time performance.
  The fragility of SOTA frameworks across datasets of varying drift suggests the need for a return to the whiteboard.
  arXiv Detail & Related papers (2025-05-28T20:22:43Z)
- Secure Generalization through Stochastic Bidirectional Parameter Updates Using Dual-Gradient Mechanism [6.03163048890944]
  Federated learning (FL) has gained increasing attention due to privacy-preserving collaborative training on decentralized clients.
  Recent research has underscored the risk of exposing private data to adversaries, even within FL frameworks.
  We generate diverse models for each client by applying systematic perturbations to model parameters at a fine-grained level.
  arXiv Detail & Related papers (2025-04-03T02:06:57Z)
- LookAhead Tuning: Safer Language Models via Partial Answer Previews [62.529794567687354]
  Fine-tuning enables large language models to adapt to specific domains, but often compromises their previously established safety alignment.
  We introduce LookAhead Tuning, a lightweight and effective data-driven approach that preserves safety during fine-tuning.
  arXiv Detail & Related papers (2025-03-24T18:11:42Z)
- Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving [7.064497253920508]
  We propose using Vision Foundation Models (VFMs) as feature extractors together with density modeling techniques.
  A comparison with state-of-the-art binary OOD classification methods reveals that VFM embeddings with density estimation outperform existing approaches in identifying OOD inputs.
  Our method detects high-risk inputs likely to cause errors in downstream tasks, thereby improving overall performance.
  arXiv Detail & Related papers (2025-01-14T12:51:34Z)
- Unsupervised Model Diagnosis [49.36194740479798]
  This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
  Our approach identifies and visualizes changes in semantics, then matches these changes to attributes from wide-ranging text sources.
  arXiv Detail & Related papers (2024-10-08T17:59:03Z)
- Enhancing Security in Federated Learning through Adaptive Consensus-Based Model Update Validation [2.28438857884398]
  This paper introduces an advanced approach for fortifying federated learning (FL) systems against label-flipping attacks.
  We propose a consensus-based verification process integrated with an adaptive thresholding mechanism.
  Our results indicate significant mitigation of label-flipping attacks, bolstering the FL system's resilience.
  arXiv Detail & Related papers (2024-03-05T20:54:56Z)
- Monitoring Machine Learning Models: Online Detection of Relevant Deviations [0.0]
  Machine learning models can degrade over time due to changes in data distribution or other factors.
  We propose a sequential monitoring scheme to detect relevant changes.
  Our research contributes a practical solution for distinguishing between minor fluctuations and meaningful degradations.
  arXiv Detail & Related papers (2023-09-26T18:46:37Z)
- SureFED: Robust Federated Learning via Uncertainty-Aware Inward and Outward Inspection [29.491675102478798]
  We introduce SureFED, a novel framework for robust federated learning.
  SureFED establishes trust using the local information of benign clients.
  We theoretically prove the robustness of our algorithm against data and model poisoning attacks.
  arXiv Detail & Related papers (2023-08-04T23:51:05Z)
- A Generative Framework for Low-Cost Result Validation of Machine Learning-as-a-Service Inference [4.478182379059458]
  Fides is a novel framework for real-time integrity validation of Machine-Learning-as-a-Service (MLaaS) inference.
  Fides features a client-side attack detection model that uses statistical analysis and divergence measurements to identify, with high likelihood, whether the service model is under attack.
  We devised a generative adversarial network framework for training the attack detection and re-classification models.
  arXiv Detail & Related papers (2023-03-31T19:17:30Z)
- Safe AI for health and beyond -- Monitoring to transform a health service [51.8524501805308]
  We will assess the infrastructure required to monitor the outputs of a machine learning algorithm.
  We will present two scenarios with examples of monitoring and updates of models.
  arXiv Detail & Related papers (2023-03-02T17:27:45Z)
- Distillation to Enhance the Portability of Risk Models Across Institutions with Large Patient Claims Database [12.452703677540505]
  We investigate the practicality of model portability through a cross-site evaluation of readmission prediction models.
  We apply a recurrent neural network, augmented with self-attention and blended with expert features, to build readmission prediction models.
  Our experiments show that ML models trained at one institution and tested at another perform worse than models trained and tested at the same institution.
  arXiv Detail & Related papers (2022-07-06T05:26:32Z)
- Adaptive Memory Networks with Self-supervised Learning for Unsupervised Anomaly Detection [54.76993389109327]
  Unsupervised anomaly detection aims to build models that detect unseen anomalies by training only on normal data.
  We propose a novel approach called Adaptive Memory Network with Self-supervised Learning (AMSL) to address these challenges.
  AMSL incorporates a self-supervised learning module to learn general normal patterns and an adaptive memory fusion module to learn rich feature representations.
  arXiv Detail & Related papers (2022-01-03T03:40:21Z)
- Certified Adversarial Defenses Meet Out-of-Distribution Corruptions: Benchmarking Robustness and Simple Baselines [65.0803400763215]
  This work critically examines how adversarial robustness guarantees change when state-of-the-art certifiably robust models encounter out-of-distribution data.
  We propose a novel data augmentation scheme, FourierMix, that produces augmentations to improve the spectral coverage of the training data.
  We find that FourierMix augmentations help eliminate the spectral bias of certifiably robust models, enabling them to achieve significantly better robustness guarantees on a range of OOD benchmarks.
  arXiv Detail & Related papers (2021-12-01T17:11:22Z)
- MEMO: Test Time Robustness via Adaptation and Augmentation [131.28104376280197]
  We study the problem of test-time robustification, i.e., using the test input to improve model robustness.
  Recent prior works have proposed methods for test-time adaptation; however, they each introduce additional assumptions.
  We propose a simple approach that can be used in any test setting where the model is probabilistic and adaptable.
  arXiv Detail & Related papers (2021-10-18T17:55:11Z)
- Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift [100.52588638477862]
  We develop an approximate Bayesian inference scheme based on posterior regularisation.
  We demonstrate the utility of our method in the context of transferring prognostic models of prostate cancer across globally diverse populations.
  arXiv Detail & Related papers (2020-06-26T13:50:19Z)
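Two of the entries above (the post-deployment monitoring paper and the online detection of relevant deviations paper) frame degradation detection as statistical hypothesis testing. A minimal, hypothetical Python sketch of that framing follows; the function name, window size, and thresholds are illustrative assumptions, not taken from either paper.

from scipy.stats import binomtest

def degradation_test(n_errors, n_cases, validated_error_rate, alpha=0.05):
    # One-sided test of H0: the current error rate is no worse than the
    # locally validated rate. Rejecting H0 flags meaningful degradation
    # rather than ordinary fluctuation.
    result = binomtest(n_errors, n_cases, p=validated_error_rate,
                       alternative='greater')
    return result.pvalue < alpha, result.pvalue

# Example: 31 errors in a 200-case monitoring window, against a locally
# validated 10% error rate, is flagged as degradation (p well below 0.05).
degraded, p_value = degradation_test(31, 200, 0.10)

Checking repeatedly over time inflates false alarms, so a real monitoring scheme would need sequential corrections; the sketch shows only the basic framing.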