Related papers: Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models

Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models

URL: http://arxiv.org/abs/2508.00923v1
Date: Wed, 30 Jul 2025 08:44:22 GMT
Title: Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models
Authors: Jiazhen Pan, Bailiang Jian, Paul Hager, Yundi Zhang, Che Liu, Friedrike Jungmann, Hongwei Bran Li, Chenyu You, Junde Wu, Jiayuan Zhu, Fenglin Liu, Yuyuan Liu, Niklas Bubeck, Christian Wachinger, Chen, Chen, Zhenyu Gong, Cheng Ouyang, Georgios Kaissis, Benedikt Wiestler, Daniel Rueckert,
Abstract summary: Large language models (LLMs) are used in AI applications in healthcare.<n>Red-teaming framework that continuously stress-test LLMs can reveal significant weaknesses in four safety-critical domains.<n>A suite of adversarial agents is applied to autonomously mutate test cases, identify/evolve unsafe-triggering strategies, and evaluate responses.<n>Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
Score: 87.66870367661342
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Ensuring the safety and reliability of large language models (LLMs) in clinical practice is critical to prevent patient harm and promote trustworthy healthcare applications of AI. However, LLMs are advancing so rapidly that static safety benchmarks often become obsolete upon publication, yielding only an incomplete and sometimes misleading picture of model trustworthiness. We demonstrate that a Dynamic, Automatic, and Systematic (DAS) red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses of current LLMs across four safety-critical domains: robustness, privacy, bias/fairness, and hallucination. A suite of adversarial agents is applied to autonomously mutate test cases, identify/evolve unsafe-triggering strategies, and evaluate responses, uncovering vulnerabilities in real time without human intervention. Applying DAS to 15 proprietary and open-source LLMs revealed a stark contrast between static benchmark performance and vulnerability under adversarial pressure. Despite a median MedQA accuracy exceeding 80\%, 94\% of previously correct answers failed our dynamic robustness tests. We observed similarly high failure rates across other domains: privacy leaks were elicited in 86\% of scenarios, cognitive-bias priming altered clinical recommendations in 81\% of fairness tests, and we identified hallucination rates exceeding 66\% in widely used models. Such profound residual risks are incompatible with routine clinical practice. By converting red-teaming from a static checklist into a dynamic stress-test audit, DAS red-teaming offers the surveillance that hospitals/regulators/technology vendors require as LLMs become embedded in patient chatbots, decision-support dashboards, and broader healthcare workflows. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.

Related papers

Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models [1.4008409814572673]
We introduce Prompt4Trust, the first reinforcement learning framework for prompt augmentation targeting confidence calibration in MLLMs.<n>Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making.<n>Our framework showed promising zero-shot generalizations to larger MLLMs in our experiments.
arXiv Detail & Related papers (2025-07-12T13:21:10Z)
Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation [69.63626052852153]
We propose a proof-of-concept framework that projects how model-generated advice could propagate through societal systems.<n>We also introduce a dataset of 100 indirect harm scenarios, testing models' ability to foresee adverse, non-obvious outcomes from seemingly harmless user prompts.
arXiv Detail & Related papers (2025-06-26T02:28:58Z)
Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding [59.50808215134678]
This study introduces Trust-videoLLMs, a first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs.<n>Results reveal significant limitations in dynamic scene comprehension, cross-modal resilience and real-world risk mitigation.
arXiv Detail & Related papers (2025-06-14T04:04:54Z)
CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs [7.597770587484936]
We introduce CARES (Clinical Adversarial Robustness and Evaluation of Safety), a benchmark for evaluating medical large language models (LLMs) safety in healthcare.<n> CARES includes over 18,000 prompts spanning eight medical safety principles, four harm levels, and four prompting styles to simulate both malicious and benign use cases.<n>Our analysis reveals that many state-of-the-art LLMs remain vulnerable to jailbreaks that subtly rephrase harmful prompts, while also over-refusing safe but atypically phrased queries.
arXiv Detail & Related papers (2025-05-16T16:25:51Z)
SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models [50.34706204154244]
Acquiring reasoning capabilities catastrophically degrades inherited safety alignment.<n>Certain scenarios suffer 25 times higher attack rates.<n>Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction.
arXiv Detail & Related papers (2025-04-09T06:53:23Z)
Benchmarking Chinese Medical LLMs: A Medbench-based Analysis of Performance Gaps and Hierarchical Optimization Strategies [11.0505830548286]
This study introduces a granular error taxonomy through systematic analysis of top 10 models on MedBench.<n> Evaluation of 10 leading models reveals vulnerabilities, despite achieving 0.86 accuracy in medical knowledge recall.<n>Our analysis uncovers systemic weaknesses in knowledge boundary enforcement and multi-step reasoning.
arXiv Detail & Related papers (2025-03-10T13:28:25Z)
Safeguarding AI in Medical Imaging: Post-Hoc Out-of-Distribution Detection with Normalizing Flows [3.3968168503957625]
Current OOD detection methods demand impractical retraining or modifications to pre-trained models.<n>We propose a post-hoc normalizing flow-based approach that seamlessly integrates with existing pre-trained models.<n>Our method achieved an AUROC of 84.61%, outperforming state-of-the-art methods like ViM (80.65%) and MDS (80.87%)
arXiv Detail & Related papers (2025-02-17T10:31:24Z)
Agent-SafetyBench: Evaluating the Safety of LLM Agents [72.92604341646691]
We introduce Agent-SafetyBench, a benchmark designed to evaluate the safety of large language models (LLMs)<n>Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions.<n>Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%.
arXiv Detail & Related papers (2024-12-19T02:35:15Z)
Inadequacy of common stochastic neural networks for reliable clinical decision support [0.4262974002462632]
Widespread adoption of AI for medical decision making is still hindered due to ethical and safety-related concerns. Common deep learning approaches, however, have the tendency towards overconfidence under data shift. This study investigates their actual reliability in clinical applications.
arXiv Detail & Related papers (2024-01-24T18:49:30Z)
ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection. We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance. We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.