Related papers: EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models

URL: http://arxiv.org/abs/2509.20146v1
Date: Wed, 24 Sep 2025 14:09:55 GMT
Title: EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models
Authors: Botai Yuan, Yutian Zhou, Yingjie Wang, Fushuo Huo, Yongcheng Jing, Li Shen, Ying Wei, Zhiqi Shen, Ziwei Liu, Tianwei Zhang, Jie Yang, Dacheng Tao
Abstract summary: Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy -- models' tendency to uncritically echo user-provided information. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs.
Score: 82.43729208063468
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy -- models' tendency to uncritically echo user-provided information -- in high-stakes clinical settings. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs. It contains 2,122 images across 18 departments and 20 modalities with 90 prompts that simulate biased inputs from patients, medical students, and physicians. We evaluate medical-specific, open-source, and proprietary LVLMs. All exhibit substantial sycophancy; the best proprietary model (Claude 3.7 Sonnet) still shows 45.98% sycophancy, and GPT-4.1 reaches 59.15%. Many medical-specific models exceed 95% sycophancy despite only moderate accuracy. Fine-grained analyses by bias type, department, perceptual granularity, and modality identify factors that increase susceptibility. We further show that higher data quality/diversity and stronger domain knowledge reduce sycophancy without harming unbiased accuracy. EchoBench also serves as a testbed for mitigation: simple prompt-level interventions (negative prompting, one-shot, few-shot) produce consistent reductions and motivate training- and decoding-time strategies. Our findings highlight the need for robust evaluation beyond accuracy and provide actionable guidance toward safer, more trustworthy medical LVLMs.

Related papers

Measuring Stability Beyond Accuracy in Small Open-Source Medical Large Language Models for Pediatric Endocrinology [34.80893325510028]
Small open-source medical large language models (LLMs) offer promising opportunities for low-resource deployment and broader accessibility. Using human evaluation and clinical review, we assess six small open-source medical LLMs.
arXiv Detail & Related papers (2025-12-26T14:30:53Z)
A DeepSeek-Powered AI System for Automated Chest Radiograph Interpretation in Clinical Practice [83.11942224668127]
Janus-Pro-CXR (1B) is a chest X-ray interpretation system based on the DeepSeek Janus-Pro model. Our system outperforms state-of-the-art X-ray report generation models in automated report generation.
arXiv Detail & Related papers (2025-12-23T13:26:13Z)
Shallow Robustness, Deep Vulnerabilities: Multi-Turn Evaluation of Medical LLMs [9.291589998223696]
We introduce MedQA-Followup, a framework for evaluating multi-turn robustness in medical question answering. Using controlled interventions on the MedQA dataset, we evaluate five state-of-the-art LLMs. We find that while models perform reasonably well under shallow perturbations, they exhibit severe vulnerabilities in multi-turn settings.
arXiv Detail & Related papers (2025-10-14T08:04:18Z)
Benchmarking and Mitigate Sycophancy in Medical Vision-Language Models [21.353225217216252]
Vision-language models often exhibit sycophantic behavior, prioritizing alignment with user phrasing, social cues, or perceived authority over evidence-based reasoning. This study evaluates clinical sycophancy in medical visual question answering through a novel, clinically grounded benchmark.
arXiv Detail & Related papers (2025-09-26T07:02:22Z)
Evaluating Large Language Models for Evidence-Based Clinical Question Answering [4.101088122511548]
Large Language Models (LLMs) have demonstrated substantial progress in biomedical and clinical applications. We curate a benchmark drawing from Cochrane systematic reviews and clinical guidelines. We observe consistent performance patterns across sources and clinical domains.
arXiv Detail & Related papers (2025-09-13T15:03:34Z)
PRECISE-AS: Personalized Reinforcement Learning for Efficient Point-of-Care Echocardiography in Aortic Stenosis Diagnosis [6.276251898178271]
Aortic stenosis (AS) is a life-threatening condition caused by a narrowing of the aortic valve, leading to impaired blood flow. Access to echocardiography (echo) is often limited due to resource constraints, particularly in rural and underserved areas. We propose a reinforcement learning (RL)-driven active video acquisition framework that dynamically selects each patient's most informative echo videos.
arXiv Detail & Related papers (2025-09-02T23:47:43Z)
Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models [57.73472878679636]
We introduce Med-RewardBench, the first benchmark specifically designed to evaluate medical reward models and judges. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions.
arXiv Detail & Related papers (2025-08-29T08:58:39Z)
MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine [69.08855631283829]
We introduce Med Omni-45 Degrees, a benchmark designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Results show a consistent safety-performance trade-off, with no model surpassing the diagonal.
arXiv Detail & Related papers (2025-08-22T08:38:16Z)
Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
Large language models (LLMs) are used in AI applications in healthcare. A red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses in four safety-critical domains. A suite of adversarial agents is applied to autonomously mutate test cases, identify/evolve unsafe-triggering strategies, and evaluate responses. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
arXiv Detail & Related papers (2025-07-30T08:44:22Z)
A Method for the Architecture of a Medical Vertical Large Language Model Based on Deepseek R1 [6.589206192038366]
This paper presents an efficient lightweight medical large language model architecture that addresses knowledge acquisition, model compression, and computational enhancement challenges. We design a knowledge transfer pipeline from DeepSeek-R1-Distill-70B to DeepSeek-R1-Distill-7B using Low-Rank Adaptation (LoRA) for precise medical knowledge retention. Our approach maintains 92.1% accuracy on USMLE while reducing memory consumption by 64.7% and latency by 12.4% compared to baseline inference models.
arXiv Detail & Related papers (2025-04-25T14:28:29Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
Detecting Bias and Enhancing Diagnostic Accuracy in Large Language Models for Healthcare [0.2302001830524133]
Biased AI-generated medical advice and misdiagnoses can jeopardize patient safety. This study introduces new resources designed to promote ethical and precise AI in healthcare.
arXiv Detail & Related papers (2024-10-09T06:00:05Z)
Self-supervised contrastive learning of echocardiogram videos enables label-efficient cardiac disease diagnosis [48.64462717254158]
We developed a self-supervised contrastive learning approach, EchoCLR, catered to echocardiogram videos. When fine-tuned on small portions of labeled data, EchoCLR pretraining significantly improved classification performance for left ventricular hypertrophy (LVH) and aortic stenosis (AS). EchoCLR is unique in its ability to learn representations of medical videos and demonstrates that SSL can enable label-efficient disease classification from small, labeled datasets.
arXiv Detail & Related papers (2022-07-23T19:17:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.