Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models
- URL: http://arxiv.org/abs/2601.06750v1
- Date: Sun, 11 Jan 2026 02:20:40 GMT
- Title: Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models
- Authors: Shaonan Liu, Guo Yu, Xiaoling Luo, Shiyi Zheng, Wenting Chen, Jie Liu, Linlin Shen,
- Abstract summary: We introduce MedGaze-Bench, the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation.<n>Our benchmark addresses three fundamental challenges: visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical, and implicit adherence to safety protocols.
- Score: 48.95516224614331
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical Multimodal Large Language Models (Med-MLLMs) require egocentric clinical intent understanding for real-world deployment, yet existing benchmarks fail to evaluate this critical capability. To address these challenges, we introduce MedGaze-Bench, the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation. Our benchmark addresses three fundamental challenges: visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical workflows, and implicit adherence to safety protocols. We propose a Three-Dimensional Clinical Intent Framework evaluating: (1) Spatial Intent: discriminating precise targets amid visual noise, (2) Temporal Intent: inferring causal rationale through retrospective and prospective reasoning, and (3) Standard Intent: verifying protocol compliance through safety checks. Beyond accuracy metrics, we introduce Trap QA mechanisms to stress-test clinical reliability by penalizing hallucinations and cognitive sycophancy. Experiments reveal current MLLMs struggle with egocentric intent due to over-reliance on global features, leading to fabricated observations and uncritical acceptance of invalid instructions.
Related papers
- ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels [39.33170904610862]
Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care.<n>We introduce ClinConsensus, a Chinese medical benchmark curated, validated and quality-controlled by clinical experts.<n> ClinConsensus comprises 2500 open-ended cases spanning the full continuum of care--from prevention and intervention to long-term follow-up--covering 36 medical specialties, 12 common clinical task types, and progressively increasing levels of complexity.
arXiv Detail & Related papers (2026-03-02T17:17:18Z) - M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding [66.78251988482222]
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning.<n>Current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path.<n>M3CoTBench aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare.
arXiv Detail & Related papers (2026-01-13T17:42:27Z) - Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs)<n>Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a (oral) examination in medical training.<n>Our analysis identified several failure modes that mirror common cognitive errors in clinical practice.
arXiv Detail & Related papers (2025-10-11T16:24:35Z) - Think as a Doctor: An Interpretable AI Approach for ICU Mortality Prediction [7.809857381429602]
We propose a novel ICU mortality prediction framework that delivers intrinsic interpretability while integrating all three elements of the ICU decision-making practices into its reasoning process.<n>ProtoDoctor features two key innovations: the Prognostic Clinical Course Identification module and the Demographic Heterogeneity Recognition module.<n>Extensive empirical evaluations demonstrate that ProtoDoctor outperforms state-of-the-art baselines in predictive accuracy.
arXiv Detail & Related papers (2025-10-11T14:57:07Z) - RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis [56.373297358647655]
Retrieval-Augmented Diagnosis (RAD) is a novel framework that injects external knowledge into multimodal models directly on downstream tasks.<n>RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss transformer, and a dual decoder.
arXiv Detail & Related papers (2025-09-24T10:36:14Z) - ControlMed: Adding Reasoning Control to Medical Language Model [1.0207955314209531]
Reasoning Large Language Models (LLMs) with enhanced accuracy and explainability are increasingly being adopted in the medical domain.<n>Existing reasoning LLMs often generate unnecessarily lengthy reasoning processes, leading to significant computational overhead and response latency.<n>We introduce textbfControlMed, a medical language model that enables users to actively control the length of the reasoning process at inference time.
arXiv Detail & Related papers (2025-07-30T10:17:07Z) - Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Model (MedVLM) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning.<n>We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z) - AGIR: Assessing 3D Gait Impairment with Reasoning based on LLMs [0.0]
gait impairment plays an important role in early diagnosis, disease monitoring, and treatment evaluation for neurodegenerative diseases.<n>Recent deep learning-based approaches have consistently improved classification accuracies, but they often lack interpretability.<n>We introduce AGIR, a novel pipeline consisting of a pre-trained VQ-VAE motion tokenizer and a Large Language Model (LLM) fine-tuned over pairs of motion tokens.
arXiv Detail & Related papers (2025-03-23T17:12:16Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.<n>We propose a framework encompassing three critical examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.<n>Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking, etc.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.