Related papers: EyePCR: A Comprehensive Benchmark for Fine-Grained Perception, Knowledge Comprehension and Clinical Reasoning in Ophthalmic Surgery

EyePCR: A Comprehensive Benchmark for Fine-Grained Perception, Knowledge Comprehension and Clinical Reasoning in Ophthalmic Surgery

URL: http://arxiv.org/abs/2509.15596v2
Date: Thu, 02 Oct 2025 12:55:24 GMT
Title: EyePCR: A Comprehensive Benchmark for Fine-Grained Perception, Knowledge Comprehension and Clinical Reasoning in Ophthalmic Surgery
Authors: Gui Wang, Yang Wennuo, Xusen Ma, Zehao Zhong, Zhuoru Wu, Ende Wu, Rong Qu, Wooi Ping Cheah, Jianfeng Ren, Linlin Shen,
Abstract summary: We develop textbfEyePCR, a large-scale benchmark for ophthalmic surgery analysis, grounded in structured clinical knowledge.<n>EyePCR offers a richly annotated corpus with more than 210k VQAs, which cover 1048 fine-grained attributes for multi-view perception.<n>Rich annotations facilitate in-depth cognitive analysis, simulating how surgeons perceive visual cues and combine them with domain knowledge to make decisions.
Score: 42.23133882924834
License: http://creativecommons.org/licenses/by/4.0/
Abstract: MLLMs (Multimodal Large Language Models) have showcased remarkable capabilities, but their performance in high-stakes, domain-specific scenarios like surgical settings, remains largely under-explored. To address this gap, we develop \textbf{EyePCR}, a large-scale benchmark for ophthalmic surgery analysis, grounded in structured clinical knowledge to evaluate cognition across \textit{Perception}, \textit{Comprehension} and \textit{Reasoning}. EyePCR offers a richly annotated corpus with more than 210k VQAs, which cover 1048 fine-grained attributes for multi-view perception, medical knowledge graph of more than 25k triplets for comprehension, and four clinically grounded reasoning tasks. The rich annotations facilitate in-depth cognitive analysis, simulating how surgeons perceive visual cues and combine them with domain knowledge to make decisions, thus greatly improving models' cognitive ability. In particular, \textbf{EyePCR-MLLM}, a domain-adapted variant of Qwen2.5-VL-7B, achieves the highest accuracy on MCQs for \textit{Perception} among compared models and outperforms open-source models in \textit{Comprehension} and \textit{Reasoning}, rivalling commercial models like GPT-4.1. EyePCR reveals the limitations of existing MLLMs in surgical cognition and lays the foundation for benchmarking and enhancing clinical reliability of surgical video understanding models.

Related papers

MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement [63.82954136824963]
Medical Vision-Language Models excel at perception tasks with complex clinical reasoning required in real-world scenarios.<n>We propose a novel reasoning MedVLM that addresses these challenges through domain-specific adaptation and guideline reinforcement.
arXiv Detail & Related papers (2026-01-16T02:32:07Z)
NEURO-GUARD: Neuro-Symbolic Generalization and Unbiased Adaptive Routing for Diagnostics -- Explainable Medical AI [0.6345042809319409]
We present NEURO-GUARD, a knowledge-guided vision framework that integrates Vision Transformers (ViTs) with language-driven reasoning to improve performance.<n> NEURO-GUARD employs a retrieval-augmented generation (RAG) mechanism for self-verification, in which a large language model (LLM) iteratively generates, evaluates, and refines feature-extraction code for medical images.<n>Experiments on diabetic retinopathy classification across four benchmark datasets demonstrate that NEURO-GUARD improves accuracy by 6.2% over a ViT-only baseline and achieves a 5% gain in domain generalization.
arXiv Detail & Related papers (2025-12-20T02:32:15Z)
Med-CMR: A Fine-Grained Benchmark Integrating Visual Evidence and Clinical Logic for Medical Complex Multimodal Reasoning [37.6854362777847]
We present Med-CMR, a fine-grained Medical Complex Multimodal benchmark.<n>Med-CMR distinguishes from existing counterparts by three core features.<n>We evaluate 18 state-of-the-art MLLMs with Med-CMR, revealing GPT-5 as the top-performing commercial model.
arXiv Detail & Related papers (2025-11-30T09:56:50Z)
Constructing Ophthalmic MLLM for Positioning-diagnosis Collaboration Through Clinical Cognitive Chain Reasoning [0.5360375691077625]
FundusExpert is an ophthalmology-specific MLLM with integrated positioning-diagnosis reasoning capabilities.<n>FundusGen is a dataset constructed through the intelligent Fundus-Engine system.
arXiv Detail & Related papers (2025-07-23T14:19:30Z)
Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Model (MedVLM) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning.<n>We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z)
Evaluating Large Language Models for Multimodal Simulated Ophthalmic Decision-Making in Diabetic Retinopathy and Glaucoma Screening [37.69303106863453]
Large language models (LLMs) can simulate clinical reasoning based on natural language prompts, but their utility in ophthalmology is largely unexplored.<n>This study evaluated GPT-4's ability to interpret structured textual descriptions of retinal fundus photographs.<n>We conducted a retrospective diagnostic validation study using 300 annotated fundus images.
arXiv Detail & Related papers (2025-07-02T01:35:59Z)
EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis [62.00431604976949]
EndoBench is the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice.<n>We benchmark 23 state-of-the-art models, including general-purpose, medical-specialized, and proprietary MLLMs.<n>Our experiments reveal: proprietary MLLMs outperform open-source and medical-specialized models overall, but still trail human experts.
arXiv Detail & Related papers (2025-05-29T16:14:34Z)
How Good is my Histopathology Vision-Language Foundation Model? A Holistic Benchmark [21.47220651857942]
Histopathology vision-language foundation models (VLMs) have gained popularity due to their enhanced performance and generalizability across different downstream tasks.<n>Most existing histopathology benchmarks are either unimodal or limited in terms of diversity of clinical tasks, organs, and acquisition instruments, as well as their partial availability to the public due to patient data privacy.<n>We introduce HistoVL, a fully open-source comprehensive benchmark comprising images acquired using up to 11 various acquisition tools and captions by incorporating class names and diverse pathology descriptions.
arXiv Detail & Related papers (2025-03-17T09:45:22Z)
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding [61.26026947423187]
Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features.<n>Current Multimodal Large Language Models (MLLMs) struggle to integrate reasoning into visual perception.<n>We propose DeepPerception, an MLLM enhanced with cognitive visual perception capabilities.
arXiv Detail & Related papers (2025-03-17T04:06:34Z)
Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval [61.70489848327436]
KARE is a novel framework that integrates knowledge graph (KG) community-level retrieval with large language models (LLMs) reasoning.<n>Extensive experiments demonstrate that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission predictions.
arXiv Detail & Related papers (2024-10-06T18:46:28Z)
Beyond the Hype: A dispassionate look at vision-language models in medical scenario [3.4299097748670255]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across diverse tasks.<n>Their performance and reliability in specialized domains such as medicine remain insufficiently assessed.<n>We introduce RadVUQA, a novel benchmark to comprehensively evaluate existing LVLMs.
arXiv Detail & Related papers (2024-08-16T12:32:44Z)
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals. GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date. It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
A Foundation Language-Image Model of the Retina (FLAIR): Encoding Expert Knowledge in Text Supervision [17.875098424936542]
We present FLAIR, a pre-trained vision-language model for universal retinal fundus image understanding.<n>We compiled 38 open-access, mostly categorical fundus imaging datasets from various sources.<n>We integrate the expert's domain knowledge in the form of descriptive textual prompts, during both pre-training and zero-shot inference.
arXiv Detail & Related papers (2023-08-15T17:39:52Z)
Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation [116.87918100031153]
We propose a Cross-modal clinical Graph Transformer (CGT) for ophthalmic report generation (ORG) CGT injects clinical relation triples into the visual features as prior knowledge to drive the decoding procedure. Experiments on the large-scale FFA-IR benchmark demonstrate that the proposed CGT is able to outperform previous benchmark methods.
arXiv Detail & Related papers (2022-06-04T13:16:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.