Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study
- URL: http://arxiv.org/abs/2506.06232v2
- Date: Tue, 08 Jul 2025 07:40:27 GMT
- Title: Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study
- Authors: Leon Mayer, Tim Rädsch, Dominik Michael, Lucas Luttner, Amine Yamlahi, Evangelia Christodoulou, Patrick Godau, Marcel Knopp, Annika Reinke, Fiona Kolbinger, Lena Maier-Hein
- Abstract summary: We present the first large-scale study assessing the capabilities of Vision Language Models (VLMs) for endoscopic tasks. Using a diverse set of state-of-the-art models, multiple surgical datasets, and extensive human reference annotations, we address three key research questions. Our results reveal that VLMs can effectively perform basic surgical perception tasks, such as object counting and localization, with performance levels comparable to general domain tasks.
- Score: 0.6120768859742071
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While traditional computer vision models have historically struggled to generalize to endoscopic domains, the emergence of foundation models has shown promising cross-domain performance. In this work, we present the first large-scale study assessing the capabilities of Vision Language Models (VLMs) for endoscopic tasks with a specific focus on laparoscopic surgery. Using a diverse set of state-of-the-art models, multiple surgical datasets, and extensive human reference annotations, we address three key research questions: (1) Can current VLMs solve basic perception tasks on surgical images? (2) Can they handle advanced frame-based endoscopic scene understanding tasks? and (3) How do specialized medical VLMs compare to generalist models in this context? Our results reveal that VLMs can effectively perform basic surgical perception tasks, such as object counting and localization, with performance levels comparable to general domain tasks. However, their performance deteriorates significantly when the tasks require medical knowledge. Notably, we find that specialized medical VLMs currently underperform compared to generalist models across both basic and advanced surgical tasks, suggesting that they are not yet optimized for the complexity of surgical environments. These findings highlight the need for further advancements to enable VLMs to handle the unique challenges posed by surgery. Overall, our work provides important insights for the development of next-generation endoscopic AI systems and identifies key areas for improvement in medical visual language models.
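To make the basic-perception benchmarking concrete, below is a minimal sketch of how a counting query could be scored against human reference annotations. The `query_vlm` wrapper, the prompt wording, and the sample format are hypothetical stand-ins, not the study's actual evaluation harness.

```python
# Minimal sketch of a counting-task evaluation loop, assuming a
# hypothetical query_vlm(image, prompt) -> str wrapper around any VLM.
import re

PROMPT = "How many surgical instruments are visible in this image? Answer with a single number."

def parse_count(answer: str) -> int | None:
    """Extract the first integer from a free-form VLM answer."""
    match = re.search(r"\d+", answer)
    return int(match.group()) if match else None

def counting_accuracy(samples, query_vlm) -> float:
    """Exact-match accuracy of predicted vs. human-annotated instrument counts."""
    hits = 0
    for image, reference_count in samples:  # (frame, human annotation) pairs
        predicted = parse_count(query_vlm(image, PROMPT))
        hits += int(predicted == reference_count)
    return hits / len(samples)
```

The same loop generalizes to other basic perception tasks by swapping the prompt and the answer parser.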
Related papers
- Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation. MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
arXiv Detail & Related papers (2025-08-07T03:41:41Z)
- SurgVisAgent: Multimodal Agentic Model for Versatile Surgical Visual Enhancement [8.337819078911405]
SurgVisAgent is an end-to-end intelligent surgical vision agent built on multimodal large language models (MLLMs). It dynamically identifies distortion categories and severity levels in endoscopic images, enabling it to perform a variety of enhancement tasks. We construct a benchmark simulating real-world surgical distortions, on which extensive experiments demonstrate that SurgVisAgent surpasses traditional single-task models.
arXiv Detail & Related papers (2025-07-03T03:00:26Z)
- SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model [55.13206879750197]
SurgVidLM is the first video language model designed to address both full and fine-grained surgical video comprehension. We introduce the StageFocus mechanism, a two-stage framework performing multi-grained, progressive understanding of surgical videos. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs in both full and fine-grained video understanding tasks.
arXiv Detail & Related papers (2025-06-22T02:16:18Z)
- Conquering the Retina: Bringing Visual in-Context Learning to OCT [5.012883033803268]
In this work, we explore how to train generalist models for the domain of retinal optical coherence tomography using visual in-context learning (VICL). We extensively evaluate a state-of-the-art medical VICL approach on multiple retinal OCT datasets, establishing a first baseline to highlight the potential and current limitations of in-context learning for OCT.
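As a rough sketch of the in-context setup evaluated here, a VICL model predicts a segmentation for a query scan conditioned on a handful of annotated support pairs. The `vicl_model` callable below is hypothetical, and support selection by embedding similarity is one common heuristic, not necessarily the paper's.

```python
import numpy as np

def select_support(query_feat, candidate_feats, k=4):
    """Pick the k most similar annotated scans (cosine similarity over
    precomputed embeddings) to serve as in-context examples."""
    q = query_feat / np.linalg.norm(query_feat)
    c = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    return np.argsort(c @ q)[::-1][:k]

def segment_in_context(vicl_model, images, masks, support_idx, query_image):
    """One VICL forward pass: (image, mask) support pairs plus the query.
    vicl_model is a hypothetical callable, not the evaluated method's API."""
    context = [(images[i], masks[i]) for i in support_idx]
    return vicl_model(context=context, query=query_image)
```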
arXiv Detail & Related papers (2025-06-18T07:28:47Z)
- SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence [72.10889173696928]
We propose SurgVLM, one of the first large vision-language foundation models for surgical intelligence. We construct a large-scale multimodal surgical database, SurgVLM-DB, spanning more than 16 surgical types and 18 anatomical structures. Building upon this comprehensive dataset, we propose SurgVLM, which is built upon Qwen2.5-VL and undergoes instruction tuning on 10+ surgical tasks.
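For context, instruction tuning a base model such as Qwen2.5-VL typically consumes records in a multimodal conversation format; the example below is a hypothetical illustration following widespread convention, not SurgVLM-DB's published schema.

```python
# Hypothetical multimodal instruction-tuning record in the common
# conversation format; field names are illustrative, not SurgVLM-DB's schema.
record = {
    "image": "cholecystectomy/frame_01423.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhich surgical phase is shown?"},
        {"from": "gpt", "value": "Calot triangle dissection."},
    ],
}
```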
arXiv Detail & Related papers (2025-06-03T07:44:41Z)
- Surgical Foundation Model Leveraging Compression and Entropy Maximization for Image-Guided Surgical Assistance [50.486523249499115]
Real-time video understanding is critical for guiding procedures in minimally invasive surgery (MIS). We propose Compress-to-Explore (C2E), a novel self-supervised framework to learn compact, informative representations from surgical videos. C2E uses entropy-maximizing decoders to compress images while preserving clinically relevant details, improving encoder performance without labeled data.
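The abstract states only the high-level objective; one plausible reading of an entropy-maximizing decoder objective is a reconstruction loss augmented with an entropy bonus, sketched below in PyTorch. This illustrates the general technique, not the exact C2E loss.

```python
import torch
import torch.nn.functional as F

def entropy_regularized_loss(logits, target, beta=0.1):
    """Reconstruction loss minus an entropy bonus on the decoder output.

    logits: (B, C, H, W) per-pixel class logits from the decoder
    target: (B, H, W) integer reconstruction targets
    beta:   weight of the entropy-maximization term
    """
    recon = F.cross_entropy(logits, target)
    probs = logits.softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
    return recon - beta * entropy  # maximizing entropy = subtracting it
```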
arXiv Detail & Related papers (2025-05-16T14:02:24Z)
- Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery: Challenges and opportunities [2.9212404280476267]
Vision-language models (VLMs) can be trained on large volumes of raw image-text pairs and exhibit strong adaptability. We conduct a benchmarking study of several popular VLMs across diverse laparoscopic datasets. Our findings reveal a mismatch between prediction accuracy and visual grounding, indicating that models may make correct predictions while focusing on irrelevant areas of the image.
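A standard way to quantify such an accuracy-versus-grounding mismatch is a pointing-game check: a prediction counts as grounded only if the model's saliency peak falls inside the annotated region. The sketch below assumes saliency maps and binary masks are precomputed; it is a generic metric, not necessarily this paper's protocol.

```python
import numpy as np

def pointing_game_hit(saliency: np.ndarray, mask: np.ndarray) -> bool:
    """True if the peak of the model's saliency map lies inside the
    ground-truth region (both arrays share the same HxW shape)."""
    peak = np.unravel_index(np.argmax(saliency), saliency.shape)
    return bool(mask[peak])

def grounding_rate(saliencies, masks) -> float:
    """Fraction of samples whose saliency peak is correctly grounded;
    comparing this rate to answer accuracy exposes the mismatch."""
    hits = [pointing_game_hit(s, m) for s, m in zip(saliencies, masks)]
    return sum(hits) / len(hits)
```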
arXiv Detail & Related papers (2025-05-16T00:42:18Z)
- EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model [51.66031028717933]
Medical Large Vision-Language Models (Med-LVLMs) demonstrate significant potential in healthcare. Currently, intelligent ophthalmic diagnosis faces three major challenges: (i) Data; (ii) Benchmark; and (iii) Model. We propose the Eyecare Kit, which tackles these three key challenges with a tailored dataset, benchmark, and model.
arXiv Detail & Related papers (2025-04-18T12:09:15Z)
- Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence [1.1765603103920352]
Large Vision-Language Models offer a new paradigm for AI-driven image understanding. This flexibility holds particular promise across medicine, where expert-annotated data is scarce. Here, we present a comprehensive analysis of 11 state-of-the-art VLMs across 17 key visual understanding tasks in surgical AI.
arXiv Detail & Related papers (2025-04-03T17:42:56Z)
- Guiding Medical Vision-Language Models with Explicit Visual Prompts: Framework Design and Comprehensive Exploration of Prompt Variations [15.052986179046076]
We introduce MedVP, a pioneering framework that integrates medical entity extraction, visual prompt generation, and dataset adaptation for visual-prompt-guided fine-tuning. We successfully outperform recent state-of-the-art large models across multiple medical VQA datasets.
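To illustrate what visual prompt generation can look like in practice, the snippet below overlays a labeled bounding box for an extracted entity onto the image before it is sent to the VLM. The coordinates, label, and filename are illustrative, not MedVP's actual pipeline.

```python
from PIL import Image, ImageDraw

def add_visual_prompt(image: Image.Image, box, label: str) -> Image.Image:
    """Draw a labeled bounding box (x0, y0, x1, y1) onto a copy of the image,
    turning an extracted entity region into an explicit visual prompt."""
    prompted = image.copy()
    draw = ImageDraw.Draw(prompted)
    draw.rectangle(box, outline="red", width=3)
    draw.text((box[0], max(0, box[1] - 12)), label, fill="red")
    return prompted

# Usage: highlight the region tied to the entity "lesion" before querying the VLM.
# prompted = add_visual_prompt(Image.open("scan.png"), (40, 60, 180, 200), "lesion")
```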
arXiv Detail & Related papers (2025-01-04T21:23:36Z)
- Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models [1.4042211166197214]
We introduce an LVLM specifically designed for surgical scenarios.
We establish an LVLM, Surgical-LLaVA, fine-tuned on instruction-following data from surgical scenarios.
Experiments demonstrate that Surgical-LLaVA exhibits impressive multi-modal chat abilities in surgical contexts.
arXiv Detail & Related papers (2024-10-13T07:12:35Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- Artificial General Intelligence for Medical Imaging Analysis [92.3940918983821]
Large-scale Artificial General Intelligence (AGI) models have achieved unprecedented success in a variety of general domain tasks.
These models face notable challenges arising from the medical field's inherent complexities and unique characteristics.
This review aims to offer insights into the future implications of AGI in medical imaging, healthcare, and beyond.
arXiv Detail & Related papers (2023-06-08T18:04:13Z)