Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery: Challenges and opportunities
- URL: http://arxiv.org/abs/2505.10764v2
- Date: Tue, 27 May 2025 23:19:43 GMT
- Title: Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery: Challenges and opportunities
- Authors: Jiajun Cheng, Xianwu Zhao, Shan Lin
- Abstract summary: Vision-language models (VLMs) can be trained on large volumes of raw image-text pairs and exhibit strong adaptability. We conduct a benchmarking study of several popular VLMs across diverse laparoscopic datasets. Our findings reveal a mismatch between prediction accuracy and visual grounding, indicating that models may make correct predictions while focusing on irrelevant areas of the image.
- Score: 2.9212404280476267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Minimally invasive surgery (MIS) presents significant visual challenges, including a limited field of view, specular reflections, and inconsistent lighting conditions due to the small incision and the use of endoscopes. Over the past decade, many machine learning and deep learning models have been developed to identify and detect instruments and anatomical structures in surgical videos. However, these models are typically trained on manually labeled, procedure- and task-specific datasets that are relatively small, resulting in limited generalization to unseen data. In practice, hospitals generate a massive amount of raw surgical data every day, including videos captured during various procedures. Labeling this data is almost impractical, as it requires highly specialized expertise. The recent success of vision-language models (VLMs), which can be trained on large volumes of raw image-text pairs and exhibit strong adaptability, offers a promising alternative for leveraging unlabeled surgical data. While some existing work has explored applying VLMs to surgical tasks, their performance remains limited. To support future research in developing more effective VLMs for surgical applications, this paper aims to answer a key question: How well do existing VLMs, both general-purpose and surgery-specific, perform on surgical data, and what types of scenes do they struggle with? To address this, we conduct a benchmarking study of several popular VLMs across diverse laparoscopic datasets. Specifically, we visualize the model's attention to identify which regions of the image it focuses on when making predictions for surgical tasks. We also propose a metric to evaluate whether the model attends to task-relevant regions. Our findings reveal a mismatch between prediction accuracy and visual grounding, indicating that models may make correct predictions while focusing on irrelevant areas of the image.
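The abstract leaves the proposed grounding metric unspecified; a minimal sketch of one plausible formulation, measuring how many of a model's most-attended pixels fall inside a task-relevant mask (e.g., an instrument segmentation), is shown below. The function name, the top-k formulation, and the toy data are assumptions for illustration, not the authors' exact metric.

```python
import numpy as np

def attention_relevance_score(attention: np.ndarray,
                              relevant_mask: np.ndarray,
                              top_fraction: float = 0.1) -> float:
    """Fraction of the most-attended pixels that fall inside the
    task-relevant region (hypothetical metric, not the paper's)."""
    assert attention.shape == relevant_mask.shape
    flat = attention.ravel()
    k = max(1, int(top_fraction * flat.size))
    top_idx = np.argpartition(flat, -k)[-k:]   # indices of top-k attention values
    hits = relevant_mask.ravel()[top_idx].sum()
    return float(hits) / k

# Unfocused (random) attention against a mask covering one image quadrant
rng = np.random.default_rng(0)
attention = rng.random((224, 224))
mask = np.zeros((224, 224), dtype=np.uint8)
mask[:112, :112] = 1
print(attention_relevance_score(attention, mask))  # ~0.25, i.e., chance level
```

A score near the mask's area fraction indicates chance-level grounding, which is one way the accuracy-grounding mismatch described above could be quantified.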
Related papers
- Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study [0.6120768859742071]
We present the first large-scale study assessing the capabilities of Vision Language Models (VLMs) for endoscopic tasks. Using a diverse set of state-of-the-art models, multiple surgical datasets, and extensive human reference annotations, we address three key research questions. Our results reveal that VLMs can effectively perform basic surgical perception tasks, such as object counting and localization, with performance levels comparable to general domain tasks.
arXiv Detail & Related papers (2025-06-06T16:53:12Z)
- Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence [1.1765603103920352]
Large Vision-Language Models offer a new paradigm for AI-driven image understanding. This flexibility holds particular promise across medicine, where expert-annotated data is scarce. Here, we present a comprehensive analysis of 11 state-of-the-art VLMs across 17 key visual understanding tasks in surgical AI.
arXiv Detail & Related papers (2025-04-03T17:42:56Z)
- Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook [85.43403500874889]
Retrieval-augmented generation (RAG) has emerged as a pivotal technique in artificial intelligence (AI). This survey reviews recent advancements in RAG for embodied AI, with a particular focus on applications in planning, task execution, multimodal perception, interaction, and specialized domains.
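As background on the technique, a minimal RAG loop embeds the query, retrieves the most similar passages, and conditions the generator on them. The sketch below uses toy hash-based embeddings; the `embed` placeholder and corpus contents are assumptions, not any specific system covered by the survey.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: a real system would call a learned text-embedding model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.random(64)
    return v / np.linalg.norm(v)

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    scores = [float(q @ embed(doc)) for doc in corpus]  # cosine similarity of unit vectors
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

corpus = ["Trocar placement steps ...",
          "Specular highlight removal ...",
          "Laparoscopic camera calibration ..."]
context = retrieve("How are trocars placed?", corpus)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: How are trocars placed?"
# A generator (LLM/VLM) would now be conditioned on `prompt`.
```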
arXiv Detail & Related papers (2025-03-23T10:33:28Z)
- A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning [67.72413262980272]
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. We develop SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck. Our approach achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations.
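The abstract does not define the bottleneck; one generic way to induce object-centric structure is to force patch features through a small set of learnable prototypes via soft assignment, as in the hedged PyTorch sketch below. This is an illustration of the general idea, not SlotMIM's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticBottleneck(nn.Module):
    """Soft-assigns each patch feature to K learnable prototypes and
    reconstructs it from them, discarding patch-specific detail."""
    def __init__(self, dim: int = 256, num_prototypes: int = 16):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim), e.g., ViT patch tokens
        sim = F.normalize(x, dim=-1) @ F.normalize(self.prototypes, dim=-1).T
        assign = sim.softmax(dim=-1)     # (B, N, K) soft assignment weights
        return assign @ self.prototypes  # features rebuilt from prototypes

out = SemanticBottleneck()(torch.randn(2, 196, 256))
print(out.shape)  # torch.Size([2, 196, 256])
```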
arXiv Detail & Related papers (2025-03-10T06:18:31Z)
- VACT: A Video Automatic Causal Testing System and a Benchmark [55.53300306960048]
VACT is an **automated** framework for modeling, evaluating, and measuring the causal understanding of video generation models (VGMs) in real-world scenarios. We introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs.
arXiv Detail & Related papers (2025-03-08T10:54:42Z)
- EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery [52.992415247012296]
We introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding. Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks.
arXiv Detail & Related papers (2025-01-20T09:12:06Z)
- Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models [1.4042211166197214]
We introduce a large vision-language model (LVLM) specifically designed for surgical scenarios.
We establish Surgical-LLaVA, an LVLM fine-tuned on instruction-following data from surgical scenarios.
Experiments demonstrate that Surgical-LLaVA exhibits impressive multi-modal chat abilities in surgical contexts.
arXiv Detail & Related papers (2024-10-13T07:12:35Z)
- Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data. We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z)
- ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models [55.07988373824348]
We study the visual generalization capabilities of three existing robotic foundation models. Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios. We propose a gradual backbone reversal approach founded on model merging.
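Model merging is commonly implemented as parameter-wise linear interpolation between checkpoints; the sketch below illustrates a gradual reversal schedule in that spirit. The schedule, function name, and toy backbone are assumptions, not ReVLA's exact procedure.

```python
import torch
import torch.nn as nn

def merge_state_dicts(original: dict, finetuned: dict, alpha: float) -> dict:
    """alpha=0 keeps the fine-tuned weights; alpha=1 fully reverts
    to the original (pre-trained) visual backbone."""
    return {k: (1 - alpha) * finetuned[k] + alpha * original[k]
            for k in finetuned}

backbone = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
pretrained = {k: v.clone() for k, v in backbone.state_dict().items()}
# ... fine-tuning would modify the backbone's weights here ...
finetuned = {k: v.clone() for k, v in backbone.state_dict().items()}

# Gradually revert toward the pre-trained backbone over training steps.
for alpha in (0.25, 0.5, 0.75, 1.0):
    backbone.load_state_dict(merge_state_dicts(pretrained, finetuned, alpha))
```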
arXiv Detail & Related papers (2024-09-23T17:47:59Z)
- ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling [41.30327565949726]
We introduce ORacle, an advanced vision-language model designed for holistic OR domain modeling.
It incorporates multi-view and temporal capabilities and can leverage external knowledge during inference, enabling it to adapt to previously unseen surgical scenarios.
In rigorous testing on scene graph generation and downstream tasks on the 4D-OR dataset, ORacle not only demonstrates state-of-the-art performance but does so while requiring less data than existing models.
arXiv Detail & Related papers (2024-04-10T14:24:10Z)
- Pixel-Wise Recognition for Holistic Surgical Scene Understanding [33.40319680006502]
This paper presents the Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies dataset. Our benchmark models surgical scene understanding as a hierarchy of complementary tasks with varying levels of granularity. To exploit our proposed benchmark, we introduce the Transformers for Actions, Phases, Steps, and Instrument Segmentation (TAPIS) model.
arXiv Detail & Related papers (2024-01-20T09:09:52Z)
- Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [50.09187683845788]
Recent advancements in surgical computer vision applications have been driven by vision-only models. These methods rely on manually annotated surgical videos to predict a fixed set of object categories. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals.
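Such paired clips and narrations are typically exploited with a symmetric contrastive (InfoNCE) objective; below is a generic CLIP-style sketch in PyTorch, an assumed illustration of the training signal rather than the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(video_emb: torch.Tensor,
                    text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (clip, narration) embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature      # (B, B) similarity matrix
    targets = torch.arange(v.size(0))   # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

video = torch.randn(8, 512, requires_grad=True)  # stand-in encoder outputs
text = torch.randn(8, 512, requires_grad=True)
loss = clip_style_loss(video, text)
loss.backward()  # gradients would flow to both encoders in a real pipeline
```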
arXiv Detail & Related papers (2023-07-27T22:38:12Z)
- Dissecting Self-Supervised Learning Methods for Surgical Computer Vision [51.370873913181605]
Self-Supervised Learning (SSL) methods have begun to gain traction in the general computer vision community.
The effectiveness of SSL methods in more complex and impactful domains, such as medicine and surgery, remains limited and unexplored.
We present an extensive analysis of the performance of these methods on the Cholec80 dataset for two fundamental and popular tasks in surgical context understanding: phase recognition and tool presence detection.
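A standard protocol for comparing frozen SSL representations on tasks like phase recognition is a linear probe; the sketch below is a generic illustration with synthetic features, not the paper's exact setup (Cholec80 does define 7 surgical phases).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Stand-ins for frozen SSL frame features and their phase labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((1000, 512)), rng.integers(0, 7, 1000)
X_test, y_test = rng.random((200, 512)), rng.integers(0, 7, 200)

# Linear probe: the backbone stays frozen; only this classifier is trained.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("macro-F1:", f1_score(y_test, probe.predict(X_test), average="macro"))
```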
arXiv Detail & Related papers (2022-07-01T14:17:11Z)
- Surgical Visual Domain Adaptation: Results from the MICCAI 2020 SurgVisDom Challenge [9.986124942784969]
This work seeks to explore the potential for visual domain adaptation in surgery to overcome data privacy concerns.
In particular, we propose to use video from virtual reality (VR) simulations of surgical exercises to develop algorithms to recognize tasks in a clinical-like setting.
We present the performance of the different approaches to solve visual domain adaptation developed by challenge participants.
arXiv Detail & Related papers (2021-02-26T18:45:28Z)
- Machine learning-based clinical prediction modeling -- A practical guide for clinicians [0.0]
The number of manuscripts related to machine learning or artificial intelligence has increased exponentially in recent years.
In the first section, we provide explanations on the general principles of machine learning.
In further sections, we review the importance of resampling, overfitting, and model generalizability, as well as strategies for model evaluation.
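To make the resampling point concrete, here is a hedged scikit-learn example of 5-fold cross-validation on toy tabular data (not taken from the guide itself).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy clinical-style data: 200 patients, 10 features, binary outcome.
rng = np.random.default_rng(0)
X, y = rng.random((200, 10)), rng.integers(0, 2, 200)

# 5-fold cross-validation estimates out-of-sample performance and
# guards against the overfitting the guide warns about.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(f"accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```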
arXiv Detail & Related papers (2020-06-23T20:11:37Z)