SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence
- URL: http://arxiv.org/abs/2506.02555v1
- Date: Tue, 03 Jun 2025 07:44:41 GMT
- Title: SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence
- Authors: Zhitao Zeng, Zhu Zhuo, Xiaojun Jia, Erli Zhang, Junde Wu, Jiaan Zhang, Yuxuan Wang, Chang Han Low, Jian Jiang, Zilong Zheng, Xiaochun Cao, Yutong Ban, Qi Dou, Yang Liu, Yueming Jin
- Abstract summary: We propose SurgVLM, one of the first large vision-language foundation models for surgical intelligence. We construct a large-scale multimodal surgical database, SurgVLM-DB, spanning more than 16 surgical types and 18 anatomical structures. Building upon this comprehensive dataset, we propose SurgVLM, which is built upon Qwen2.5-VL and undergoes instruction tuning on 10+ surgical tasks.
- Score: 72.10889173696928
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Foundation models have achieved transformative success across biomedical domains by enabling holistic understanding of multimodal data. However, their application in surgery remains underexplored. Surgical intelligence presents unique challenges, requiring surgical visual perception, temporal analysis, and reasoning. Existing general-purpose vision-language models fail to address these needs due to insufficient domain-specific supervision and the lack of a large-scale, high-quality surgical database. To bridge this gap, we propose SurgVLM, one of the first large vision-language foundation models for surgical intelligence, where this single universal model can tackle versatile surgical tasks. To enable this, we construct a large-scale multimodal surgical database, SurgVLM-DB, comprising over 1.81 million frames with 7.79 million conversations, spanning more than 16 surgical types and 18 anatomical structures. We unify and reorganize 23 public datasets across 10 surgical tasks, followed by standardizing labels and performing hierarchical vision-language alignment to facilitate comprehensive coverage of progressively finer-grained surgical tasks, from visual perception and temporal analysis to high-level reasoning. Building upon this comprehensive dataset, we propose SurgVLM, which is built upon Qwen2.5-VL and undergoes instruction tuning on 10+ surgical tasks. We further construct a surgical multimodal benchmark, SurgVLM-Bench, for method evaluation. SurgVLM-Bench consists of 6 popular and widely used datasets in the surgical domain, covering several crucial downstream tasks. Based on SurgVLM-Bench, we evaluate the performance of our SurgVLM (three variants: SurgVLM-7B, SurgVLM-32B, and SurgVLM-72B) and conduct comprehensive comparisons with 14 mainstream commercial VLMs (e.g., GPT-4o, Gemini 2.0 Flash, Qwen2.5-Max).
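The abstract describes SurgVLM-DB as frames paired with task-specific conversations and SurgVLM-Bench as a set of downstream evaluation datasets, but it does not give a concrete data schema or evaluation protocol. The following is a minimal sketch, assuming a simple frame-question-answer record and an exact-match scoring loop; every field name (`frame_path`, `task`, `conversations`) and the `answer_fn` callback are hypothetical illustrations, not the authors' released code.

```python
# Hypothetical sketch only: the SurgVLM paper does not publish this schema.
# Field names and the answer_fn callback are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SurgicalConversation:
    frame_path: str                      # extracted surgical video frame
    task: str                            # e.g. "phase_recognition", "tool_recognition"
    conversations: List[Dict[str, str]] = field(default_factory=list)


# One record per frame-question pair, mirroring the abstract's description of
# frames with conversations covering perception, temporal, and reasoning tasks.
example = SurgicalConversation(
    frame_path="cholec80/video01/frame_000123.jpg",  # hypothetical path
    task="phase_recognition",
    conversations=[
        {"role": "user", "content": "Which surgical phase is shown in this frame?"},
        {"role": "assistant", "content": "Calot triangle dissection"},
    ],
)


def exact_match_accuracy(records: List[SurgicalConversation],
                         answer_fn: Callable[[str, str], str]) -> float:
    """Score a VLM, wrapped as answer_fn(frame_path, question) -> str, by exact match."""
    correct = 0
    for rec in records:
        question = rec.conversations[0]["content"]
        reference = rec.conversations[1]["content"]
        prediction = answer_fn(rec.frame_path, question)
        correct += int(prediction.strip().lower() == reference.strip().lower())
    return correct / max(len(records), 1)
```

In practice the benchmark presumably reports task-specific metrics (e.g., per-task accuracy or mAP) across its six datasets; this loop only illustrates the conversational data structure implied by the abstract.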
Related papers
- Surgery-R1: Advancing Surgical-VQLA with Reasoning Multimodal Large Language Model via Reinforcement Learning [9.858649381667695]
We propose the first Reasoning Multimodal Large Language Model for Surgical-VQLA (Surgery-R1). Surgery-R1 is inspired by the development of Reasoning Multimodal Large Language Models (MLLMs). Experimental results demonstrate that Surgery-R1 outperforms existing state-of-the-art (SOTA) models and widely used MLLMs on the Surgical-VQLA task.
arXiv Detail & Related papers (2025-06-24T09:53:10Z) - SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model [55.13206879750197]
SurgVidLM is the first video language model designed to address both full and fine-grained surgical video comprehension. We introduce the StageFocus mechanism, a two-stage framework performing multi-grained, progressive understanding of surgical videos. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs in both full and fine-grained video understanding tasks.
arXiv Detail & Related papers (2025-06-22T02:16:18Z) - SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis [20.566701996432226]
SurgBench is a unified surgical video benchmarking framework comprising a pretraining dataset, SurgBench-P, and an evaluation benchmark, SurgBench-E. SurgBench-P covers 53 million frames across 22 surgical procedures and 11 specialties, and SurgBench-E provides robust evaluation across six categories (phase classification, camera motion, tool recognition, disease diagnosis, action classification, and organ detection) spanning 72 fine-grained tasks.
arXiv Detail & Related papers (2025-06-09T10:02:58Z) - Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study [0.6120768859742071]
We present the first large-scale study assessing the capabilities of Vision Language Models (VLMs) for endoscopic tasks. Using a diverse set of state-of-the-art models, multiple surgical datasets, and extensive human reference annotations, we address three key research questions. Our results reveal that VLMs can effectively perform basic surgical perception tasks, such as object counting and localization, with performance levels comparable to general domain tasks.
arXiv Detail & Related papers (2025-06-06T16:53:12Z) - Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery: Challenges and opportunities [2.9212404280476267]
Vision-language models (VLMs) can be trained on large volumes of raw image-text pairs and exhibit strong adaptability. We conduct a benchmarking study of several popular VLMs across diverse laparoscopic datasets. Our findings reveal a mismatch between prediction accuracy and visual grounding, indicating that models may make correct predictions while focusing on irrelevant areas of the image.
arXiv Detail & Related papers (2025-05-16T00:42:18Z) - Recognize Any Surgical Object: Unleashing the Power of Weakly-Supervised Data [15.00025814170182]
RASO is a foundation model designed to Recognize Any Surgical Object. It generates tag-image-text pairs automatically from large-scale unannotated surgical lecture videos. Our scalable data generation pipeline gathers 2,200 surgical procedures and produces 3.6 million tag annotations.
arXiv Detail & Related papers (2025-01-25T21:01:52Z) - EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery [52.992415247012296]
We introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding. Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks.
arXiv Detail & Related papers (2025-01-20T09:12:06Z) - GP-VLS: A general-purpose vision language model for surgery [0.5249805590164902]
GP-VLS is a general-purpose vision language model for surgery.
It integrates medical and surgical knowledge with visual scene understanding.
We show GP-VLS significantly outperforms open- and closed-source models on surgical vision-language tasks.
arXiv Detail & Related papers (2024-07-27T17:27:05Z) - Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [50.09187683845788]
Recent advancements in surgical computer vision applications have been driven by vision-only models. These methods rely on manually annotated surgical videos to predict a fixed set of object categories. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals.
arXiv Detail & Related papers (2023-07-27T22:38:12Z) - Next-generation Surgical Navigation: Marker-less Multi-view 6DoF Pose Estimation of Surgical Instruments [64.59698930334012]
First, we present a multi-camera capture setup consisting of static and head-mounted cameras. Second, we publish a multi-view RGB-D video dataset of ex-vivo spine surgeries, captured in a surgical wet lab and a real operating theatre. Third, we evaluate three state-of-the-art single-view and multi-view methods for the task of 6DoF pose estimation of surgical instruments.
arXiv Detail & Related papers (2023-05-05T13:42:19Z) - CholecTriplet2021: A benchmark challenge for surgical action triplet recognition [66.51610049869393]
This paper presents CholecTriplet2021: an endoscopic vision challenge organized at MICCAI 2021 for the recognition of surgical action triplets in laparoscopic videos.
We present the challenge setup and assessment of the state-of-the-art deep learning methods proposed by the participants during the challenge.
A total of 4 baseline methods and 19 new deep learning algorithms are presented to recognize surgical action triplets directly from surgical videos, achieving mean average precision (mAP) ranging from 4.2% to 38.1%.
arXiv Detail & Related papers (2022-04-10T18:51:55Z)