SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model
- URL: http://arxiv.org/abs/2506.17873v1
- Date: Sun, 22 Jun 2025 02:16:18 GMT
- Title: SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model
- Authors: Guankun Wang, Wenjin Mo, Junyi Wang, Long Bai, Kun Yuan, Ming Hu, Jinlin Wu, Junjun He, Yiming Huang, Nicolas Padoy, Zhen Lei, Hongbin Liu, Nassir Navab, Hongliang Ren
- Abstract summary: SurgVidLM is the first video language model designed to address both full and fine-grained surgical video comprehension. We introduce the StageFocus mechanism, a two-stage framework that performs multi-grained, progressive understanding of surgical videos. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs in both full and fine-grained video understanding tasks.
- Score: 55.13206879750197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in Multimodal Large Language Models have demonstrated great potential in the medical domain, helping users understand surgical scenes and procedures. Beyond image-based methods, the exploration of Video Large Language Models (Vid-LLMs) has emerged as a promising avenue for capturing the complex sequences of information involved in surgery. However, there is still a lack of Vid-LLMs specialized for fine-grained surgical video understanding tasks, which is crucial for analyzing specific processes or details within a surgical procedure. To bridge this gap, we propose SurgVidLM, the first video language model designed to address both full and fine-grained surgical video comprehension. To train our SurgVidLM, we construct the SVU-31K dataset, which consists of over 31K video-instruction pairs, enabling both holistic understanding and detailed analysis of surgical procedures. Furthermore, we introduce the StageFocus mechanism, a two-stage framework that performs multi-grained, progressive understanding of surgical videos. We also develop Multi-frequency Fusion Attention to effectively integrate low- and high-frequency visual tokens, ensuring the retention of critical information. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs in both full and fine-grained video understanding tasks, showcasing its superior capability in capturing complex procedural contexts.
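The abstract describes Multi-frequency Fusion Attention only at a high level. As a purely illustrative sketch in PyTorch (the module name, tensor shapes, and the cross-attention fusion scheme below are assumptions, not the authors' implementation), one way to fuse sparsely sampled full-video tokens with densely sampled clip tokens is:

```python
# Hypothetical sketch of a multi-frequency fusion idea: dense (high-frequency)
# tokens from the focused clip attend to sparse (low-frequency) tokens that
# summarize the whole procedure. All names and shapes are illustrative.
import torch
import torch.nn as nn


class MultiFrequencyFusionAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Cross-attention: high-frequency tokens query the low-frequency context.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, high_freq: torch.Tensor, low_freq: torch.Tensor) -> torch.Tensor:
        # high_freq: (B, N_hf, dim) densely sampled tokens from the segment of interest
        # low_freq:  (B, N_lf, dim) sparsely sampled tokens covering the full video
        fused, _ = self.cross_attn(query=high_freq, key=low_freq, value=low_freq)
        # Residual connection keeps fine-grained detail while injecting global context.
        return self.norm(high_freq + fused)


if __name__ == "__main__":
    mffa = MultiFrequencyFusionAttention()
    hf = torch.randn(2, 256, 768)   # dense clip tokens
    lf = torch.randn(2, 64, 768)    # sparse full-video tokens
    print(mffa(hf, lf).shape)       # torch.Size([2, 256, 768])
```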
Related papers
- Surgical Foundation Model Leveraging Compression and Entropy Maximization for Image-Guided Surgical Assistance [50.486523249499115]
Real-time video understanding is critical to guide procedures in minimally invasive surgery (MIS). We propose Compress-to-Explore (C2E), a novel self-supervised framework to learn compact, informative representations from surgical videos. C2E uses entropy-maximizing decoders to compress images while preserving clinically relevant details, improving encoder performance without labeled data.
arXiv Detail & Related papers (2025-05-16T14:02:24Z) - Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding [1.024113475677323]
The lack of datasets hinders the development of accurate and comprehensive workflow analysis solutions. We introduce a novel approach for addressing the sparsity and heterogeneity of data, inspired by the human learning procedure of watching experts and understanding their explanations. We present the first comprehensive solution for dense video captioning (DVC) of surgical videos, addressing this task despite the absence of existing datasets in the surgical domain.
arXiv Detail & Related papers (2025-03-14T13:36:13Z) - EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery [52.992415247012296]
We introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding. Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks.
arXiv Detail & Related papers (2025-01-20T09:12:06Z) - OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining [60.75854609803651]
OphCLIP is a hierarchical retrieval-augmented vision-language pretraining framework for ophthalmic surgical workflow understanding. It learns both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles. OphCLIP also incorporates retrieval-augmented pretraining to leverage the underexplored large-scale silent surgical procedure videos.
arXiv Detail & Related papers (2024-11-23T02:53:08Z) - Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models [1.4042211166197214]
We introduce an LVLM specifically designed for surgical scenarios.
We establish an LVLM, Surgical-LLaVA, fine-tuned on instruction-following data from surgical scenarios.
Experiments demonstrate that Surgical-LLaVA exhibits impressive multi-modal chat abilities in surgical contexts.
arXiv Detail & Related papers (2024-10-13T07:12:35Z) - Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data. We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z) - LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning [15.646322352232819]
We create a new dataset, Surg-QA, consisting of 102,000 surgical video-instruction pairs.
We propose a novel two-stage question-answer generation pipeline with an LLM to learn surgical knowledge.
We train LLaVA-Surg, a novel vision-language conversational assistant capable of answering open-ended questions about surgical videos.
arXiv Detail & Related papers (2024-08-15T07:00:20Z) - Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [50.09187683845788]
Recent advancements in surgical computer vision applications have been driven by vision-only models. These methods rely on manually annotated surgical videos to predict a fixed set of object categories. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals.
arXiv Detail & Related papers (2023-07-27T22:38:12Z)