Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures
- URL: http://arxiv.org/abs/2307.15220v4
- Date: Thu, 27 Mar 2025 15:37:03 GMT
- Title: Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures
- Authors: Kun Yuan, Vinkle Srivastav, Tong Yu, Joel L. Lavanchy, Jacques Marescaux, Pietro Mascagni, Nassir Navab, Nicolas Padoy,
- Abstract summary: Recent advancements in surgical computer vision applications have been driven by vision-only models.<n>These methods rely on manually annotated surgical videos to predict a fixed set of object categories.<n>In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals.
- Score: 50.09187683845788
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in surgical computer vision applications have been driven by vision-only models, which do not explicitly integrate the rich semantics of language into their design. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP - Surgical Vision Language Pre-training, for multi-modal representation learning. Extensive experiments across diverse surgical procedures and tasks demonstrate that the multi-modal representations learned by SurgVLP exhibit strong transferability and adaptability in surgical video analysis. Furthermore, our zero-shot evaluations highlight SurgVLP's potential as a general-purpose foundation model for surgical workflow analysis, reducing the reliance on extensive manual annotations for downstream tasks, and facilitating adaptation methods such as few-shot learning to build a scalable and data-efficient solution for various downstream surgical applications. The [training code](https://github.com/CAMMA-public/SurgVLP) and [weights](https://github.com/CAMMA-public/PeskaVLP) are public.
Related papers
- Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI [15.513949299806582]
The automatic summarization of surgical videos is essential for enhancing procedural documentation, supporting surgical training, and facilitating post-operative analysis.
We propose a multi-modal framework that leverages recent advancements in computer vision and large language models to generate comprehensive video summaries.
We evaluate our method on the CholecT50 dataset, using instrument and action annotations from 50 laparoscopic videos.
arXiv Detail & Related papers (2025-04-28T15:46:02Z) - Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding [1.024113475677323]
The lack of datasets hinders the development of accurate and comprehensive workflow analysis solutions.
We introduce a novel approach for addressing the sparsity and heterogeneity of data inspired by the human learning procedure of watching experts and understanding their explanations.
We present the first comprehensive solution for dense video captioning (DVC) of surgical videos, addressing this task despite the absence of existing datasets in the surgical domain.
arXiv Detail & Related papers (2025-03-14T13:36:13Z) - EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery [52.992415247012296]
We introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding.
Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks.
arXiv Detail & Related papers (2025-01-20T09:12:06Z) - OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining [55.15365161143354]
OphCLIP is a hierarchical retrieval-augmented vision-language pretraining framework for ophthalmic surgical workflow understanding.
OphCLIP learns both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles.
Our OphCLIP also designs a retrieval-augmented pretraining framework to leverage the underexplored large-scale silent surgical procedure videos.
arXiv Detail & Related papers (2024-11-23T02:53:08Z) - Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models [1.4042211166197214]
We introduce an LVLM specifically designed for surgical scenarios.
We establish a LVLM model, Surgical-LLaVA, fine-tuned on instruction following data of surgical scenarios.
Experiments demonstrate that Surgical-LLaVA exhibits impressive multi-modal chat abilities in surgical contexts.
arXiv Detail & Related papers (2024-10-13T07:12:35Z) - Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data.
We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z) - HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition [51.222684687924215]
HecVL is a novel hierarchical video-language pretraining approach for building a generalist surgical model.
We propose a novel fine-to-coarse contrastive learning framework that learns separate embedding spaces for the three video-text hierarchies.
By disentangling embedding spaces of different hierarchical levels, the learned multi-modal representations encode short-term and long-term surgical concepts in the same model.
arXiv Detail & Related papers (2024-05-16T13:14:43Z) - Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery [15.47190687192761]
We introduce Surgical-LVLM, a novel personalized large vision-language model tailored for complex surgical scenarios.
We demonstrate the effectiveness of Surgical-LVLM on several benchmarks, including EndoVis-17-VQLA, EndoVis-18-VQLA, and a newly introduced EndoVis Conversations dataset.
arXiv Detail & Related papers (2024-03-22T08:38:27Z) - CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual
Question Localized-Answering in Robotic Surgery [14.52406034300867]
A surgical Visual Question Localized-Answering (VQLA) system would be helpful for medical students and junior surgeons to learn and understand from recorded surgical videos.
We propose an end-to-end Transformer with the Co-Attention gaTed Vision-Language (CAT-ViL) embedding for VQLA in surgical scenarios.
The proposed method provides a promising solution for surgical scene understanding, and opens up a primary step in the Artificial Intelligence (AI)-based VQLA system for surgical training.
arXiv Detail & Related papers (2023-07-11T11:35:40Z) - Multi-Modal Masked Autoencoders for Medical Vision-and-Language
Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
arXiv Detail & Related papers (2022-09-15T07:26:43Z) - Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z) - Relational Graph Learning on Visual and Kinematics Embeddings for
Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online approach of multi-modal graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.