Related papers: HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition

HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition

URL: http://arxiv.org/abs/2405.10075v1
Date: Thu, 16 May 2024 13:14:43 GMT
Title: HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition
Authors: Kun Yuan, Vinkle Srivastav, Nassir Navab, Nicolas Padoy,
Abstract summary: HecVL is a novel hierarchical video-language pretraining approach for building a generalist surgical model. We propose a novel fine-to-coarse contrastive learning framework that learns separate embedding spaces for the three video-text hierarchies. By disentangling embedding spaces of different hierarchical levels, the learned multi-modal representations encode short-term and long-term surgical concepts in the same model.
Score: 51.222684687924215
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Natural language could play an important role in developing generalist surgical models by providing a broad source of supervision from raw texts. This flexible form of supervision can enable the model's transferability across datasets and tasks as natural language can be used to reference learned visual concepts or describe new ones. In this work, we present HecVL, a novel hierarchical video-language pretraining approach for building a generalist surgical model. Specifically, we construct a hierarchical video-text paired dataset by pairing the surgical lecture video with three hierarchical levels of texts: at clip-level, atomic actions using transcribed audio texts; at phase-level, conceptual text summaries; and at video-level, overall abstract text of the surgical procedure. Then, we propose a novel fine-to-coarse contrastive learning framework that learns separate embedding spaces for the three video-text hierarchies using a single model. By disentangling embedding spaces of different hierarchical levels, the learned multi-modal representations encode short-term and long-term surgical concepts in the same model. Thanks to the injected textual semantics, we demonstrate that the HecVL approach can enable zero-shot surgical phase recognition without any human annotation. Furthermore, we show that the same HecVL model for surgical phase recognition can be transferred across different surgical procedures and medical centers.

Related papers

SurgLaVi: Large-Scale Hierarchical Dataset for Surgical Vision-Language Representation Learning [41.95743276961411]
We present SurgLaVi, the largest and most diverse surgical vision-language dataset to date, comprising nearly 240k clip-caption pairs from more than 200 procedures.<n>At the core of SurgLaVi lies a fully automated pipeline that generates fine-grained transcriptions of surgical videos and segments them into coherent procedural units.<n>To ensure accessibility, we release SurgLaVi-beta, an open-source derivative of 113k clip-caption pairs constructed entirely from public data.
arXiv Detail & Related papers (2025-09-09T21:21:10Z)
SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting [45.16104996137126]
We present SurgTPGS, a novel text-promptable Gaussian Splatting method to fill this gap.<n>We propose semantic-aware deformation tracking to capture the seamless deformation of semantic features, providing a more precise reconstruction for both texture and semantic features.<n>We conduct comprehensive experiments on two real-world surgical datasets to demonstrate the superiority of SurgTPGS over state-of-the-art methods.
arXiv Detail & Related papers (2025-06-29T15:55:01Z)
Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI [15.513949299806582]
The automatic summarization of surgical videos is essential for enhancing procedural documentation, supporting surgical training, and facilitating post-operative analysis. We propose a multi-modal framework that leverages recent advancements in computer vision and large language models to generate comprehensive video summaries. We evaluate our method on the CholecT50 dataset, using instrument and action annotations from 50 laparoscopic videos.
arXiv Detail & Related papers (2025-04-28T15:46:02Z)
Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis [13.702423348269155]
We propose a new task -- generating speech from videos of people and their transcripts (VTTS) -- to motivate new techniques for multimodal speech generation. We present a decoder-only multimodal model for this task, which we call Visatronic. It embeds vision, text and speech directly into the common subspace of a transformer model and uses an autoregressive loss to learn a generative model of discretized mel-spectrograms conditioned on speaker videos and transcripts of their speech.
arXiv Detail & Related papers (2024-11-26T18:57:29Z)
OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining [55.15365161143354]
OphCLIP is a hierarchical retrieval-augmented vision-language pretraining framework for ophthalmic surgical workflow understanding. OphCLIP learns both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles. Our OphCLIP also designs a retrieval-augmented pretraining framework to leverage the underexplored large-scale silent surgical procedure videos.
arXiv Detail & Related papers (2024-11-23T02:53:08Z)
Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data. We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z)
Enhancing Gait Video Analysis in Neurodegenerative Diseases by Knowledge Augmentation in Vision Language Model [10.742625681420279]
Based on a large-scale pre-trained Vision Language Model (VLM), our model learns and improves visual, textual, and numerical representations of patient gait videos. Results demonstrate that our model not only significantly outperforms state-of-the-art methods in video-based classification tasks but also adeptly decodes the learned class-specific text features into natural language descriptions.
arXiv Detail & Related papers (2024-03-20T17:03:38Z)
DeViL: Decoding Vision features into Language [53.88202366696955]
Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks. In this work, we would like to provide natural language descriptions for what different layers of a vision backbone have learned. We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language.
arXiv Detail & Related papers (2023-09-04T13:59:55Z)
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [51.78027546947034]
Recent advancements in surgical computer vision have been driven by vision-only models, which lack language semantics. We propose leveraging surgical video lectures from e-learning platforms to provide effective vision and language supervisory signals. We address surgery-specific linguistic challenges using multiple automatic speech recognition systems for text transcriptions.
arXiv Detail & Related papers (2023-07-27T22:38:12Z)
Text Promptable Surgical Instrument Segmentation with Vision-Language Models [16.203166812021045]
We propose a novel text promptable surgical instrument segmentation approach to overcome challenges associated with diversity and differentiation of surgical instruments. We leverage pretrained image and text encoders as our model backbone and design a text promptable mask decoder. Experiments on several surgical instrument segmentation datasets demonstrate our model's superior performance and promising generalization capability.
arXiv Detail & Related papers (2023-06-15T16:26:20Z)
Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels. We achieve this using two novel tasks: symmetric cross-modality reconstruction and a pseudo-labeled key word prediction. We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning. Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.