OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining
- URL: http://arxiv.org/abs/2411.15421v1
- Date: Sat, 23 Nov 2024 02:53:08 GMT
- Title: OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining
- Authors: Ming Hu, Kun Yuan, Yaling Shen, Feilong Tang, Xiaohao Xu, Lin Zhou, Wei Li, Ying Chen, Zhongxing Xu, Zelin Peng, Siyuan Yan, Vinkle Srivastav, Diping Song, Tianbin Li, Danli Shi, Jin Ye, Nicolas Padoy, Nassir Navab, Junjun He
- Abstract summary: OphCLIP is a hierarchical retrieval-augmented vision-language pretraining framework for ophthalmic surgical workflow understanding.
OphCLIP learns both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles.
OphCLIP also incorporates a retrieval-augmented pretraining strategy that leverages large-scale, underexplored silent surgical procedure videos.
- Score: 55.15365161143354
- Abstract: Surgical practice involves complex visual interpretation, procedural skills, and advanced medical knowledge, making surgical vision-language pretraining (VLP) particularly challenging due to this complexity and the limited availability of annotated data. To address this gap, we propose OphCLIP, a hierarchical retrieval-augmented vision-language pretraining framework specifically designed for ophthalmic surgical workflow understanding. OphCLIP leverages OphVL, a large-scale, comprehensive dataset we constructed of over 375K hierarchically structured video-text pairs spanning tens of thousands of attribute combinations (surgeries, phases/operations/actions, instruments, and medications, as well as higher-level aspects such as causes of eye diseases, surgical objectives, and postoperative recovery recommendations). These hierarchical video-text correspondences enable OphCLIP to learn both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles, capturing intricate surgical details and high-level procedural insights, respectively. OphCLIP also incorporates a retrieval-augmented pretraining strategy that leverages the underexplored large-scale silent surgical procedure videos, automatically retrieving semantically relevant content to enhance the representation learning of narrative videos. Evaluation across 11 datasets for phase recognition and multi-instrument identification shows OphCLIP's robust generalization and superior performance.
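The two-level alignment and the retrieval step described above can be made concrete with a short sketch. The following PyTorch code is a minimal illustration under stated assumptions: a CLIP-style symmetric InfoNCE objective at each hierarchy level and a precomputed embedding bank for silent videos; the function names, the `alpha` weighting, and the top-k cosine retrieval are illustrative, not the paper's exact formulation.

```python
# Minimal sketch (assumptions noted above), not OphCLIP's actual implementation.
import torch
import torch.nn.functional as F

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss over a batch of paired embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

def hierarchical_loss(clip_emb, narration_emb, video_emb, title_emb, alpha=0.5):
    """Fine-grained (clip <-> narration) plus long-term (video <-> title) alignment."""
    fine = info_nce(clip_emb, narration_emb)    # short clips vs. detailed narrations
    coarse = info_nce(video_emb, title_emb)     # full videos vs. structured titles
    return alpha * fine + (1 - alpha) * coarse

def retrieve_silent_neighbors(query_emb, silent_bank_emb, k=4):
    """Retrieval-augmented step: fetch the top-k semantically closest silent videos
    from a precomputed embedding bank to enrich a narrative video's representation."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(silent_bank_emb, dim=-1).T
    topk = sims.topk(k, dim=-1).indices         # (B, k) indices into the silent bank
    return silent_bank_emb[topk]                # (B, k, D) retrieved neighbor features
```

In such a setup, the retrieved neighbor features would typically be fused into the narrative video's representation (e.g., by averaging or cross-attention) before the contrastive losses are computed.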
Related papers
- Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models [1.4042211166197214]
We introduce Surgical-LLaVA, an LVLM specifically designed for surgical scenarios and fine-tuned on instruction-following data from such settings.
Experiments demonstrate that Surgical-LLaVA exhibits impressive multi-modal chat abilities in surgical contexts.
arXiv Detail & Related papers (2024-10-13T07:12:35Z)
- Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data.
We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z)
- HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition [51.222684687924215]
HecVL is a novel hierarchical video-language pretraining approach for building a generalist surgical model.
We propose a novel fine-to-coarse contrastive learning framework that learns separate embedding spaces for the three video-text hierarchies.
By disentangling the embedding spaces of the different hierarchical levels, the learned multi-modal representations encode short-term and long-term surgical concepts in the same model (a minimal sketch of this per-level projection idea follows the related-papers list below).
arXiv Detail & Related papers (2024-05-16T13:14:43Z)
- Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery [15.47190687192761]
We introduce Surgical-LVLM, a novel personalized large vision-language model tailored for complex surgical scenarios.
We demonstrate the effectiveness of Surgical-LVLM on several benchmarks, including EndoVis-17-VQLA, EndoVis-18-VQLA, and a newly introduced EndoVis Conversations dataset.
arXiv Detail & Related papers (2024-03-22T08:38:27Z)
- Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present Hierarchical Language-tied Self-Supervision (HLSS), a novel language-tied self-supervised learning framework for histopathology images.
Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, the OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z)
- CLIP in Medical Imaging: A Comprehensive Survey [59.429714742927956]
Contrastive Language-Image Pre-training successfully introduces text supervision to vision models.
It has shown promising results across various tasks, attributable to its generalizability and interpretability.
CLIP has recently attracted increasing interest in the medical imaging domain.
arXiv Detail & Related papers (2023-12-12T15:21:57Z)
- Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [51.78027546947034]
Recent advancements in surgical computer vision have been driven by vision-only models, which lack language semantics.
We propose leveraging surgical video lectures from e-learning platforms to provide effective vision and language supervisory signals.
We address surgery-specific linguistic challenges by using multiple automatic speech recognition systems to generate text transcriptions.
arXiv Detail & Related papers (2023-07-27T22:38:12Z)
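As a complement to the HecVL entry above (referenced there), here is a minimal sketch of one way separate per-level embedding spaces could be realized: a shared backbone feature is projected by a dedicated head for each video-text hierarchy, and a per-level contrastive loss (such as the InfoNCE sketched earlier) is applied in each space. The module name, feature dimensions, and level names are illustrative assumptions, not HecVL's actual code.

```python
# Minimal sketch of per-hierarchy embedding spaces; names and sizes are assumed.
import torch.nn as nn

class MultiLevelProjector(nn.Module):
    """Projects shared backbone features into disentangled per-level embedding spaces."""
    def __init__(self, feat_dim=768, emb_dim=256, levels=("clip", "phase", "video")):
        super().__init__()
        # One projection head per hierarchy level, so short-term and long-term
        # concepts are aligned in separate spaces rather than competing in one.
        self.heads = nn.ModuleDict({level: nn.Linear(feat_dim, emb_dim) for level in levels})

    def forward(self, features, level):
        return self.heads[level](features)
```

Pairing each level's projected video and text embeddings with its own contrastive loss keeps the spaces disentangled while the backbone remains shared.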
This list is automatically generated from the titles and abstracts of the papers on this site.