SurgLaVi: Large-Scale Hierarchical Dataset for Surgical Vision-Language Representation Learning
- URL: http://arxiv.org/abs/2509.10555v1
- Date: Tue, 09 Sep 2025 21:21:10 GMT
- Title: SurgLaVi: Large-Scale Hierarchical Dataset for Surgical Vision-Language Representation Learning
- Authors: Alejandra Perez, Chinedu Nwoye, Ramtin Raji Kermani, Omid Mohareri, Muhammad Abdullah Jamal,
- Abstract summary: We present SurgLaVi, the largest and most diverse surgical vision-language dataset to date, comprising nearly 240k clip-caption pairs from more than 200 procedures. At the core of SurgLaVi lies a fully automated pipeline that generates fine-grained transcriptions of surgical videos and segments them into coherent procedural units. To ensure accessibility, we release SurgLaVi-beta, an open-source derivative of 113k clip-caption pairs constructed entirely from public data.
- Score: 41.95743276961411
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-language pre-training (VLP) offers unique advantages for surgery by aligning language with surgical videos, enabling workflow understanding and transfer across tasks without relying on expert-labeled datasets. However, progress in surgical VLP remains constrained by the limited scale, procedural diversity, semantic quality, and hierarchical structure of existing datasets. In this work, we present SurgLaVi, the largest and most diverse surgical vision-language dataset to date, comprising nearly 240k clip-caption pairs from more than 200 procedures and organized into hierarchical phase-, step-, and task-level annotations. At the core of SurgLaVi lies a fully automated pipeline that systematically generates fine-grained transcriptions of surgical videos and segments them into coherent procedural units. To ensure high-quality annotations, it applies dual-modality filtering to remove irrelevant and noisy samples. Within this framework, the resulting captions are enriched with contextual detail, producing annotations that are both semantically rich and easy to interpret. To ensure accessibility, we release SurgLaVi-β, an open-source derivative of 113k clip-caption pairs constructed entirely from public data, which is over four times larger than existing surgical VLP datasets. To demonstrate the value of the SurgLaVi datasets, we introduce SurgCLIP, a CLIP-style video-text contrastive framework with dual encoders, as a representative base model. SurgCLIP achieves consistent improvements across phase, step, action, and tool recognition, surpassing prior state-of-the-art methods, often by large margins. These results validate that large-scale, semantically rich, and hierarchically structured datasets directly translate into stronger and more generalizable representations, establishing SurgLaVi as a key resource for developing surgical foundation models.
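The abstract describes SurgCLIP only as a CLIP-style video-text contrastive framework with dual encoders, without implementation details. The following is a minimal sketch of the symmetric clip-to-caption contrastive objective such a framework typically uses; the function name, embedding dimension, and temperature value are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def surgclip_style_loss(video_emb: torch.Tensor,
                        text_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of clip-caption pairs.

    video_emb, text_emb: (B, D) outputs of the two (assumed) encoders,
    where row i of each tensor comes from the same clip-caption pair.
    """
    # L2-normalize so the dot product is a cosine similarity, as in CLIP.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the matched pairs.
    logits = v @ t.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: video-to-text and text-to-video.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)

# Illustrative usage with random tensors standing in for encoder outputs.
if __name__ == "__main__":
    B, D = 8, 512
    loss = surgclip_style_loss(torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```

In this style of training, in-batch negatives come for free: every non-diagonal clip-caption combination in the batch acts as a mismatched pair.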
Related papers
- Vision-Language Enhanced Foundation Model for Semi-supervised Medical Image Segmentation [6.524403694193453]
Semi-supervised learning (SSL) has emerged as an effective paradigm for medical image segmentation. We introduce a Vision-Language Enhanced Semi-supervised Assistant (VESSA) that incorporates foundation-level visual-semantic understanding into SSL frameworks. In Stage 1, VESSA is trained as a reference-guided segmentation assistant using a template bank containing gold-standard exemplars. In Stage 2, VESSA is integrated into a state-of-the-art SSL framework, enabling dynamic interaction with the student model.
arXiv Detail & Related papers (2025-11-24T22:33:19Z)
- HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models [15.877790469608662]
We introduce an LVLM-driven data refinement pipeline to enhance the quality of image-text pair data. We propose a training paradigm that extends conventional contrastive learning by incorporating negative descriptions and short tags. Our approach achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding tasks.
arXiv Detail & Related papers (2025-07-30T07:21:36Z)
- HiLa: Hierarchical Vision-Language Collaboration for Cancer Survival Prediction [55.00788339683146]
We propose a novel Hierarchical vision-Language collaboration framework for improved survival prediction. Specifically, HiLa employs pretrained feature extractors to generate hierarchical visual features from WSIs at both patch and region levels. This approach enables the comprehensive learning of discriminative visual features corresponding to different survival-related attributes from prompts.
arXiv Detail & Related papers (2025-07-07T02:06:25Z)
- SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model [67.8359850515282]
SurgVidLM is the first video language model designed to address both full and fine-grained surgical video comprehension. We show that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding tasks.
arXiv Detail & Related papers (2025-06-22T02:16:18Z)
- Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI [15.513949299806582]
The automatic summarization of surgical videos is essential for enhancing procedural documentation, supporting surgical training, and facilitating post-operative analysis. We propose a multi-modal framework that leverages recent advancements in computer vision and large language models to generate comprehensive video summaries. We evaluate our method on the CholecT50 dataset, using instrument and action annotations from 50 laparoscopic videos.
arXiv Detail & Related papers (2025-04-28T15:46:02Z)
- OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining [60.75854609803651]
OphCLIP is a hierarchical retrieval-augmented vision-language pretraining framework for ophthalmic surgical workflow understanding. OphCLIP learns both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles. Our OphCLIP also designs a retrieval-augmented pretraining framework to leverage the underexplored large-scale silent surgical procedure videos.
arXiv Detail & Related papers (2024-11-23T02:53:08Z)
- Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data. We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z)
- LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning [15.646322352232819]
We create a new dataset, Surg-QA, consisting of 102,000 surgical video-instruction pairs.
We propose a novel two-stage question-answer generation pipeline with LLM to learn surgical knowledge.
We train LLaVA-Surg, a novel vision-language conversational assistant capable of answering open-ended questions about surgical videos.
arXiv Detail & Related papers (2024-08-15T07:00:20Z)
- HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition [51.222684687924215]
HecVL is a novel hierarchical video-language pretraining approach for building a generalist surgical model. By disentangling embedding spaces of different hierarchical levels, the learned multi-modal representations encode short-term and long-term surgical concepts in the same model (see the sketch after this list). We show that the same HecVL model for surgical phase recognition can be transferred across different surgical procedures and medical centers.
arXiv Detail & Related papers (2024-05-16T13:14:43Z)
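Both HecVL above and SurgLaVi's phase-, step-, and task-level annotations rest on keeping the hierarchical levels in separate embedding spaces. As a rough illustration only, the sketch below shows one common way to realize this with level-specific projection heads on top of a shared feature; the class name, dimensions, and the choice of simple linear heads are assumptions, not taken from either paper.

```python
import torch
import torch.nn as nn

class HierarchicalProjection(nn.Module):
    """Maps a shared backbone feature to one embedding per hierarchy level.

    Keeping a separate projection head per level (phase, step, task) gives
    each level its own embedding space, so long-term and short-term surgical
    concepts do not have to compete inside a single space.
    """
    def __init__(self, feat_dim: int = 768, embed_dim: int = 512,
                 levels=("phase", "step", "task")):
        super().__init__()
        self.heads = nn.ModuleDict(
            {level: nn.Linear(feat_dim, embed_dim) for level in levels}
        )

    def forward(self, features: torch.Tensor) -> dict:
        # features: (B, feat_dim) pooled clip (or caption) features.
        return {level: head(features) for level, head in self.heads.items()}

# Each level's embeddings would then feed its own contrastive loss,
# e.g. the surgclip_style_loss sketched earlier on this page.
```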