General surgery vision transformer: A video pre-trained foundation model for general surgery
- URL: http://arxiv.org/abs/2403.05949v3
- Date: Fri, 12 Apr 2024 22:30:54 GMT
- Title: General surgery vision transformer: A video pre-trained foundation model for general surgery
- Authors: Samuel Schmidgall, Ji Woong Kim, Jeffrey Jopling, Axel Krieger,
- Abstract summary: We open-source the largest dataset of general surgery videos to-date, consisting of 680 hours of surgical videos.
We propose a technique for video pre-training a general surgery vision transformer (GSViT) on surgical videos based on forward video prediction.
- Score: 2.576958141988598
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The absence of openly accessible data and specialized foundation models is a major barrier for computational research in surgery. Toward this, (i) we open-source the largest dataset of general surgery videos to-date, consisting of 680 hours of surgical videos, including data from robotic and laparoscopic techniques across 28 procedures; (ii) we propose a technique for video pre-training a general surgery vision transformer (GSViT) on surgical videos based on forward video prediction that can run in real-time for surgical applications, toward which we open-source the code and weights of GSViT; (iii) we also release code and weights for procedure-specific fine-tuned versions of GSViT across 10 procedures; (iv) we demonstrate the performance of GSViT on the Cholec80 phase annotation task, displaying improved performance over state-of-the-art single frame predictors.
Related papers
- SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model [55.13206879750197]
SurgVidLM is the first video language model designed to address both full and fine-grained surgical video comprehension.<n>We introduce the StageFocus mechanism which is a two-stage framework performing the multi-grained, progressive understanding of surgical videos.<n> Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs in both full and fine-grained video understanding tasks.
arXiv Detail & Related papers (2025-06-22T02:16:18Z) - SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis [20.566701996432226]
SurgBench is a unified surgical video benchmarking framework comprising a pretraining dataset, textbfSurgBench-P, and an evaluation benchmark, textbfSurgBench-E.<n>SurgBench-P covers 53 million frames across 22 surgical procedures and 11 specialties, and SurgBench-E provides robust evaluation across six categories (phase classification, camera motion, tool recognition, disease diagnosis, action classification, and organ detection) spanning 72 fine-grained tasks.
arXiv Detail & Related papers (2025-06-09T10:02:58Z) - SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence [72.10889173696928]
We propose SurgVLM, one of the first large vision-language foundation models for surgical intelligence.<n>We construct a large-scale multimodal surgical database, SurgVLM-DB, spanning more than 16 surgical types and 18 anatomical structures.<n>Building upon this comprehensive dataset, we propose SurgVLM, which is built upon Qwen2.5-VL, and undergoes instruction tuning to 10+ surgical tasks.
arXiv Detail & Related papers (2025-06-03T07:44:41Z) - Surgical Foundation Model Leveraging Compression and Entropy Maximization for Image-Guided Surgical Assistance [50.486523249499115]
Real-time video understanding is critical to guide procedures in minimally invasive surgery (MIS)<n>We propose Compress-to-Explore (C2E), a novel self-supervised framework to learn compact, informative representations from surgical videos.<n>C2E uses entropy-maximizing decoders to compress images while preserving clinically relevant details, improving encoder performance without labeled data.
arXiv Detail & Related papers (2025-05-16T14:02:24Z) - Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI [15.513949299806582]
The automatic summarization of surgical videos is essential for enhancing procedural documentation, supporting surgical training, and facilitating post-operative analysis.
We propose a multi-modal framework that leverages recent advancements in computer vision and large language models to generate comprehensive video summaries.
We evaluate our method on the CholecT50 dataset, using instrument and action annotations from 50 laparoscopic videos.
arXiv Detail & Related papers (2025-04-28T15:46:02Z) - OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining [55.15365161143354]
OphCLIP is a hierarchical retrieval-augmented vision-language pretraining framework for ophthalmic surgical workflow understanding.
OphCLIP learns both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles.
Our OphCLIP also designs a retrieval-augmented pretraining framework to leverage the underexplored large-scale silent surgical procedure videos.
arXiv Detail & Related papers (2024-11-23T02:53:08Z) - VISAGE: Video Synthesis using Action Graphs for Surgery [34.21344214645662]
We introduce the novel task of future video generation in laparoscopic surgery.
Our proposed method, VISAGE, leverages the power of action scene graphs to capture the sequential nature of laparoscopic procedures.
Results of our experiments demonstrate high-fidelity video generation for laparoscopy procedures.
arXiv Detail & Related papers (2024-10-23T10:28:17Z) - Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data.
We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z) - PitVis-2023 Challenge: Workflow Recognition in videos of Endoscopic Pituitary Surgery [46.2901962659261]
The Pituitary Vision (VisVis) 2023 Challenge tasks the community to step and instrument recognition in videos of endoscopic pituitary surgery.
This is a unique task when compared to other minimally invasive surgeries due to the smaller working space.
There were 18-s from 9-teams across 6-countries, using a variety of deep learning models.
arXiv Detail & Related papers (2024-09-02T11:38:06Z) - Thoracic Surgery Video Analysis for Surgical Phase Recognition [0.08706730566331035]
We analyse and evaluate both frame-based and video clipping-based phase recognition on thoracic surgery dataset consisting of 11 classes of phases.
We show that Masked Video Distillation(MVD) exhibits superior performance, achieving a top-1 accuracy of 72.9%, compared to 52.31% achieved by ImageNet ViT.
arXiv Detail & Related papers (2024-06-13T14:47:57Z) - OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding [26.962250661485967]
OphNet is a large-scale, expert-annotated video benchmark for ophthalmic surgical workflow understanding.
A diverse collection of 2,278 surgical videos spanning 66 types of cataract, glaucoma, and corneal surgeries, with detailed annotations for 102 unique surgical phases and 150 fine-grained operations.
OphNet is about 20 times larger than the largest existing surgical workflow analysis benchmark.
arXiv Detail & Related papers (2024-06-11T17:18:11Z) - Endora: Video Generation Models as Endoscopy Simulators [53.72175969751398]
This paper introduces model, an innovative approach to generate medical videos that simulate clinical endoscopy scenes.
We also pioneer the first public benchmark for endoscopy simulation with video generation models.
Endora marks a notable breakthrough in the deployment of generative AI for clinical endoscopy research.
arXiv Detail & Related papers (2024-03-17T00:51:59Z) - Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [50.09187683845788]
Recent advancements in surgical computer vision applications have been driven by vision-only models.
These methods rely on manually annotated surgical videos to predict a fixed set of object categories.
In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals.
arXiv Detail & Related papers (2023-07-27T22:38:12Z) - LoViT: Long Video Transformer for Surgical Phase Recognition [59.06812739441785]
We present a two-stage method, called Long Video Transformer (LoViT) for fusing short- and long-term temporal information.
Our approach outperforms state-of-the-art methods on the Cholec80 and AutoLaparo datasets consistently.
arXiv Detail & Related papers (2023-05-15T20:06:14Z) - A real-time spatiotemporal AI model analyzes skill in open surgical
videos [2.4907439112059278]
Our work overcomes existing data limitations for training AI models by curating, from YouTube, the largest dataset of open surgical videos to date: 1997 videos from 23 surgical procedures uploaded from 50 countries.
We developed a multi-task AI model capable of real-time understanding of surgical behaviors, hands, and tools - the building blocks of procedural flow and surgeon skill.
arXiv Detail & Related papers (2021-12-14T08:11:02Z) - Multimodal Semantic Scene Graphs for Holistic Modeling of Surgical
Procedures [70.69948035469467]
We take advantage of the latest computer vision methodologies for generating 3D graphs from camera views.
We then introduce the Multimodal Semantic Graph Scene (MSSG) which aims at providing unified symbolic and semantic representation of surgical procedures.
arXiv Detail & Related papers (2021-06-09T14:35:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.