Related papers: SemiVT-Surge: Semi-Supervised Video Transformer for Surgical Phase Recognition

SemiVT-Surge: Semi-Supervised Video Transformer for Surgical Phase Recognition

URL: http://arxiv.org/abs/2506.01471v1
Date: Mon, 02 Jun 2025 09:32:12 GMT
Title: SemiVT-Surge: Semi-Supervised Video Transformer for Surgical Phase Recognition
Authors: Yiping Li, Ronald de Jong, Sahar Nasirihaghighi, Tim Jaspers, Romy van Jaarsveld, Gino Kuiper, Richard van Hillegersberg, Fons van der Sommen, Jelle Ruurda, Marcel Breeuwer, Yasmina Al Khalil,
Abstract summary: We propose a video transformer-based model with a robust pseudo-labeling framework.<n>By incorporating unlabeled data, we achieve state-of-the-art performance on RAMIE with a 4.9% accuracy increase.<n>Our findings establish a strong benchmark for semi-supervised surgical phase recognition.
Score: 2.764986157003598
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Accurate surgical phase recognition is crucial for computer-assisted interventions and surgical video analysis. Annotating long surgical videos is labor-intensive, driving research toward leveraging unlabeled data for strong performance with minimal annotations. Although self-supervised learning has gained popularity by enabling large-scale pretraining followed by fine-tuning on small labeled subsets, semi-supervised approaches remain largely underexplored in the surgical domain. In this work, we propose a video transformer-based model with a robust pseudo-labeling framework. Our method incorporates temporal consistency regularization for unlabeled data and contrastive learning with class prototypes, which leverages both labeled data and pseudo-labels to refine the feature space. Through extensive experiments on the private RAMIE (Robot-Assisted Minimally Invasive Esophagectomy) dataset and the public Cholec80 dataset, we demonstrate the effectiveness of our approach. By incorporating unlabeled data, we achieve state-of-the-art performance on RAMIE with a 4.9% accuracy increase and obtain comparable results to full supervision while using only 1/4 of the labeled data on Cholec80. Our findings establish a strong benchmark for semi-supervised surgical phase recognition, paving the way for future research in this domain.

Related papers

CathFlow: Self-Supervised Segmentation of Catheters in Interventional Ultrasound Using Optical Flow and Transformers [66.15847237150909]
We introduce a self-supervised deep learning architecture to segment catheters in longitudinal ultrasound images. The network architecture builds upon AiAReSeg, a segmentation transformer built with the Attention in Attention mechanism. We validated our model on a test dataset, consisting of unseen synthetic data and images collected from silicon aorta phantoms.
arXiv Detail & Related papers (2024-03-21T15:13:36Z)
Jumpstarting Surgical Computer Vision [2.585559512929966]
We develop recommendations for pretraining dataset composition through over 300 experiments.<n>We outperform state-of-the-art pre-trainings on two public benchmarks for phase recognition.
arXiv Detail & Related papers (2023-12-10T18:54:16Z)
Benchmarking Pathology Feature Extractors for Whole Slide Image Classification [2.173830337391778]
Weakly supervised whole slide image classification is a key task in computational pathology. We conduct a comprehensive benchmarking of feature extractors to answer three critical questions. We observe empirically, and by analysing the latent space, that skipping stain normalisation and image augmentations does not degrade performance. We develop a novel evaluation metric to compare relative downstream performance, and show that the choice of feature extractor is the most consequential factor for downstream performance.
arXiv Detail & Related papers (2023-11-20T13:58:26Z)
Shifting to Machine Supervision: Annotation-Efficient Semi and Self-Supervised Learning for Automatic Medical Image Segmentation and Classification [9.67209046726903]
We introduce the S4MI pipeline, a novel approach that leverages advancements in self-supervised and semi-supervised learning. Our study benchmarks these techniques on three distinct medical imaging datasets to evaluate their effectiveness in classification and segmentation tasks. Remarkably, the semi-supervised approach demonstrated superior outcomes in segmentation, outperforming fully-supervised methods while using 50% fewer labels across all datasets.
arXiv Detail & Related papers (2023-11-17T04:04:29Z)
Dual-Decoder Consistency via Pseudo-Labels Guided Data Augmentation for Semi-Supervised Medical Image Segmentation [13.707121013895929]
We present a novel semi-supervised learning method, Dual-Decoder Consistency via Pseudo-Labels Guided Data Augmentation. We use distinct decoders for student and teacher networks while maintain the same encoder. To learn from unlabeled data, we create pseudo-labels generated by the teacher networks and augment the training data with the pseudo-labels.
arXiv Detail & Related papers (2023-08-31T09:13:34Z)
AutoLaparo: A New Dataset of Integrated Multi-tasks for Image-guided Surgical Automation in Laparoscopic Hysterectomy [42.20922574566824]
We present and release the first integrated dataset with multiple image-based perception tasks to facilitate learning-based automation in hysterectomy surgery. Our AutoLaparo dataset is developed based on full-length videos of entire hysterectomy procedures. Specifically, three different yet highly correlated tasks are formulated in the dataset, including surgical workflow recognition, laparoscope motion prediction, and instrument and key anatomy segmentation.
arXiv Detail & Related papers (2022-08-03T13:17:23Z)
Pseudo-label Guided Cross-video Pixel Contrast for Robotic Surgical Scene Segmentation with Limited Annotations [72.15956198507281]
We propose PGV-CL, a novel pseudo-label guided cross-video contrast learning method to boost scene segmentation. We extensively evaluate our method on a public robotic surgery dataset EndoVis18 and a public cataract dataset CaDIS.
arXiv Detail & Related papers (2022-07-20T05:42:19Z)
Dissecting Self-Supervised Learning Methods for Surgical Computer Vision [51.370873913181605]
Self-Supervised Learning (SSL) methods have begun to gain traction in the general computer vision community. The effectiveness of SSL methods in more complex and impactful domains, such as medicine and surgery, remains limited and unexplored. We present an extensive analysis of the performance of these methods on the Cholec80 dataset for two fundamental and popular tasks in surgical context understanding, phase recognition and tool presence detection.
arXiv Detail & Related papers (2022-07-01T14:17:11Z)
Deep Semi-supervised Metric Learning with Dual Alignment for Cervical Cancer Cell Detection [49.78612417406883]
We propose a novel semi-supervised deep metric learning method for cervical cancer cell detection. Our model learns an embedding metric space and conducts dual alignment of semantic features on both the proposal and prototype levels. We construct a large-scale dataset for semi-supervised cervical cancer cell detection for the first time, consisting of 240,860 cervical cell images.
arXiv Detail & Related papers (2021-04-07T17:11:27Z)
Co-Generation and Segmentation for Generalized Surgical Instrument Segmentation on Unlabelled Data [49.419268399590045]
Surgical instrument segmentation for robot-assisted surgery is needed for accurate instrument tracking and augmented reality overlays. Deep learning-based methods have shown state-of-the-art performance for surgical instrument segmentation, but their results depend on labelled data. In this paper, we demonstrate the limited generalizability of these methods on different datasets, including human robot-assisted surgeries.
arXiv Detail & Related papers (2021-03-16T18:41:18Z)
LRTD: Long-Range Temporal Dependency based Active Learning for Surgical Workflow Recognition [67.86810761677403]
We propose a novel active learning method for cost-effective surgical video analysis. Specifically, we propose a non-local recurrent convolutional network (NL-RCNet), which introduces non-local block to capture the long-range temporal dependency. We validate our approach on a large surgical video dataset (Cholec80) by performing surgical workflow recognition task.
arXiv Detail & Related papers (2020-04-21T09:21:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.