Related papers: Recognizing Surgical Phases Anywhere: Few-Shot Test-time Adaptation and Task-graph Guided Refinement

Recognizing Surgical Phases Anywhere: Few-Shot Test-time Adaptation and Task-graph Guided Refinement

URL: http://arxiv.org/abs/2506.20254v2
Date: Mon, 14 Jul 2025 20:51:36 GMT
Title: Recognizing Surgical Phases Anywhere: Few-Shot Test-time Adaptation and Task-graph Guided Refinement
Authors: Kun Yuan, Tingxuan Chen, Shi Li, Joel L. Lavanchy, Christian Heiliger, Ege Özsoy, Yiming Huang, Long Bai, Nassir Navab, Vinkle Srivastav, Hongliang Ren, Nicolas Padoy,
Abstract summary: Surgical Phase Anywhere (SPA) is a lightweight framework for versatile surgical workflow understanding.<n>It adapts foundation models to institutional settings with minimal annotation.<n>It achieves state-of-the-art performance in few-shot surgical phase recognition across multiple institutions and procedures.
Score: 43.44675567476855
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The complexity and diversity of surgical workflows, driven by heterogeneous operating room settings, institutional protocols, and anatomical variability, present a significant challenge in developing generalizable models for cross-institutional and cross-procedural surgical understanding. While recent surgical foundation models pretrained on large-scale vision-language data offer promising transferability, their zero-shot performance remains constrained by domain shifts, limiting their utility in unseen surgical environments. To address this, we introduce Surgical Phase Anywhere (SPA), a lightweight framework for versatile surgical workflow understanding that adapts foundation models to institutional settings with minimal annotation. SPA leverages few-shot spatial adaptation to align multi-modal embeddings with institution-specific surgical scenes and phases. It also ensures temporal consistency through diffusion modeling, which encodes task-graph priors derived from institutional procedure protocols. Finally, SPA employs dynamic test-time adaptation, exploiting the mutual agreement between multi-modal phase prediction streams to adapt the model to a given test video in a self-supervised manner, enhancing the reliability under test-time distribution shifts. SPA is a lightweight adaptation framework, allowing hospitals to rapidly customize phase recognition models by defining phases in natural language text, annotating a few images with the phase labels, and providing a task graph defining phase transitions. The experimental results show that the SPA framework achieves state-of-the-art performance in few-shot surgical phase recognition across multiple institutions and procedures, even outperforming full-shot models with 32-shot labeled data. Code is available at https://github.com/CAMMA-public/SPA

Related papers

UniSegDiff: Boosting Unified Lesion Segmentation via a Staged Diffusion Model [53.34835793648352]
We propose UniSegDiff, a novel diffusion model framework for lesion segmentation.<n>UniSegDiff addresses lesion segmentation in a unified manner across multiple modalities and organs.<n> Comprehensive experimental results demonstrate that UniSegDiff significantly outperforms previous state-of-the-art (SOTA) approaches.
arXiv Detail & Related papers (2025-07-24T12:33:10Z)
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences.<n>Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations.<n>Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
Efficient MedSAMs: Segment Anything in Medical Images on Laptop [69.28565867103542]
We organized the first international competition dedicated to promptable medical image segmentation.<n>The top teams developed lightweight segmentation foundation models and implemented an efficient inference pipeline.<n>The best-performing algorithms have been incorporated into the open-source software with a user-friendly interface to facilitate clinical adoption.
arXiv Detail & Related papers (2024-12-20T17:33:35Z)
Neural Finite-State Machines for Surgical Phase Recognition [30.912252237906724]
Surgical phase recognition is crucial for applications in workflow optimization, performance evaluation, and real-time intervention guidance.<n>We propose the Neural Finite-State Machine (NFSM), a novel approach that enforces temporal coherence by integrating classical state-transition priors with modern neural networks.<n>We demonstrate state-of-the-art performance across multiple benchmarks, including a significant improvement on the BernBypass70 dataset.
arXiv Detail & Related papers (2024-11-27T03:21:57Z)
Diffusion Transformer Policy [48.50988753948537]
We propose a large multi-modal diffusion transformer, dubbed as Diffusion Transformer Policy, to model continuous end-effector actions.<n>By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large diverse robot datasets.
arXiv Detail & Related papers (2024-10-21T12:43:54Z)
SurgPETL: Parameter-Efficient Image-to-Surgical-Video Transfer Learning for Surgical Phase Recognition [9.675072799670458]
"Image pre-training followed by video fine-tuning" for high-dimensional video data poses significant performance bottlenecks. In this paper, we develop a parameter-efficient transfer learning benchmark SurgPETL for surgical phase recognition. We conduct extensive experiments with three advanced methods based on ViTs of two distinct scales pre-trained on five large-scale natural and medical datasets.
arXiv Detail & Related papers (2024-09-30T08:33:50Z)
PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation [51.509573838103854]
We propose a semi-supervised learning framework, termed Progressive Mean Teachers (PMT), for medical image segmentation. Our PMT generates high-fidelity pseudo labels by learning robust and diverse features in the training process. Experimental results on two datasets with different modalities, i.e., CT and MRI, demonstrate that our method outperforms the state-of-the-art medical image segmentation approaches.
arXiv Detail & Related papers (2024-09-08T15:02:25Z)
SurgPLAN: Surgical Phase Localization Network for Phase Recognition [14.857715124466594]
We propose a Surgical Phase LocAlization Network, named SurgPLAN, to facilitate a more accurate and stable surgical phase recognition. We first devise a Pyramid SlowFast (PSF) architecture to serve as the visual backbone to capture multi-scale spatial and temporal features by two branches with different frame sampling rates.
arXiv Detail & Related papers (2023-11-16T15:39:01Z)
GLSFormer : Gated - Long, Short Sequence Transformer for Step Recognition in Surgical Videos [57.93194315839009]
We propose a vision transformer-based approach to learn temporal features directly from sequence-level patches. We extensively evaluate our approach on two cataract surgery video datasets, Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods.
arXiv Detail & Related papers (2023-07-20T17:57:04Z)
Multi-Task Prediction of Clinical Outcomes in the Intensive Care Unit using Flexible Multimodal Transformers [4.836546574465437]
We propose a flexible Transformer-based EHR embedding pipeline and predictive model framework. We showcase the feasibility of our flexible design in a case study in the intensive care unit.
arXiv Detail & Related papers (2021-11-09T21:46:11Z)
Not End-to-End: Explore Multi-Stage Architecture for Online Surgical Phase Recognition [11.234115388848284]
We propose a new non end-to-end training strategy for surgical phase recognition task. For the non end-to-end training strategy, the refinement stage is trained separately with proposed two types of disturbed sequences. We evaluate three different choices of refinement models to show that our analysis and solution are robust to the choices of specific multi-stage models.
arXiv Detail & Related papers (2021-07-10T11:00:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.