MoSFormer: Augmenting Temporal Context with Memory of Surgery for Surgical Phase Recognition
- URL: http://arxiv.org/abs/2503.00695v1
- Date: Sun, 02 Mar 2025 02:26:21 GMT
- Title: MoSFormer: Augmenting Temporal Context with Memory of Surgery for Surgical Phase Recognition
- Authors: Hao Ding, Xu Lian, Mathias Unberath
- Abstract summary: Memory of Surgery (MoS) is a framework that enriches temporal modeling by incorporating both semantically interpretable long-term surgical history and short-term impressions. MoSFormer demonstrates state-of-the-art performance on multiple benchmarks.
- Score: 6.913838841605972
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Surgical phase recognition from video enables various downstream applications. Transformer-based sliding window approaches have set the state of the art by capturing rich spatial-temporal features. However, while transformers can theoretically handle arbitrary-length sequences, in practice they are limited by memory and compute constraints, resulting in fixed context windows that struggle to maintain temporal consistency across lengthy surgical procedures. This often leads to fragmented predictions and limited procedure-level understanding. To address these challenges, we propose Memory of Surgery (MoS), a framework that enriches temporal modeling by incorporating both semantically interpretable long-term surgical history and short-term impressions. MoSFormer, our enhanced transformer architecture, integrates MoS using a carefully designed encoding and fusion mechanism. We further introduce step filtering to refine the history representation and develop a memory caching pipeline to improve training and inference stability, mitigating shortcut learning and overfitting. MoSFormer demonstrates state-of-the-art performance on multiple benchmarks. On the challenging BernBypass70 benchmark, it attains 88.0 video-level accuracy and phase-level metrics of 70.7 precision, 68.7 recall, and 66.3 F1 score, outperforming its baseline by 2.1 in video-level accuracy and by 4.6 precision, 3.6 recall, and 3.8 F1 score at the phase level. Further studies confirm the individual and combined benefits of the long-term and short-term memory components through ablation and counterfactual inference. Qualitative results show improved temporal consistency. The augmented temporal context enables procedure-level understanding, paving the way for more comprehensive surgical video analysis.
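To make the abstract's idea concrete, below is a minimal, hypothetical sketch of how a MoS-style memory could be fused into a sliding-window transformer: a long-term token summarizing which phases have occurred so far, and a short-term token pooled from frames just before the window, are prepended to the window features before per-frame classification. All module names, dimensions, and the fusion layout are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the Memory of Surgery (MoS) idea described in the
# abstract: a semantically interpretable long-term memory (phases seen so far)
# and a short-term impression (recent frame features) fused with the current
# sliding-window features. Names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class MoSFusionSketch(nn.Module):
    def __init__(self, feat_dim=512, num_phases=7):
        super().__init__()
        # Long-term memory: occurrence vector over phases predicted so far.
        self.long_proj = nn.Linear(num_phases, feat_dim)
        # Short-term impression: pooled features of frames preceding the window.
        self.short_proj = nn.Linear(feat_dim, feat_dim)
        self.fuse = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                               batch_first=True)
        self.head = nn.Linear(feat_dim, num_phases)

    def forward(self, window_feats, phase_history, short_feats):
        # window_feats:  (B, T, D) current sliding-window features
        # phase_history: (B, num_phases) which phases have occurred so far
        # short_feats:   (B, S, D) features of frames just before the window
        long_tok = self.long_proj(phase_history).unsqueeze(1)              # (B, 1, D)
        short_tok = self.short_proj(short_feats.mean(dim=1)).unsqueeze(1)  # (B, 1, D)
        tokens = torch.cat([long_tok, short_tok, window_feats], dim=1)
        fused = self.fuse(tokens)
        # Drop the two memory tokens; classify each frame in the window.
        return self.head(fused[:, 2:])

model = MoSFusionSketch()
logits = model(torch.randn(2, 32, 512), torch.rand(2, 7), torch.randn(2, 16, 512))
print(logits.shape)  # torch.Size([2, 32, 7])
```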
Related papers
- Exploiting Temporal State Space Sharing for Video Semantic Segmentation [53.8810901249897]
Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes.
Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements.
We introduce a Temporal Video State Space Sharing architecture to leverage Mamba state space models for temporal feature sharing.
Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool.
arXiv Detail & Related papers (2025-03-26T01:47:42Z)
- MuST: Multi-Scale Transformers for Surgical Phase Recognition [40.047145788604716]
Phase recognition in surgical videos is crucial for enhancing computer-aided surgical systems.
Existing methods often rely on fixed temporal windows for video analysis to identify dynamic surgical phases.
We propose Multi-Scale Transformers for Surgical Phase Recognition (MuST), a novel Transformer-based approach.
arXiv Detail & Related papers (2024-07-24T15:38:20Z)
- GLSFormer: Gated-Long, Short Sequence Transformer for Step Recognition in Surgical Videos [57.93194315839009]
We propose a vision transformer-based approach to learn temporal features directly from sequence-level patches.
We extensively evaluate our approach on two cataract surgery video datasets, Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods.
arXiv Detail & Related papers (2023-07-20T17:57:04Z)
- LoViT: Long Video Transformer for Surgical Phase Recognition [59.06812739441785]
We present a two-stage method, called Long Video Transformer (LoViT), for fusing short- and long-term temporal information.
Our approach consistently outperforms state-of-the-art methods on the Cholec80 and AutoLaparo datasets.
arXiv Detail & Related papers (2023-05-15T20:06:14Z)
- Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS Instance Segmentation [10.789826145990016]
This paper presents a deep learning framework for medical video segmentation.
Our framework explicitly extracts features from neighbouring frames across the temporal dimension.
It incorporates them with a temporal feature blender, which then tokenises the high-level temporal feature to form a strong global feature encoded via a Swin Transformer.
arXiv Detail & Related papers (2023-02-22T12:09:39Z)
- Efficient Global-Local Memory for Real-time Instrument Segmentation of Robotic Surgical Video [53.14186293442669]
We identify two important cues for surgical instrument perception: local temporal dependency from adjacent frames and global semantic correlation over long-range durations.
We propose a novel dual-memory network (DMNet) to relate both global and local-temporal knowledge.
Our method largely outperforms the state-of-the-art works on segmentation accuracy while maintaining a real-time speed.
arXiv Detail & Related papers (2021-09-28T10:10:14Z)
- Temporal Memory Relation Network for Workflow Recognition from Surgical Video [53.20825496640025]
We propose a novel end-to-end temporal memory relation network (TMNet) for relating long-range and multi-scale temporal patterns.
We have extensively validated our approach on two benchmark surgical video datasets.
arXiv Detail & Related papers (2021-03-30T13:20:26Z)
- Trans-SVNet: Accurate Phase Recognition from Surgical Videos via Hybrid Embedding Aggregation Transformer [57.18185972461453]
We introduce, for the first time in surgical workflow analysis, a Transformer to reconsider the previously ignored complementary effects of spatial and temporal features for accurate phase recognition.
Our framework is lightweight and processes the hybrid embeddings in parallel to achieve a high inference speed.
arXiv Detail & Related papers (2021-03-17T15:12:55Z)
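As a rough illustration of the hybrid embedding aggregation idea in the Trans-SVNet summary above, the following sketch lets temporal embeddings query spatial embeddings through cross-attention before a phase classifier. The names, dimensions, and layout are assumptions for illustration, not the paper's actual code.

```python
# Hypothetical sketch of hybrid embedding aggregation: temporal embeddings
# attend to per-frame spatial embeddings via cross-attention so the two
# feature kinds complement each other for phase recognition.
import torch
import torch.nn as nn

dim, num_phases = 256, 7
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
classifier = nn.Linear(dim, num_phases)

spatial = torch.randn(1, 30, dim)   # per-frame spatial embeddings
temporal = torch.randn(1, 30, dim)  # embeddings from a temporal model over the same frames

# Temporal features query spatial features; the fused output feeds the phase head.
fused, _ = cross_attn(query=temporal, key=spatial, value=spatial)
logits = classifier(fused)          # (1, 30, num_phases) per-frame phase logits
```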