Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential
- URL: http://arxiv.org/abs/2512.21284v1
- Date: Wed, 24 Dec 2025 17:05:09 GMT
- Title: Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential
- Authors: Shihao Zou, Jingjing Li, Wei Ji, Jincai Huang, Kai Wang, Guo Dan, Weixin Si, Yi Pan
- Abstract summary: We propose \textit{SpikeSurgSeg}, the first spike-driven video Transformer framework tailored for surgical scene segmentation. SpikeSurgSeg achieves mIoU comparable to SOTA ANN-based models while reducing inference latency by at least $8\times$.
- Score: 26.958261975749974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern surgical systems increasingly rely on intelligent scene understanding to provide timely situational awareness for enhanced intra-operative safety. Within this pipeline, surgical scene segmentation plays a central role in accurately perceiving operative events. Although recent deep learning models, particularly large-scale foundation models, achieve remarkable segmentation accuracy, their substantial computational demands and power consumption hinder real-time deployment in resource-constrained surgical environments. To address this limitation, we explore emerging spiking neural networks (SNNs) as a promising paradigm for highly efficient surgical intelligence. However, their performance is still constrained by the scarcity of labeled surgical data and the inherently sparse nature of surgical video representations. To this end, we propose \textit{SpikeSurgSeg}, the first spike-driven video Transformer framework tailored for surgical scene segmentation with real-time potential on non-GPU platforms. To address the limited availability of surgical annotations, we introduce a surgical-scene masked autoencoding pretraining strategy for SNNs that enables robust spatiotemporal representation learning via layer-wise tube masking. Building on this pretrained backbone, we further adopt a lightweight spike-driven segmentation head that produces temporally consistent predictions while preserving the low-latency characteristics of SNNs. Extensive experiments on EndoVis18 and our in-house SurgBleed dataset demonstrate that SpikeSurgSeg achieves mIoU comparable to SOTA ANN-based models while reducing inference latency by at least $8\times$. Notably, it delivers over $20\times$ acceleration relative to most foundation-model baselines, underscoring its potential for time-critical surgical scene segmentation.
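As a rough illustration of the tube-masking idea behind the pretraining strategy (all names and shapes below are illustrative assumptions, not the authors' code), a tube mask hides the same spatial patches in every frame of a clip, so the model must learn spatiotemporal structure rather than copying patches across frames:

```python
import torch

def tube_mask(batch: int, frames: int, num_patches: int, mask_ratio: float = 0.75):
    """Sample a boolean mask shared across time: a masked spatial patch
    (a "tube") is hidden in every frame of the clip, so reconstruction
    cannot rely on copying the same patch from a neighboring frame."""
    num_masked = int(num_patches * mask_ratio)
    noise = torch.rand(batch, num_patches)          # one random score per spatial patch
    ids = noise.argsort(dim=1)                      # random permutation of patches
    mask = torch.zeros(batch, num_patches, dtype=torch.bool)
    mask.scatter_(1, ids[:, :num_masked], True)     # True = masked
    return mask.unsqueeze(1).expand(batch, frames, num_patches)

# Usage: only visible tokens are fed to the encoder.
tokens = torch.randn(2, 8, 196, 384)                # (B, T, N, D) patch embeddings
mask = tube_mask(2, 8, 196)
visible = tokens[~mask].reshape(2, -1, 384)         # encoder input; masked tubes are targets
```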
Related papers
- Token Merging via Spatiotemporal Information Mining for Surgical Video Understanding [32.4892900455388]
We propose the spatiotemporal information mining token merging (STIM-TM) method, the first dedicated token-merging approach for surgical video understanding tasks. STIM-TM introduces a decoupled strategy that reduces token redundancy along the temporal and spatial dimensions independently. Operating in a training-free manner, STIM-TM achieves significant efficiency gains, with over $65$ GFLOPs reduction, while preserving competitive accuracy across comprehensive surgical video tasks.
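A minimal sketch of the temporal half of such a decoupled, training-free merging scheme; the pairwise averaging rule and threshold below are assumptions for illustration, not the published STIM-TM algorithm:

```python
import torch
import torch.nn.functional as F

def temporal_token_merge(x: torch.Tensor, threshold: float = 0.9):
    """Pairwise temporal merging sketch: a token whose cosine similarity to
    the token at the same spatial position in the previous frame exceeds
    `threshold` is averaged into that token and dropped from the sequence.
    x: (T, N, D) frames x spatial tokens x channels."""
    sim = F.cosine_similarity(x[1:], x[:-1], dim=-1)                 # (T-1, N)
    redundant = sim > threshold
    keep = torch.cat([torch.ones(1, x.shape[1], dtype=torch.bool),
                      ~redundant])                                   # (T, N) surviving tokens
    merged = x.clone()
    merged[:-1][redundant] = 0.5 * (x[:-1][redundant] + x[1:][redundant])
    return merged, keep

x = torch.randn(8, 196, 384)
merged, keep = temporal_token_merge(x)
visible = merged[keep]                 # attention then runs only on kept tokens
```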
arXiv Detail & Related papers (2025-09-28T06:24:57Z)
- Future Slot Prediction for Unsupervised Object Discovery in Surgical Video [10.984331138780682]
Object-centric slot attention is an emerging paradigm for unsupervised learning of structured, interpretable object-centric representations. Current approaches with an adaptive slot count perform well on images, but their performance on surgical videos is low. We propose a dynamic temporal slot transformer (DTST) module that is trained both for temporal reasoning and for predicting the optimal future slot.
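For context, one slot-attention iteration in its generic form; this is the paradigm the paper builds on, not the DTST module itself, and all layer names here are placeholders:

```python
import torch

def slot_attention_step(slots, inputs, proj_q, proj_k, proj_v, gru):
    """One generic slot-attention update: slots compete for input features
    via a softmax over the slot axis, then each slot is refreshed by a GRU."""
    q = proj_q(slots)                                        # (B, S, D)
    k, v = proj_k(inputs), proj_v(inputs)                    # (B, N, D)
    attn = torch.einsum('bsd,bnd->bsn', q, k) / q.shape[-1] ** 0.5
    attn = attn.softmax(dim=1)                               # normalize over slots -> competition
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    updates = torch.einsum('bsn,bnd->bsd', attn, v)          # weighted mean of inputs
    return gru(updates.flatten(0, 1), slots.flatten(0, 1)).view_as(slots)

B, S, N, D = 2, 6, 196, 64
proj = lambda: torch.nn.Linear(D, D)
slots = torch.randn(B, S, D)                                 # object slots
feats = torch.randn(B, N, D)                                 # per-frame encoder features
slots = slot_attention_step(slots, feats, proj(), proj(), proj(),
                            torch.nn.GRUCell(D, D))
```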
arXiv Detail & Related papers (2025-07-02T16:52:16Z)
- CSMAE: Cataract Surgical Masked Autoencoder (MAE) based Pre-training [25.71088804562768]
We introduce a Masked Autoencoder (MAE)-based pretraining approach, specifically developed for cataract surgery video analysis. Instead of randomly selecting tokens for masking, tokens are selected based on their importance. This approach surpasses current state-of-the-art self-supervised pretraining and adapter-based learning methods by a significant margin.
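A hedged sketch of importance-guided masking in an MAE pipeline; the feature-norm importance proxy below is a stand-in, since the abstract does not specify the paper's scoring criterion:

```python
import torch

def importance_masking(tokens: torch.Tensor, scores: torch.Tensor,
                       mask_ratio: float = 0.75):
    """Instead of a random subset, mask the highest-scoring tokens so the
    decoder must reconstruct the most informative regions.
    tokens: (B, N, D), scores: (B, N)."""
    num_masked = int(tokens.shape[1] * mask_ratio)
    order = scores.argsort(dim=1, descending=True)       # most important first
    masked_ids, visible_ids = order[:, :num_masked], order[:, num_masked:]
    batch = torch.arange(tokens.shape[0]).unsqueeze(-1)
    return tokens[batch, visible_ids], masked_ids        # encoder input, reconstruction targets

tokens = torch.randn(4, 196, 768)
scores = tokens.norm(dim=-1)                             # toy importance proxy
visible, masked_ids = importance_masking(tokens, scores)
```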
arXiv Detail & Related papers (2025-02-12T22:24:49Z)
- Vivim: a Video Vision Mamba for Medical Video Segmentation [52.11785024350253]
This paper presents a Video Vision Mamba-based framework, dubbed Vivim, for medical video segmentation tasks.
Our Vivim can effectively compress the long-term representation into sequences at varying scales.
Experiments on thyroid segmentation, breast lesion segmentation in ultrasound videos, and polyp segmentation in colonoscopy videos demonstrate the effectiveness and efficiency of our Vivim.
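The block underlying Mamba-style models is a linear state-space recurrence; a minimal diagonal version over flattened video tokens (only the recurrence core, without Vivim's input-dependent parameters or fused hardware-aware scans) looks like:

```python
import torch

def ssm_scan(x, A, B, C):
    """Minimal diagonal state-space recurrence h_t = A*h_{t-1} + B*x_t,
    y_t = C*h_t, run over a flattened spatiotemporal token sequence.
    x: (L, D); A, B, C: (D,) per-channel diagonal parameters."""
    h = torch.zeros_like(x[0])
    ys = []
    for xt in x:                       # O(L) recurrence instead of O(L^2) attention
        h = A * h + B * xt
        ys.append(C * h)
    return torch.stack(ys)

T, H, W, D = 8, 14, 14, 32
tokens = torch.randn(T * H * W, D)     # video tokens flattened along time then space
y = ssm_scan(tokens, torch.full((D,), 0.9), torch.ones(D), torch.ones(D))
```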
arXiv Detail & Related papers (2024-01-25T13:27:03Z)
- Efficient Deformable Tissue Reconstruction via Orthogonal Neural Plane [58.871015937204255]
We introduce Fast Orthogonal Plane (Forplane) for the reconstruction of deformable tissues.
We conceptualize surgical procedures as 4D volumes and break them down into static and dynamic fields composed of neural planes.
This factorization discretizes the four-dimensional space, leading to decreased memory usage and faster optimization.
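A sketch of the orthogonal-plane factorization: a 4D query point gathers bilinear samples from three static spatial planes and three dynamic space-time planes. The plane layout and product fusion below are assumptions about the general technique, not Forplane's exact parameterization:

```python
import torch
import torch.nn.functional as F

def sample_plane(plane, u, v):
    """Bilinearly sample a (C, H, W) feature plane at normalized coords in [-1, 1]."""
    grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)
    feat = F.grid_sample(plane.unsqueeze(0), grid, align_corners=True)  # (1, C, P, 1)
    return feat.view(plane.shape[0], -1).t()                            # (P, C)

def plane_features(planes, x, y, z, t):
    """A 4D point (x, y, z, t) gathers features from static spatial planes
    (xy, xz, yz) and dynamic space-time planes (xt, yt, zt); Hadamard fusion."""
    pairs = [('xy', x, y), ('xz', x, z), ('yz', y, z),    # static field
             ('xt', x, t), ('yt', y, t), ('zt', z, t)]    # dynamic field
    feat = torch.ones(x.shape[0], planes['xy'].shape[0])
    for name, u, v in pairs:
        feat = feat * sample_plane(planes[name], u, v)
    return feat                                           # goes to a small MLP decoder

C, R = 16, 64
planes = {k: torch.randn(C, R, R) for k in ['xy', 'xz', 'yz', 'xt', 'yt', 'zt']}
pts = torch.rand(1024, 4) * 2 - 1                         # normalized (x, y, z, t)
feats = plane_features(planes, *pts.unbind(-1))           # (1024, C)
```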
arXiv Detail & Related papers (2023-12-23T13:27:50Z)
- GLSFormer: Gated - Long, Short Sequence Transformer for Step Recognition in Surgical Videos [57.93194315839009]
We propose a vision transformer-based approach to learn temporal features directly from sequence-level patches.
We extensively evaluate our approach on two cataract surgery video datasets, Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods.
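To make the input pipeline concrete, a generic sequence-level patch (tubelet) embedding feeding a standard Transformer encoder; GLSFormer's gated long/short stream design is not reproduced here:

```python
import torch

class TubeletEmbed(torch.nn.Module):
    """Generic video-ViT front end: a 3D convolution turns a short clip into
    spatiotemporal tokens that a standard Transformer encoder can consume."""
    def __init__(self, dim=384, tube=(2, 16, 16)):
        super().__init__()
        self.proj = torch.nn.Conv3d(3, dim, kernel_size=tube, stride=tube)

    def forward(self, clip):                       # clip: (B, 3, T, H, W)
        tokens = self.proj(clip)                   # (B, D, T', H', W')
        return tokens.flatten(2).transpose(1, 2)   # (B, T'*H'*W', D)

embed = TubeletEmbed()
tokens = embed(torch.randn(2, 3, 8, 224, 224))     # (2, 4*14*14, 384)
encoder = torch.nn.TransformerEncoderLayer(384, 6, batch_first=True)
out = encoder(tokens)                              # temporal features from patches
```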
arXiv Detail & Related papers (2023-07-20T17:57:04Z)
- Neural LerPlane Representations for Fast 4D Reconstruction of Deformable Tissues [52.886545681833596]
LerPlane is a novel method for fast and accurate reconstruction of surgical scenes under a single-viewpoint setting.
LerPlane treats surgical procedures as 4D volumes and factorizes them into explicit 2D planes of static and dynamic fields.
LerPlane shares static fields, significantly reducing the workload of dynamic tissue modeling.
arXiv Detail & Related papers (2023-05-31T14:38:35Z)
- Temporally Constrained Neural Networks (TCNN): A framework for semi-supervised video semantic segmentation [5.0754434714665715]
We present Temporally Constrained Neural Networks (TCNN), a semi-supervised framework used for video semantic segmentation of surgical videos.
In this work, we show that autoencoder networks can be used to efficiently provide both spatial and temporal supervisory signals.
We demonstrate that lower-dimensional representations of predicted masks can be leveraged to provide a consistent improvement on sparsely labeled datasets.
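A minimal sketch of such an autoencoder-derived temporal signal, assuming a frozen encoder that maps predicted masks to low-dimensional codes; the encoder architecture and MSE pairing below are illustrative, not TCNN's published losses:

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(mask_logits, encoder):
    """A (pretrained, frozen) autoencoder encoder maps predicted masks to a
    low-dimensional code, and codes of adjacent frames are pulled together
    as an unsupervised temporal supervisory signal.
    mask_logits: (T, K, H, W) per-frame segmentation logits."""
    codes = encoder(mask_logits.softmax(dim=1))        # (T, Z) latent mask codes
    return F.mse_loss(codes[1:], codes[:-1].detach())  # adjacent frames should agree

# Toy usage with a hypothetical encoder:
enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(5 * 64 * 64, 32))
loss = temporal_consistency_loss(torch.randn(6, 5, 64, 64), enc)
```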
arXiv Detail & Related papers (2021-12-27T18:06:12Z)
- Efficient Global-Local Memory for Real-time Instrument Segmentation of Robotic Surgical Video [53.14186293442669]
We identify two important clues for surgical instrument perception: local temporal dependency from adjacent frames and global semantic correlation over the long-range duration.
We propose a novel dual-memory network (DMNet) to relate both global and local temporal knowledge.
Our method largely outperforms the state-of-the-art works on segmentation accuracy while maintaining a real-time speed.
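A hedged sketch of a dual-memory read: the current frame attends to a small local memory of adjacent-frame features and a compact global memory of long-range summaries. The additive fusion here is an assumption; DMNet's actual aggregation differs in detail:

```python
import torch

def memory_read(query, mem_keys, mem_vals):
    """Attention-based read from a feature memory.
    query: (N, D), mem_keys/mem_vals: (M, D)."""
    attn = (query @ mem_keys.t() / query.shape[-1] ** 0.5).softmax(dim=-1)
    return attn @ mem_vals

def dual_memory(curr, local_mem, global_mem):
    """Fuse a read from a local memory (adjacent frames) with a read from a
    global memory (long-range summaries). curr: (N, D) current-frame features."""
    return curr + memory_read(curr, *local_mem) + memory_read(curr, *global_mem)

N, D = 196, 64
curr = torch.randn(N, D)
local_mem = (torch.randn(2 * N, D), torch.randn(2 * N, D))   # last 2 frames
global_mem = (torch.randn(16, D), torch.randn(16, D))        # pooled long-range summary
fused = dual_memory(curr, local_mem, global_mem)             # (N, D) -> segmentation head
```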
arXiv Detail & Related papers (2021-09-28T10:10:14Z)
- SurgeonAssist-Net: Towards Context-Aware Head-Mounted Display-Based Augmented Reality for Surgical Guidance [18.060445966264727]
SurgeonAssist-Net is a framework making action-and-workflow-driven virtual assistance accessible to commercially available optical see-through head-mounted displays (OST-HMDs).
Our implementation competes with state-of-the-art approaches in prediction accuracy for automated task recognition.
It is capable of near real-time performance on the Microsoft HoloLens 2 OST-HMD.
arXiv Detail & Related papers (2021-07-13T21:12:34Z)
- LRTD: Long-Range Temporal Dependency based Active Learning for Surgical Workflow Recognition [67.86810761677403]
We propose a novel active learning method for cost-effective surgical video analysis.
Specifically, we propose a non-local recurrent convolutional network (NL-RCNet), which introduces a non-local block to capture the long-range temporal dependency.
We validate our approach on a large surgical video dataset (Cholec80) by performing the surgical workflow recognition task.
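The non-local block referenced here is, in its generic form, self-attention across frames; a minimal version over per-frame features (plain linear projections, not necessarily the paper's exact embodiment) looks like:

```python
import torch

def non_local_block(x, theta, phi, g):
    """Generic non-local operation over a clip's frame features: every frame
    attends to every other frame, capturing long-range temporal dependency.
    x: (T, D), one feature vector per frame."""
    attn = (theta(x) @ phi(x).t() / x.shape[-1] ** 0.5).softmax(dim=-1)  # (T, T)
    return x + attn @ g(x)                         # residual non-local response

T, D = 80, 512
lin = lambda: torch.nn.Linear(D, D, bias=False)
feats = torch.randn(T, D)                          # e.g. per-frame CNN features
out = non_local_block(feats, lin(), lin(), lin())  # feeds the recurrent recognizer
```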
arXiv Detail & Related papers (2020-04-21T09:21:22Z)