Token Merging via Spatiotemporal Information Mining for Surgical Video Understanding
- URL: http://arxiv.org/abs/2509.23672v1
- Date: Sun, 28 Sep 2025 06:24:57 GMT
- Title: Token Merging via Spatiotemporal Information Mining for Surgical Video Understanding
- Authors: Xixi Jiang, Chen Yang, Dong Zhang, Pingcheng Dong, Xin Yang, Kwang-Ting Cheng
- Abstract summary: We propose a spatiotemporal information mining token merging (STIM-TM) method, representing the first dedicated approach for surgical video understanding tasks. STIM-TM introduces a decoupled strategy that reduces token redundancy along the temporal and spatial dimensions independently. Operating in a training-free manner, STIM-TM achieves significant efficiency gains with over $65\%$ GFLOPs reduction while preserving competitive accuracy across comprehensive surgical video tasks.
- Score: 32.4892900455388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer models have shown impressive effectiveness in surgical video understanding tasks through long-range dependency modeling. However, current methods suffer from prohibitive computational costs due to processing massive spatiotemporal tokens across video frames. While prior work on token merging has advanced model efficiency, it fails to adequately consider the inherent spatiotemporal structure of video data and overlooks the heterogeneous nature of information distribution, leading to suboptimal performance. In this paper, we propose a spatiotemporal information mining token merging (STIM-TM) method, representing the first dedicated approach for surgical video understanding. STIM-TM introduces a decoupled strategy that reduces token redundancy along the temporal and spatial dimensions independently. Specifically, the temporal component merges spatially corresponding tokens from consecutive frames using saliency weighting, preserving critical sequential information and maintaining continuity. Meanwhile, the spatial component prioritizes merging static tokens through temporal stability analysis, protecting dynamic regions containing essential surgical information. Operating in a training-free manner, STIM-TM achieves significant efficiency gains with over $65\%$ GFLOPs reduction while preserving competitive accuracy across comprehensive surgical video tasks. Our method also supports efficient training on long-sequence surgical videos, addressing computational bottlenecks in surgical applications.
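To make the decoupled strategy concrete, the sketch below restates the abstract's two-stage idea in plain NumPy. It is an illustrative reading, not the authors' implementation: the pairwise merge schedule, the L2-norm saliency proxy, the temporal-variance stability measure, and the keep ratio are all our assumptions.

```python
# Minimal sketch of decoupled spatiotemporal token merging, following the
# abstract's description. Saliency = token L2 norm and stability = temporal
# variance are illustrative stand-ins, not the paper's actual measures.
import numpy as np

def temporal_merge(tokens: np.ndarray) -> np.ndarray:
    """Merge spatially corresponding tokens of consecutive frame pairs.

    tokens: (T, N, D) array -- T frames, N tokens per frame, D channels.
    Returns (ceil(T/2), N, D): each output frame is a saliency-weighted
    average of a consecutive frame pair, preserving temporal order.
    """
    T = tokens.shape[0]
    tail = None
    if T % 2:  # keep an odd trailing frame unmerged
        tokens, tail = tokens[:-1], tokens[-1:]
    pairs = tokens.reshape(T // 2, 2, *tokens.shape[1:])      # (T/2, 2, N, D)
    saliency = np.linalg.norm(pairs, axis=-1, keepdims=True)  # (T/2, 2, N, 1)
    weights = saliency / saliency.sum(axis=1, keepdims=True)
    merged = (weights * pairs).sum(axis=1)                    # (T/2, N, D)
    return np.concatenate([merged, tail]) if tail is not None else merged

def spatial_merge(tokens: np.ndarray, keep_ratio: float = 0.75) -> np.ndarray:
    """Merge the most temporally static tokens within each frame.

    Static positions (low variance across frames) are averaged in adjacent
    pairs; dynamic positions -- where surgical action happens -- are kept.
    Requires keep_ratio >= 0.5 since two static tokens merge into one.
    """
    T, N, D = tokens.shape
    stability = tokens.var(axis=0).mean(axis=-1)  # (N,), low = static
    n_keep = int(N * keep_ratio)
    n_merge = (N - n_keep) * 2                    # tokens consumed by merging
    order = np.argsort(stability)                 # most static first
    static, dynamic = order[:n_merge], order[n_merge:]
    merged = tokens[:, static].reshape(T, n_merge // 2, 2, D).mean(axis=2)
    return np.concatenate([merged, tokens[:, dynamic]], axis=1)

video_tokens = np.random.randn(8, 196, 768)  # e.g. 8 frames of 14x14 patches
out = spatial_merge(temporal_merge(video_tokens))
print(out.shape)  # (4, 147, 768) -- fewer tokens along both dimensions
```

Decoupling the two passes in this way keeps the temporal merge from mixing tokens at different spatial positions, which is what allows the spatial pass to still reason about per-position stability afterwards.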
Related papers
- Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models [61.11154533305096]
Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. We propose a new perspective that elaborates token Anchors within intra-frame and inter-frame contexts. Our proposed AOT obtains competitive performance across various short- and long-video benchmarks on leading video LLMs.
arXiv Detail & Related papers (2026-03-02T03:06:40Z) - Multimodal Optimal Transport for Unsupervised Temporal Segmentation in Surgical Robotics [2.582839864045357]
Recognizing surgical phases and steps from video is a fundamental problem in computer-assisted interventions. Recent approaches increasingly rely on large-scale pre-training on thousands of labeled surgical videos, followed by zero-shot transfer to specific procedures. We propose Text-Augmented Action Optimal Transport (TASOT), an unsupervised method for surgical phase and step recognition.
arXiv Detail & Related papers (2026-02-27T16:15:58Z) - Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential [26.958261975749974]
We propose SpikeSurgSeg, the first spike-driven video Transformer framework tailored for surgical scene segmentation. SpikeSurgSeg achieves mIoU comparable to SOTA ANN-based models while reducing inference latency by at least $8\times$.
arXiv Detail & Related papers (2025-12-24T17:05:09Z) - Exploiting Temporal State Space Sharing for Video Semantic Segmentation [53.8810901249897]
Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes. Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements. We introduce a Temporal Video State Space Sharing architecture to leverage Mamba state space models for temporal feature sharing. Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool.
arXiv Detail & Related papers (2025-03-26T01:47:42Z) - GLSFormer: Gated - Long, Short Sequence Transformer for Step Recognition in Surgical Videos [57.93194315839009]
We propose a vision transformer-based approach to learn temporal features directly from sequence-level patches.
We extensively evaluate our approach on two cataract surgery video datasets, Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods.
arXiv Detail & Related papers (2023-07-20T17:57:04Z) - TUNeS: A Temporal U-Net with Self-Attention for Video-based Surgical Phase Recognition [1.5237530964650965]
We propose a novel approach that uses attention more effectively and does not require hand-crafted constraints. TUNeS is an efficient and simple temporal model that incorporates self-attention at the core of a convolutional U-Net structure. TUNeS achieves state-of-the-art results on the Cholec80 dataset.
arXiv Detail & Related papers (2023-07-19T14:10:55Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures trained on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - Efficient Global-Local Memory for Real-time Instrument Segmentation of Robotic Surgical Video [53.14186293442669]
We identify two important clues for surgical instrument perception: local temporal dependency from adjacent frames and global semantic correlation over long-range durations.
We propose a novel dual-memory network (DMNet) to relate both global and local temporal knowledge.
Our method largely outperforms the state-of-the-art works on segmentation accuracy while maintaining a real-time speed.
arXiv Detail & Related papers (2021-09-28T10:10:14Z) - Temporal Memory Relation Network for Workflow Recognition from Surgical Video [53.20825496640025]
We propose a novel end-to-end temporal memory relation network (TMNet) for relating long-range and multi-scale temporal patterns.
We have extensively validated our approach on two benchmark surgical video datasets.
arXiv Detail & Related papers (2021-03-30T13:20:26Z) - Symmetric Dilated Convolution for Surgical Gesture Recognition [10.699258974625073]
We propose a novel temporal convolutional architecture to automatically detect and segment surgical gestures.
We devise our method with a symmetric dilation structure bridged by a self-attention module to encode and decode the long-term temporal patterns.
We validate our approach on a fundamental robotic suturing task from the JIGSAWS dataset.
arXiv Detail & Related papers (2020-07-13T13:34:48Z)
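The symmetric-dilation pattern in this last entry is concrete enough to sketch. Below is a minimal PyTorch reading of it: a stack of dilated temporal convolutions whose dilation doubles on the way in and halves on the way out, bridged by a self-attention layer. The layer count, channel width, residual connections, and single-head attention are our assumptions for brevity, not details from the paper.

```python
# Illustrative sketch of a symmetric dilated temporal convolution network
# bridged by self-attention, loosely following the entry above. All sizes
# and the single-head attention bridge are assumptions, not paper details.
import torch
import torch.nn as nn

class SymmetricDilatedTCN(nn.Module):
    def __init__(self, in_dim=128, ch=64, n_classes=10, levels=4):
        super().__init__()
        self.inp = nn.Conv1d(in_dim, ch, 1)
        # Encoder: dilation doubles per layer (1, 2, 4, 8) to grow the
        # temporal receptive field while keeping sequence length fixed.
        self.encoder = nn.ModuleList(
            nn.Conv1d(ch, ch, 3, padding=2**i, dilation=2**i)
            for i in range(levels))
        # Self-attention bridge relating distant time steps.
        self.attn = nn.MultiheadAttention(ch, num_heads=1, batch_first=True)
        # Decoder mirrors the encoder with decreasing dilations (8, 4, 2, 1).
        self.decoder = nn.ModuleList(
            nn.Conv1d(ch, ch, 3, padding=2**i, dilation=2**i)
            for i in reversed(range(levels)))
        self.head = nn.Conv1d(ch, n_classes, 1)  # per-frame gesture logits

    def forward(self, x):                  # x: (B, in_dim, T) frame features
        h = self.inp(x)
        for conv in self.encoder:
            h = torch.relu(conv(h)) + h    # residual dilated conv block
        a = h.transpose(1, 2)              # (B, T, ch) for attention
        a, _ = self.attn(a, a, a)
        h = h + a.transpose(1, 2)          # bridge encoder to decoder
        for conv in self.decoder:
            h = torch.relu(conv(h)) + h
        return self.head(h)                # (B, n_classes, T)

model = SymmetricDilatedTCN()
logits = model(torch.randn(2, 128, 300))   # 2 clips, 300 frames each
print(logits.shape)                        # torch.Size([2, 10, 300])
```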