Temporal Cluster Assignment for Efficient Real-Time Video Segmentation
- URL: http://arxiv.org/abs/2508.05851v1
- Date: Thu, 07 Aug 2025 20:52:49 GMT
- Title: Temporal Cluster Assignment for Efficient Real-Time Video Segmentation
- Authors: Ka-Wai Yung, Felix J. S. Bragman, Jialang Xu, Imanol Luengo, Danail Stoyanov, Evangelos B. Mazomenos
- Abstract summary: Vision Transformers have substantially advanced the capabilities of segmentation models across both image and video domains. The window-based attention mechanism of Swin requires a fixed number of tokens per window, limiting the applicability of conventional pruning techniques. We introduce Temporal Cluster Assignment (TCA), a lightweight, effective, fine-tuning-free strategy that enhances token clustering by leveraging temporal coherence.
- Score: 9.248291541710781
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers have substantially advanced the capabilities of segmentation models across both image and video domains. Among them, the Swin Transformer stands out for its ability to capture hierarchical, multi-scale representations, making it a popular backbone for segmentation in videos. However, despite its window-attention scheme, it still incurs a high computational cost, especially in larger variants commonly used for dense prediction in videos. This remains a major bottleneck for real-time, resource-constrained applications. Whilst token reduction methods have been proposed to alleviate this, the window-based attention mechanism of Swin requires a fixed number of tokens per window, limiting the applicability of conventional pruning techniques. Meanwhile, training-free token clustering approaches have shown promise in image segmentation while maintaining window consistency. Nevertheless, they fail to exploit temporal redundancy, missing a key opportunity to further optimize video segmentation performance. We introduce Temporal Cluster Assignment (TCA), a lightweight, effective, fine-tuning-free strategy that enhances token clustering by leveraging temporal coherence across frames. Instead of indiscriminately dropping redundant tokens, TCA refines token clusters using temporal correlations, thereby retaining fine-grained details while significantly reducing computation. Extensive evaluations on YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and a private surgical video dataset show that TCA consistently boosts the accuracy-speed trade-off of existing clustering-based methods. Our results demonstrate that TCA generalizes well across both natural and domain-specific videos.
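The core idea described in the abstract, refining each frame's token clusters using temporal coherence rather than clustering every frame from scratch, can be illustrated with a minimal sketch: warm-start the current frame's clustering from the previous frame's cluster centers so that assignments stay temporally consistent. This is an illustrative reading of the abstract, not the authors' implementation; the function names, the plain k-means-style refinement, and the NumPy token features are all assumptions.

```python
import numpy as np

def cluster_tokens(tokens, centers, iters=3):
    """K-means-style refinement: assign tokens to the nearest center, update centers."""
    for _ in range(iters):
        # Pairwise distances: (num_tokens, num_clusters)
        d = np.linalg.norm(tokens[:, None, :] - centers[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        for k in range(centers.shape[0]):
            members = tokens[assign == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return assign, centers

def temporal_cluster_assignment(frames, num_clusters=8, seed=0):
    """Cluster each frame's tokens, warm-starting from the previous frame's centers."""
    rng = np.random.default_rng(seed)
    # Initialize centers from randomly chosen tokens of the first frame.
    centers = frames[0][rng.choice(len(frames[0]), num_clusters, replace=False)].copy()
    assignments = []
    for tokens in frames:
        assign, centers = cluster_tokens(tokens, centers)
        assignments.append(assign)
    return assignments

# Two frames of 64 tokens with 16-dim features; consecutive frames are near-identical,
# mimicking the temporal redundancy the method exploits.
f0 = np.random.default_rng(1).normal(size=(64, 16))
f1 = f0 + 0.01 * np.random.default_rng(2).normal(size=(64, 16))
a = temporal_cluster_assignment([f0, f1])
```

Because the second frame barely differs from the first, the warm-started clustering yields largely unchanged assignments, which is the temporal coherence the paper exploits to cut computation.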
Related papers
- Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models [61.11154533305096]
Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. We propose a new perspective that elaborates token Anchors within intra-frame and inter-frame contexts. Our proposed AOT obtains competitive performance across various short- and long-video benchmarks on leading video LLMs.
arXiv Detail & Related papers (2026-03-02T03:06:40Z) - Fast SAM2 with Text-Driven Token Pruning [52.8350457627401]
Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation. SAM2 pipelines propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object. We introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation.
arXiv Detail & Related papers (2025-12-24T18:59:05Z) - Token Merging via Spatiotemporal Information Mining for Surgical Video Understanding [32.4892900455388]
We propose the spatiotemporal information mining token merging (STIM-TM) method, the first dedicated token-merging approach for surgical video understanding tasks. STIM-TM introduces a decoupled strategy that reduces token redundancy along the temporal and spatial dimensions independently. Operating in a training-free manner, STIM-TM achieves significant efficiency gains, with over 65 GFLOPs reduction, while preserving competitive accuracy across comprehensive surgical video tasks.
arXiv Detail & Related papers (2025-09-28T06:24:57Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture that incorporates a dedicated temporal encoder between the image encoder and the LLM. We show that STORM achieves state-of-the-art results across various long-video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - Improving Weakly-supervised Video Instance Segmentation by Leveraging Spatio-temporal Consistency [9.115508086522887]
We introduce a weakly-supervised method called EigenVIS that achieves competitive accuracy compared to other VIS approaches. The method is based on two key innovations: a Temporal Eigenvalue Loss (TEL) and a clip-level Quality Coefficient (QCC).
The code is available on https://github.com/farnooshar/EigenVIS.
arXiv Detail & Related papers (2022-11-12T20:45:35Z) - Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method to retrieve key frames, combining a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC). TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can determine the number of key frames automatically.
Furthermore, a Long Short-Term Memory (LSTM) network is added on top of the CNN to further improve classification performance.
arXiv Detail & Related papers (2022-06-20T07:20:02Z) - Distortion-Aware Network Pruning and Feature Reuse for Real-time Video Segmentation [49.17930380106643]
We propose a novel framework to speed up any architecture with skip-connections for real-time vision tasks.
Specifically, at the arrival of each frame, we transform the features from the previous frame to reuse them at specific spatial bins.
We then perform partial computation of the backbone network on the regions of the current frame that capture temporal differences between the current and previous frames.
arXiv Detail & Related papers (2022-06-20T07:20:02Z) - CenterCLIP: Token Clustering for Efficient Text-Video Retrieval [67.21528544724546]
In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive frames in videos.
This significantly increases computation costs and hinders the deployment of video retrieval models in web applications.
In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
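The keep-representative/drop-non-essential idea can be sketched as clustering a frame's tokens and retaining one medoid per cluster, so the survivors are real tokens rather than synthetic averages. This is a hypothetical sketch, not CenterCLIP's actual multi-segment algorithm; the function name and the plain k-means inner loop are assumptions.

```python
import numpy as np

def keep_representative_tokens(tokens, num_keep=4, iters=5, seed=0):
    """Cluster tokens and keep one medoid per cluster, dropping the rest."""
    rng = np.random.default_rng(seed)
    centers = tokens[rng.choice(len(tokens), num_keep, replace=False)].copy()
    for _ in range(iters):
        d = np.linalg.norm(tokens[:, None, :] - centers[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        for k in range(num_keep):
            members = tokens[assign == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    # Medoid: the actual token nearest each center, so exact features are preserved.
    d = np.linalg.norm(tokens[:, None, :] - centers[None, :, :], axis=-1)
    medoids = d.argmin(axis=0)
    return tokens[medoids]

# 32 tokens with 8-dim features reduced to 4 representatives.
toks = np.random.default_rng(3).normal(size=(32, 8))
reps = keep_representative_tokens(toks)
```

Keeping medoids instead of cluster means is one way to retain genuine visual content while still shrinking the token sequence passed downstream.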
arXiv Detail & Related papers (2022-05-02T12:02:09Z) - Temporal-attentive Covariance Pooling Networks for Video Recognition [52.853765492522655]
Existing video architectures usually generate a global representation using simple global average pooling (GAP). This paper proposes Temporal-attentive Covariance Pooling (TCP), inserted at the end of deep architectures, to produce powerful video representations.
Our TCP is model-agnostic and can be flexibly integrated into any video architectures, resulting in TCPNet for effective video recognition.
arXiv Detail & Related papers (2021-10-27T12:31:29Z) - Unsupervised Action Segmentation by Joint Representation Learning and Online Clustering [10.057155889852174]
We present a novel approach for unsupervised activity segmentation which uses video frame clustering as a pretext task.
We leverage temporal information in videos by employing temporal optimal transport.
Our approach performs on par with or better than previous methods, despite having significantly lower memory requirements.
arXiv Detail & Related papers (2021-05-27T17:57:37Z)
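The temporal-optimal-transport idea in the last entry can be illustrated with a generic entropy-regularized (Sinkhorn) solver that produces a balanced soft assignment of frames to clusters. This is a standard Sinkhorn sketch under uniform marginals, not the paper's exact formulation; the cost matrix and parameter choices are toy assumptions.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, iters=200):
    """Entropy-regularized OT plan between uniform marginals via Sinkhorn scaling."""
    n, m = cost.shape
    r = np.full(n, 1.0 / n)          # frame marginal (uniform)
    c = np.full(m, 1.0 / m)          # cluster marginal (uniform)
    K = np.exp(-cost / eps)          # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(iters):
        u = r / (K @ v)              # rescale rows toward r
        v = c / (K.T @ u)            # rescale columns toward c
    return u[:, None] * K * v[None, :]

# Toy cost: distance from each of 6 frame embeddings to 3 cluster prototypes.
cost = np.random.default_rng(0).random((6, 3))
plan = sinkhorn_plan(cost)
```

The balanced marginals are what prevent the degenerate solution where all frames collapse into one cluster, which is the usual motivation for using optimal transport in clustering-as-pretext-task setups.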
This list is automatically generated from the titles and abstracts of the papers in this site.