Related papers: VidEoMT: Your ViT is Secretly Also a Video Segmentation Model

URL: http://arxiv.org/abs/2602.17807v2
Date: Mon, 23 Feb 2026 18:10:18 GMT
Title: VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
Authors: Narges Norouzi, Idil Esen Zulfikar, Niccolò Cavagnero, Tommie Kerssies, Bastian Leibe, Gijs Dubbelman, Daan de Geus
Abstract summary: Existing online video segmentation models typically combine a per-frame segmenter with complex specialized tracking modules. Recent studies suggest plain Vision Transformer (ViT) encoders can conduct accurate image segmentation without requiring specialized modules. We propose the Video Encoder-only Mask Transformer (VidEoMT), a simple encoder-only video segmentation model that eliminates the need for dedicated tracking modules.
Score: 30.92193335524048
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing online video segmentation models typically combine a per-frame segmenter with complex specialized tracking modules. While effective, these modules introduce significant architectural complexity and computational overhead. Recent studies suggest that plain Vision Transformer (ViT) encoders, when scaled with sufficient capacity and large-scale pre-training, can conduct accurate image segmentation without requiring specialized modules. Motivated by this observation, we propose the Video Encoder-only Mask Transformer (VidEoMT), a simple encoder-only video segmentation model that eliminates the need for dedicated tracking modules. To enable temporal modeling in an encoder-only ViT, VidEoMT introduces a lightweight query propagation mechanism that carries information across frames by reusing queries from the previous frame. To balance this with adaptability to new content, it employs a query fusion strategy that combines the propagated queries with a set of temporally-agnostic learned queries. As a result, VidEoMT attains the benefits of a tracker without added complexity, achieving competitive accuracy while being 5x-10x faster, running at up to 160 FPS with a ViT-L backbone. Code: https://www.tue-mps.org/videomt/
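The query propagation and fusion loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the fusion operator, so element-wise addition is assumed here, and the names `fuse_queries`, `run_online`, and `toy_encoder` are hypothetical. The toy encoder merely stands in for the ViT so the control flow can be run end to end.

```python
from typing import Callable, List, Optional

Vector = List[float]
Queries = List[Vector]


def fuse_queries(propagated: Queries, learned: Queries) -> Queries:
    """Combine queries carried over from the previous frame with the
    temporally-agnostic learned queries.

    ASSUMPTION: element-wise addition; the abstract does not specify
    the fusion operator.
    """
    return [
        [p + l for p, l in zip(prop_vec, learned_vec)]
        for prop_vec, learned_vec in zip(propagated, learned)
    ]


def run_online(
    frames: List[float],
    learned: Queries,
    encode_frame: Callable[[float, Queries], Queries],
) -> List[Queries]:
    """Process frames one at a time, reusing the previous frame's output queries."""
    prev: Optional[Queries] = None
    outputs: List[Queries] = []
    for frame in frames:
        # First frame: nothing to propagate, so use the learned queries alone.
        queries_in = learned if prev is None else fuse_queries(prev, learned)
        prev = encode_frame(frame, queries_in)
        outputs.append(prev)
    return outputs


def toy_encoder(frame: float, queries: Queries) -> Queries:
    """Hypothetical stand-in for the ViT: adds the scalar 'frame' to every entry."""
    return [[q + frame for q in vec] for vec in queries]


learned_queries = [[0.0, 1.0], [2.0, 3.0]]
per_frame_queries = run_online([1.0, 2.0], learned_queries, toy_encoder)
```

In this sketch each query slot keeps its index across frames, which is what lets a slot follow the same object over time without a separate tracking module. The learned queries are added back in at every step so that new content can still be picked up.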

Related papers

TrajTok: Learning Trajectory Tokens enables better Video Understanding [63.1260672430712]
Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.
arXiv Detail & Related papers (2026-02-26T09:15:34Z)
Déjà Vu: Efficient Video-Language Query Engine with Learning-based Inter-Frame Computation Reuse [13.680753232748705]
This paper introduces Déjà Vu, a video-language query engine that accelerates ViT-based VideoLMs by reusing computations across consecutive frames. At its core is ReuseViT, a modified ViT model specifically designed for VideoLM tasks, which learns to detect inter-frame reuse opportunities. We show that Déjà Vu accelerates embedding generation by up to 2.64x within a 2% error bound, dramatically enhancing the practicality of VideoLMs for large-scale video analytics.
arXiv Detail & Related papers (2025-06-17T01:59:10Z)
Your ViT is Secretly an Image Segmentation Model [50.71238842539735]
Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. We show that inductive biases introduced by task-specific components can instead be learned by the ViT itself. We introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation.
arXiv Detail & Related papers (2025-03-24T19:56:02Z)
STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
Spatio-temporal Prompting Network for Robust Video Feature Extraction [74.54597668310707]
Frame quality deterioration is one of the main challenges in the field of video understanding. Recent approaches exploit transformer-based integration modules to obtain spatio-temporal information. We present a neat and unified framework called Spatio-Temporal Prompting Network (STPN). It can efficiently extract video features by adjusting the input features in the network backbone.
arXiv Detail & Related papers (2024-02-04T17:52:04Z)
Multi-entity Video Transformers for Fine-Grained Video Representation Learning [34.26732761916984]
We re-examine the design of transformer architectures for video representation learning. A key aspect of our approach is the improved sharing of scene information in the temporal pipeline. Our Multi-entity Video Transformer (MV-Former) processes the frames as groups of entities represented as tokens linked across time.
arXiv Detail & Related papers (2023-11-17T21:23:12Z)
SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers [76.13755422671822]
This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework. We introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder effective for plain ViT. Our decoder outperforms the popular decoder UPerNet using various ViT backbones while consuming only about 5% of the computational cost.
arXiv Detail & Related papers (2023-06-09T22:29:56Z)
Multiview Transformers for Video Recognition [69.50552269271526]
We present Multiview Transformers for Video Recognition (MTV), a model that operates on different resolutions of the input video. MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost. We achieve state-of-the-art results on five standard datasets, and improve even further with large-scale pretraining.
arXiv Detail & Related papers (2022-01-12T03:33:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
