Related papers: One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory

URL: http://arxiv.org/abs/2505.23617v2
Date: Wed, 09 Jul 2025 18:41:10 GMT
Title: One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory
Authors: Chenhao Zheng, Jieyu Zhang, Mohammadreza Salehi, Ziqi Gao, Vishnu Iyengar, Norimasa Kobori, Quan Kong, Ranjay Krishna
Abstract summary: We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens. We show that TrajViT is a stronger video encoder than ViT3D for modern VideoLLMs.
Score: 25.726492556054904
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Effective video tokenization is critical for scaling transformer models for long videos. Current approaches tokenize videos using space-time patches, leading to excessive tokens and computational inefficiencies. The best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms space-time ViT (ViT3D) across multiple video understanding benchmarks; for example, it outperforms ViT3D by a large margin of 6% average top-5 recall on video-text retrieval with a 10x token reduction. We also show that TrajViT is a stronger video encoder than ViT3D for modern VideoLLMs, obtaining an average performance improvement of 5.2% across 6 VideoQA benchmarks with 4x faster training and 18x fewer inference FLOPs. TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, making it a robust and scalable solution.
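The abstract's claim that tokenization should reflect scene complexity rather than video duration can be illustrated with a rough token-count sketch. This is not TrajViT's implementation: the patch sizes, the 224x224 resolution, and the fixed count of 128 trajectories are all hypothetical values chosen for illustration, and the sketch assumes the scene content stays the same as the clip gets longer.

```python
def patch_token_count(frames, height, width, patch_t=2, patch_h=16, patch_w=16):
    """Tokens from non-overlapping space-time patches (ViT3D-style).

    Patch sizes are assumed values, not taken from the paper.
    """
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)


def trajectory_token_count(num_trajectories):
    """One token per trajectory, so the count does not depend on clip length."""
    return num_trajectories


if __name__ == "__main__":
    # Hypothetical scene with 128 sub-object trajectories at 224x224 resolution.
    for frames in (16, 64, 256):
        patches = patch_token_count(frames, 224, 224)
        trajectories = trajectory_token_count(128)
        print(f"{frames:>3} frames: {patches:>6} patch tokens, "
              f"{trajectories} trajectory tokens")
```

Under these assumptions the patch count grows linearly with the number of frames while the trajectory count stays fixed. In a real video, new objects entering the scene would add trajectories, so the count would follow scene content rather than remain constant.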

Related papers

TrajTok: Learning Trajectory Tokens enables better Video Understanding [63.1260672430712]
Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.
arXiv Detail & Related papers (2026-02-26T09:15:34Z)
VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents [33.80068883432077]
This work presents VTok, a unified video tokenization framework that can be used for both generation and understanding tasks. We propose to decouple the spatial and temporal representations of videos by retaining the spatial features of a single key frame while encoding each subsequent frame into a single residual token. Our experiments suggest that VTok effectively reduces the complexity of video representation from the product of frame count and per-frame token count to their sum.
arXiv Detail & Related papers (2026-02-04T04:39:46Z)
Déjà Vu: Efficient Video-Language Query Engine with Learning-based Inter-Frame Computation Reuse [7.283352519499699]
This paper introduces Déjà Vu, a video-language query engine that accelerates ViT-based VideoLMs by reusing computations across consecutive frames. At its core is ReuseViT, a modified ViT model specifically designed for VideoLM tasks, which learns to detect inter-frame reuse opportunities. We show that Déjà Vu accelerates embedding generation by up to 2.64x within a 2% error bound, dramatically enhancing the practicality of VideoLMs for large-scale video analytics.
arXiv Detail & Related papers (2025-06-17T01:59:10Z)
HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models [63.65066762436074]
HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks. It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks.
arXiv Detail & Related papers (2025-03-14T15:36:39Z)
Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression [78.93023152602525]
Slow inference speed is one of the most crucial concerns for deploying multi-view 3D detectors to tasks with high real-time requirements like autonomous driving. We propose a simple yet effective method called TokenCompression3D (ToC3D). Our method can nearly maintain the performance of recent SOTA with up to 30% inference speedup, and the improvements are consistent after scaling up the ViT and input resolution.
arXiv Detail & Related papers (2024-09-01T06:58:08Z)
AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance. We propose to apply adaptive resolution for different regions in the image according to their importance. We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z)
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding [20.16000249533665]
TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame. Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video block. We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks.
arXiv Detail & Related papers (2023-10-29T16:25:32Z)
Efficient Video Action Detection with Token Dropout and Context Refinement [67.10895416008911]
We propose an end-to-end framework for efficient video action detection with vision transformers (ViTs). First, in a video clip, we maintain tokens from its perspective while preserving tokens relevant to actor motions from other frames. Second, we refine scene context by leveraging remaining tokens for better recognizing actor identities.
arXiv Detail & Related papers (2023-04-17T17:21:21Z)
Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts. Existing approaches usually align and aggregate video frames from limited adjacent frames. We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z)
VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data. We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z)
VidTr: Video Transformer Without Convolutions [32.710988574799735]
We introduce Video Transformer (VidTr) with separable-attention for temporal video classification. VidTr is able to aggregate temporal information via stacked attentions and provide better performance with higher efficiency.
arXiv Detail & Related papers (2021-04-23T17:59:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.