Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction
- URL: http://arxiv.org/abs/2411.14762v2
- Date: Tue, 26 Nov 2024 14:03:14 GMT
- Title: Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction
- Authors: Huiwon Jang, Sihyun Yu, Jinwoo Shin, Pieter Abbeel, Younggyo Seo
- Abstract summary: CoordTok is a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos.
CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled $(x,y,t)$ coordinates.
- Score: 93.69757398746017
- License:
- Abstract: Efficient tokenization of videos remains a challenge in training vision models that can process long videos. One promising direction is to develop a tokenizer that can encode long video clips, as it would enable the tokenizer to leverage the temporal coherence of videos better for tokenization. However, training existing tokenizers on long videos often incurs a huge training cost as they are trained to reconstruct all the frames at once. In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos, inspired by recent advances in 3D generative models. In particular, CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled $(x,y,t)$ coordinates. This allows for training large tokenizer models directly on long videos without requiring excessive training resources. Our experiments show that CoordTok can drastically reduce the number of tokens for encoding long video clips. For instance, CoordTok can encode a 128-frame video with 128$\times$128 resolution into 1280 tokens, while baselines need 6144 or 8192 tokens to achieve similar reconstruction quality. We further show that this efficient video tokenization enables memory-efficient training of a diffusion transformer that can generate 128 frames at once.
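To make the scheme described in the abstract concrete, below is a minimal PyTorch sketch of decoding a pixel patch from factorized triplane features at a sampled $(x,y,t)$ coordinate and training on only those sampled patches. It is a sketch under stated assumptions: the plane resolution, feature width, MLP decoder, and the use of free triplane parameters in place of an encoder-produced representation are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of coordinate-based patch decoding from factorized triplane
# features. Sizes, module shapes, and the free (non-encoded) triplanes are
# illustrative assumptions, not the CoordTok implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplanePatchDecoder(nn.Module):
    def __init__(self, dim=256, patch=8, channels=3):
        super().__init__()
        # Three factorized feature planes: (x, y), (x, t), (y, t).
        # In the paper these come from an encoder; here they are free parameters.
        self.plane_xy = nn.Parameter(torch.randn(1, dim, 32, 32))
        self.plane_xt = nn.Parameter(torch.randn(1, dim, 32, 32))
        self.plane_yt = nn.Parameter(torch.randn(1, dim, 32, 32))
        # MLP maps the fused coordinate feature to an RGB patch.
        self.mlp = nn.Sequential(
            nn.Linear(3 * dim, 512), nn.GELU(),
            nn.Linear(512, channels * patch * patch),
        )
        self.patch, self.channels = patch, channels

    def sample_plane(self, plane, u, v):
        # Bilinearly sample plane features at normalized coords in [-1, 1].
        grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)
        feat = F.grid_sample(plane, grid, align_corners=True)  # (1, dim, N, 1)
        return feat.squeeze(-1).squeeze(0).t()                 # (N, dim)

    def forward(self, x, y, t):
        # x, y, t: 1-D tensors of normalized coordinates in [-1, 1].
        f = torch.cat([
            self.sample_plane(self.plane_xy, x, y),
            self.sample_plane(self.plane_xt, x, t),
            self.sample_plane(self.plane_yt, y, t),
        ], dim=-1)
        out = self.mlp(f)
        return out.view(-1, self.channels, self.patch, self.patch)

# Training-step sketch: reconstruct only randomly sampled (x, y, t) patches,
# so the loss never has to decode all frames of a long clip at once.
decoder = TriplanePatchDecoder()
coords = torch.rand(3, 64) * 2 - 1            # 64 random (x, y, t) coordinates
target = torch.randn(64, 3, 8, 8)             # placeholder ground-truth patches
loss = F.mse_loss(decoder(coords[0], coords[1], coords[2]), target)
loss.backward()
```

Because the loss only ever touches the sampled patches, the memory cost of a training step does not grow with the number of frames reconstructed at once, which is the property the abstract credits for making large tokenizers trainable on long clips.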
Related papers
- Extending Video Masked Autoencoders to 128 frames [75.01251612160829]
Video understanding has witnessed significant progress, with recent video foundation models demonstrating strong performance owing to self-supervised pre-training objectives, Masked Autoencoders (MAE) being the design of choice.
However, the majority of prior works that leverage MAE pre-training have focused on relatively short video representations (16 / 32 frames in length), largely because the dense, memory-intensive self-attention decoding incurs hardware memory and compute costs that scale poorly with video length.
We propose an effective strategy for prioritizing tokens which allows training on longer video sequences (128 frames) and achieves better performance than the more typical random masking strategies; a rough sketch of the idea follows this entry.
arXiv Detail & Related papers (2024-11-20T20:00:38Z)
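As a rough illustration of the token-prioritization idea summarized in the entry above, the sketch below scores video patches with a simple temporal-difference (motion) proxy and keeps the highest-scoring fraction, next to the random selection it is compared against. The patch size, keep ratio, and the choice of score are assumptions for illustration; the paper's actual prioritization scheme may differ.

```python
# Illustrative token prioritization for masked video pre-training: score
# 16x16 patches by temporal difference (a crude motion proxy) and keep the
# highest-scoring fraction, versus a uniformly random baseline.
import torch

def select_tokens(video, keep_ratio=0.1, p=16):
    # video: (T, H, W, C) frames; tokens are non-overlapping p x p patches.
    T, H, W, C = video.shape
    patches = video.float().reshape(T, H // p, p, W // p, p, C)
    patches = patches.permute(0, 1, 3, 2, 4, 5).reshape(T, -1, p * p * C)  # (T, N, D)
    # Priority: mean absolute change of each patch w.r.t. the previous frame.
    diff = (patches[1:] - patches[:-1]).abs().mean(-1)      # (T-1, N)
    score = torch.cat([diff[:1], diff], dim=0).flatten()    # (T*N,); frame 0 reuses the first difference
    k = int(keep_ratio * score.numel())
    prioritized = score.topk(k).indices                     # motion-prioritized tokens
    rand_sel = torch.randperm(score.numel())[:k]            # random-selection baseline
    return prioritized, rand_sel

prioritized, random_baseline = select_tokens(torch.rand(128, 128, 128, 3))
```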
- ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens (a rough sketch of this idea follows this entry).
During inference, ElasticTok can dynamically allocate tokens when needed.
Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z)
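The following is a hedged sketch of the variable-length tokenization idea in the ElasticTok entry above: keep the shortest prefix of tokens whose reconstruction meets a target error, so simple frames use fewer tokens than complex ones. The encoder and decoder are placeholder stubs, the candidate prefix lengths and tolerance are arbitrary, and the conditioning on prior frames mentioned in the summary is omitted.

```python
# Hedged sketch of adaptive, variable-length tokenization: keep the shortest
# prefix of tokens whose reconstruction meets a target error. encode/decode
# are placeholder stubs, not ElasticTok's learned networks.
import torch

def encode(frame, max_tokens=256, dim=32):
    # Placeholder encoder; a real model would produce content-dependent tokens.
    return torch.randn(max_tokens, dim)

def decode(tokens, frame_shape):
    # Placeholder decoder producing a frame from a (possibly truncated) prefix.
    return torch.zeros(frame_shape) + tokens.mean()

def adaptive_tokenize(frame, tol=0.05):
    tokens = encode(frame)
    for k in (32, 64, 128, 256):                    # candidate prefix lengths
        recon = decode(tokens[:k], frame.shape)
        if (recon - frame).pow(2).mean().item() < tol:
            return tokens[:k]                       # easy frames stop early
    return tokens                                   # hard frames keep all tokens

frame = torch.rand(3, 128, 128)
print(adaptive_tokenize(frame).shape)               # at most (256, 32) tokens
```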
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding [20.16000249533665]
TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame (a rough sketch of this aggregation idea follows this entry).
Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video block.
We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks.
arXiv Detail & Related papers (2023-10-29T16:25:32Z)
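As a rough sketch of the similarity-based aggregation idea in the TESTA entry above, the code below repeatedly averages the most similar pair of adjacent frame tokens until a target length is reached. This is a generic merging scheme with illustrative sizes, not the paper's divided space-time aggregation module.

```python
# Illustrative similarity-based token aggregation: repeatedly average the most
# similar pair of adjacent frame tokens until a target length is reached.
import torch
import torch.nn.functional as F

def aggregate(tokens, target_len):
    # tokens: (T, D), one token per frame; returns (target_len, D).
    tokens = tokens.clone()
    while tokens.shape[0] > target_len:
        sim = F.cosine_similarity(tokens[:-1], tokens[1:], dim=-1)  # (T-1,)
        i = int(sim.argmax())                     # most redundant adjacent pair
        merged = (tokens[i] + tokens[i + 1]) / 2  # average the pair into one token
        tokens = torch.cat([tokens[:i], merged[None], tokens[i + 2:]], dim=0)
    return tokens

print(aggregate(torch.randn(32, 64), target_len=8).shape)  # torch.Size([8, 64])
```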
- UMIFormer: Mining the Correlations between Similar Tokens for Multi-View 3D Reconstruction [9.874357856580447]
We propose a novel transformer network for Unstructured Multiple Images (UMIFormer).
It exploits transformer blocks for decoupled intra-view encoding and dedicated blocks for token rectification.
All tokens acquired from various branches are compressed into a fixed-size compact representation.
arXiv Detail & Related papers (2023-02-27T17:27:45Z)
- Phenaki: Variable Length Video Generation From Open Domain Textual Description [21.610541668826006]
Phenaki is a model capable of realistic video synthesis given a sequence of textual prompts.
A new model for learning video representations compresses the video into a small representation of discrete tokens.
To the best of our knowledge, this is the first time a paper studies generating videos from time-variable prompts.
arXiv Detail & Related papers (2022-10-05T17:18:28Z)
- Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from limited adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z)
- VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning [82.09856883441044]
Video understanding relies on perceiving the global content and modeling its internal connections.
We propose a block-wise strategy where we mask neighboring video tokens in both spatial and temporal domains (a rough sketch of this masking idea follows this entry).
We also add an augmentation-free contrastive learning method to further capture global content.
arXiv Detail & Related papers (2021-06-21T16:48:19Z)
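As a rough sketch of the block-wise masking idea in the VIMPAC entry above, the code below masks contiguous spatio-temporal blocks of a video-token grid until a target masking ratio is reached, rather than masking tokens independently. The grid size, block shape, and ratio are illustrative assumptions.

```python
# Illustrative block-wise masking over a (T, H, W) grid of video tokens:
# mask whole spatio-temporal neighborhoods instead of independent tokens.
import torch

def blockwise_mask(T=8, H=16, W=16, ratio=0.5, block=(2, 4, 4)):
    mask = torch.zeros(T, H, W, dtype=torch.bool)
    bt, bh, bw = block
    while mask.float().mean() < ratio:
        # Pick a random block origin and mask the whole neighborhood at once.
        t = torch.randint(0, T - bt + 1, (1,)).item()
        h = torch.randint(0, H - bh + 1, (1,)).item()
        w = torch.randint(0, W - bw + 1, (1,)).item()
        mask[t:t + bt, h:h + bh, w:w + bw] = True
    return mask

mask = blockwise_mask()
print(mask.float().mean())  # roughly the requested masking ratio
```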
This list is automatically generated from the titles and abstracts of the papers on this site.