Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space
- URL: http://arxiv.org/abs/2505.17011v1
- Date: Thu, 22 May 2025 17:59:02 GMT
- Title: Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space
- Authors: Yan Li, Changyao Tian, Renqiu Xia, Ning Liao, Weiwei Guo, Junchi Yan, Hongsheng Li, Jifeng Dai, Hao Li, Xue Yang
- Abstract summary: AdapTok is an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok consistently improves reconstruction quality and generation performance under different token budgets.
- Score: 84.22182151122598
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. Such design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments for video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.
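The adaptive allocation step above chooses how many tokens each block keeps, maximizing the scorer's predicted quality under a total budget. The paper formulates this as an integer linear program; as a minimal sketch, the same optimum can be found exactly with a small dynamic program over the budget. All names (`scores`, `allocate_tokens`) are illustrative, not from the paper's code.

```python
# Hypothetical sketch of AdapTok-style adaptive token allocation.
# scores[b][k-1] is the scorer's predicted reconstruction quality when
# block b keeps k tokens. We maximize total predicted quality subject to
# the total token count staying within the budget.

def allocate_tokens(scores, budget):
    """Return a tokens-per-block list maximizing total predicted quality."""
    num_blocks = len(scores)
    NEG = float("-inf")
    # best[b][t]: best total score using blocks 0..b-1 with exactly t tokens
    best = [[NEG] * (budget + 1) for _ in range(num_blocks + 1)]
    choice = [[0] * (budget + 1) for _ in range(num_blocks + 1)]
    best[0][0] = 0.0
    for b in range(num_blocks):
        for t in range(budget + 1):
            if best[b][t] == NEG:
                continue
            for k, q in enumerate(scores[b], start=1):  # give block b k tokens
                if t + k <= budget and best[b][t] + q > best[b + 1][t + k]:
                    best[b + 1][t + k] = best[b][t] + q
                    choice[b + 1][t + k] = k
    # pick the best achievable total within budget, then backtrack choices
    t = max(range(budget + 1), key=lambda t: best[num_blocks][t])
    alloc = []
    for b in range(num_blocks, 0, -1):
        k = choice[b][t]
        alloc.append(k)
        t -= k
    return alloc[::-1]

# Two blocks, budget of 4 tokens: quality saturates quickly for block 0
# but keeps improving for block 1, so the allocator favors block 1.
print(allocate_tokens([[0.5, 0.55, 0.56], [0.2, 0.6, 0.9]], budget=4))
# → [1, 3]
```

This mirrors the content-aware, budget-controlled behavior described in the abstract: blocks whose predicted quality plateaus early receive fewer tokens.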
Related papers
- LoViC: Efficient Long Video Generation with Context Compression [68.22069741704158]
We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations.
arXiv Detail & Related papers (2025-07-17T09:46:43Z)
- Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models [50.214593234229255]
We introduce the novel task of Extreme Short Token Reduction, which aims to represent entire videos using a minimal set of discrete tokens. On this task, our VQToken compresses sequences to just 0.07 percent of their original length while incurring only a 0.66 percent drop in accuracy on the NextQA-MC benchmark.
arXiv Detail & Related papers (2025-03-21T09:46:31Z)
- Make Your Training Flexible: Towards Deployment-Efficient Video Models [22.727848052298427]
We propose a new test setting, Token Optimization, which maximizes input information across budgets. By making the sampling grid flexible and leveraging token selection, it is easily adopted in most popular video training frameworks. We integrate Flux into large-scale video pre-training, and the resulting FluxViT establishes new state-of-the-art results across extensive tasks at standard costs.
arXiv Detail & Related papers (2025-03-18T13:15:58Z)
- Fast Autoregressive Video Generation with Diagonal Decoding [34.90521536645348]
Diagonal Decoding (DiagD) is a training-free inference acceleration algorithm for autoregressively pre-trained models. Our method generates tokens along diagonal paths in the spatial-temporal token grid, enabling parallel decoding within each frame. DiagD achieves up to a $10\times$ speedup compared to naive sequential decoding, while maintaining comparable visual fidelity.
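The diagonal traversal described above can be made concrete with a small sketch. Positions on the same anti-diagonal of the (time, space) token grid can be proposed in the same step, which is what enables the parallelism; the function name and grid layout here are illustrative, not taken from the DiagD implementation.

```python
# Illustrative sketch of a diagonal decoding schedule over a
# (time, space) token grid. Positions (t, s) with equal diagonal index
# d = t + s are grouped into one decoding step so they can be
# generated in parallel instead of one token at a time.

def diagonal_schedule(num_frames, tokens_per_frame):
    """Group grid positions (t, s) into decoding steps by diagonal index."""
    steps = []
    for d in range(num_frames + tokens_per_frame - 1):
        step = [(t, d - t)
                for t in range(num_frames)
                if 0 <= d - t < tokens_per_frame]
        steps.append(step)
    return steps

# A 2-frame, 3-token grid decodes in 4 diagonal steps rather than
# 6 strictly sequential token-by-token steps.
for step in diagonal_schedule(2, 3):
    print(step)
```

The step count grows as frames + tokens_per_frame − 1 rather than frames × tokens_per_frame, which is the source of the claimed speedup.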
arXiv Detail & Related papers (2025-03-18T09:42:55Z)
- Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction [93.69757398746017]
CoordTok is a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos. CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled $(x, y, t)$ coordinates.
arXiv Detail & Related papers (2024-11-22T06:50:44Z)
- LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior [36.663855554010674]
We present LARP, a novel video tokenizer designed to overcome limitations in current video tokenization methods for autoregressive (AR) generative models.
Unlike traditional patchwise tokenizers that directly encode local visual patches into discrete tokens, LARP introduces a holistic tokenization scheme.
It captures more global and semantic representations, rather than being limited to local patch-level information.
arXiv Detail & Related papers (2024-10-28T17:57:07Z)
- ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens. During inference, ElasticTok can dynamically allocate tokens as needed. Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about the specified spatial or temporal augmentations, and in doing so also achieves state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.