LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior
- URL: http://arxiv.org/abs/2410.21264v1
- Date: Mon, 28 Oct 2024 17:57:07 GMT
- Title: LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior
- Authors: Hanyu Wang, Saksham Suri, Yixuan Ren, Hao Chen, Abhinav Shrivastava,
- Abstract summary: We present LARP, a novel video tokenizer designed to overcome limitations in current video tokenization methods for autoregressive (AR) generative models.
Unlike traditional patchwise tokenizers that directly encode local visual patches into discrete tokens, LARP introduces a holistic tokenization scheme.
It captures more global and semantic representations, rather than being limited to local patch-level information.
- Score: 36.663855554010674
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present LARP, a novel video tokenizer designed to overcome limitations in current video tokenization methods for autoregressive (AR) generative models. Unlike traditional patchwise tokenizers that directly encode local visual patches into discrete tokens, LARP introduces a holistic tokenization scheme that gathers information from the visual content using a set of learned holistic queries. This design allows LARP to capture more global and semantic representations, rather than being limited to local patch-level information. Furthermore, it offers flexibility by supporting an arbitrary number of discrete tokens, enabling adaptive and efficient tokenization based on the specific requirements of the task. To align the discrete token space with downstream AR generation tasks, LARP integrates a lightweight AR transformer as a training-time prior model that predicts the next token on its discrete latent space. By incorporating the prior model during training, LARP learns a latent space that is not only optimized for video reconstruction but is also structured in a way that is more conducive to autoregressive generation. Moreover, this process defines a sequential order for the discrete tokens, progressively pushing them toward an optimal configuration during training, ensuring smoother and more accurate AR generation at inference time. Comprehensive experiments demonstrate LARP's strong performance, achieving state-of-the-art FVD on the UCF101 class-conditional video generation benchmark. LARP enhances the compatibility of AR models with videos and opens up the potential to build unified high-fidelity multimodal large language models (MLLMs).
Related papers
- Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space [84.22182151122598]
AdapTok is an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content.<n>AdapTok consistently improves reconstruction quality and generation performance under different token budgets.
arXiv Detail & Related papers (2025-05-22T17:59:02Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.
We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction [52.09472099976885]
IAR is an Improved AutoRegressive Visual Generation Method that enhances the training efficiency and generation quality of LLM-based visual generation models.
Our method consistently enhances the model training efficiency and performance from 100M to 1.4B, reducing the training time by half while achieving the same FID.
arXiv Detail & Related papers (2025-01-01T15:58:51Z) - COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z) - CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning [17.614980614656407]
We propose Continual Generative training for Incremental prompt-Learning.
We exploit Variational Autoencoders to learn class-conditioned distributions.
We show that such a generative replay approach can adapt to new tasks while improving zero-shot capabilities.
arXiv Detail & Related papers (2024-07-22T16:51:28Z) - SIGMA:Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling ( SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Adaptive Draft-Verification for Efficient Large Language Model Decoding [24.347886232342862]
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context.
The typical autoregressive decoding method requires a separate forward pass through the model for each token generated.
We introduce ADED, which accelerates LLM decoding without requiring fine-tuning.
arXiv Detail & Related papers (2024-06-27T22:20:39Z) - Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian recognition (PAR) algorithms are mainly developed based on a static image.
We propose to understand human attributes using video frames that can fully use temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z) - Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding [11.832919020149891]
This research aims to accelerate the inference speed of large language models (LLMs) with billions of parameters.
We propose textbfSmart textbfParallel textbfAuto-textbfCorrect dtextbfEcoding (SPACE)
arXiv Detail & Related papers (2024-02-19T03:39:10Z) - Large Language Models as General Pattern Machines [64.75501424160748]
We show that pre-trained large language models (LLMs) are capable of autoregressively completing complex token sequences.
Surprisingly, pattern completion proficiency can be partially retained even when the sequences are expressed using tokens randomly sampled from the vocabulary.
In this work, we investigate how these zero-shot capabilities may be applied to problems in robotics.
arXiv Detail & Related papers (2023-07-10T17:32:13Z) - Progressive Fourier Neural Representation for Sequential Video
Compilation [75.43041679717376]
Motivated by continual learning, this work investigates how to accumulate and transfer neural implicit representations for multiple complex video data over sequential encoding sessions.
We propose a novel method, Progressive Fourier Neural Representation (PFNR), that aims to find an adaptive and compact sub-module in Fourier space to encode videos in each training session.
We validate our PFNR method on the UVG8/17 and DAVIS50 video sequence benchmarks and achieve impressive performance gains over strong continual learning baselines.
arXiv Detail & Related papers (2023-06-20T06:02:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.