Long-Context Autoregressive Video Modeling with Next-Frame Prediction
- URL: http://arxiv.org/abs/2503.19325v2
- Date: Thu, 17 Apr 2025 15:26:04 GMT
- Title: Long-Context Autoregressive Video Modeling with Next-Frame Prediction
- Authors: Yuchao Gu, Weijia Mao, Mike Zheng Shou
- Abstract summary: We introduce Frame AutoRegressive (FAR), a strong baseline for video autoregressive modeling. Building on FAR, we observe that long-context video modeling faces challenges due to visual redundancy. We propose balancing locality and long-range dependency through long short-term context modeling.
- Score: 17.710915002557996
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Long-context autoregressive modeling has significantly advanced language generation, but video generation still struggles to fully utilize extended temporal contexts. To investigate long-context video modeling, we introduce Frame AutoRegressive (FAR), a strong baseline for video autoregressive modeling. Just as language models learn causal dependencies between tokens (i.e., Token AR), FAR models temporal causal dependencies between continuous frames, achieving better convergence than Token AR and video diffusion transformers. Building on FAR, we observe that long-context video modeling faces challenges due to visual redundancy. Training on long videos is computationally expensive, as vision tokens grow much faster than language tokens. To tackle this issue, we propose balancing locality and long-range dependency through long short-term context modeling. A high-resolution short-term context window ensures fine-grained temporal consistency, while an unlimited long-term context window encodes long-range information using fewer tokens. With this approach, we can train on long video sequences with a manageable token context length, thereby significantly reducing training time and memory usage. Furthermore, we propose a multi-level KV cache designed to support long short-term context modeling, which accelerates inference on long video sequences. We demonstrate that FAR achieves state-of-the-art performance in both short- and long-video generation, providing a simple yet effective baseline for video autoregressive modeling. The code is released at https://github.com/showlab/FAR.
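The long short-term context idea can be illustrated with a small sketch: the most recent frames keep their full token resolution, while older frames are compressed to a few tokens each before being concatenated into the model's context. This is a minimal sketch under assumed conventions; build_context, short_window, long_pool, and the average-pooling compression are illustrative choices, not FAR's actual implementation. The paper's multi-level KV cache follows the same split (full-resolution keys/values only for the short-term window); caching is omitted here.

```python
import torch
import torch.nn.functional as F

def build_context(frame_tokens, short_window=4, long_pool=8):
    """Assemble a long short-term context from per-frame tokens (sketch).

    frame_tokens: list of [num_tokens, dim] tensors, oldest frame first.
    short_window: number of most recent frames kept at full resolution.
    long_pool:    pooling factor applied to older (long-term) frames.
    All names and values are illustrative assumptions; FAR's token layout
    and compression scheme may differ.
    """
    long_term = frame_tokens[:-short_window]   # frames outside the short-term window
    short_term = frame_tokens[-short_window:]  # recent frames, kept uncompressed

    context = []
    for tok in long_term:
        # Compress each distant frame to num_tokens // long_pool tokens via
        # average pooling along the token axis (assumes tokens keep a 1D order).
        pooled = F.avg_pool1d(tok.t().unsqueeze(0), kernel_size=long_pool)
        context.append(pooled.squeeze(0).t())
    context.extend(short_term)
    return torch.cat(context, dim=0)  # [total_context_tokens, dim]

# Example: 32 past frames, 256 tokens per frame, feature dim 512.
frames = [torch.randn(256, 512) for _ in range(32)]
print(build_context(frames).shape)  # far fewer tokens than 32 * 256 = 8192
```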
Related papers
- Multimodal Long Video Modeling Based on Temporal Dynamic Context [13.979661295432964]
We propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC).
We segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders.
To handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments.
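The scene-segmentation step described above can be sketched as thresholding the similarity between consecutive frames. This is a minimal sketch assuming per-frame feature vectors are already available; segment_scenes, the cosine-similarity criterion, and the threshold value are illustrative assumptions rather than the TDC paper's exact procedure.

```python
import numpy as np

def segment_scenes(frame_embeddings, sim_threshold=0.85):
    """Split a video into scenes by thresholding inter-frame similarity (sketch).

    frame_embeddings: [num_frames, dim] array of per-frame features.
    Returns a list of (start, end) index pairs, end exclusive.
    """
    # L2-normalize so dot products are cosine similarities.
    emb = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    sims = np.sum(emb[1:] * emb[:-1], axis=1)  # similarity of each frame to its predecessor

    scenes, start = [], 0
    for i, s in enumerate(sims, start=1):
        if s < sim_threshold:          # low similarity => likely scene boundary
            scenes.append((start, i))
            start = i
    scenes.append((start, len(frame_embeddings)))
    return scenes
```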
arXiv Detail & Related papers (2025-04-14T17:34:06Z) - BIMBA: Selective-Scan Compression for Long-Range Video Question Answering [46.199493246921435]
Video Question Answering (VQA) in long videos poses the key challenge of extracting relevant information.
We introduce BIMBA, an efficient state-space model to handle long-form videos.
arXiv Detail & Related papers (2025-03-12T17:57:32Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
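The redundant-frame removal step can be sketched as a similarity filter over per-frame features. This is a minimal sketch assuming DINOv2-style frame embeddings are already computed; drop_redundant_frames and the threshold value are illustrative assumptions, not LongVU's exact algorithm.

```python
import torch
import torch.nn.functional as F

def drop_redundant_frames(features, keep_threshold=0.9):
    """Keep a frame only if it differs enough from the last kept frame (sketch).

    features: [num_frames, dim] tensor of per-frame embeddings
              (e.g., from a DINOv2 image encoder; obtaining them is
              outside this sketch). Returns indices of retained frames.
    """
    feats = F.normalize(features, dim=1)
    kept = [0]                                   # always keep the first frame
    for i in range(1, feats.shape[0]):
        sim = torch.dot(feats[i], feats[kept[-1]]).item()
        if sim < keep_threshold:                 # sufficiently novel frame
            kept.append(i)
    return kept
```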
arXiv Detail & Related papers (2024-10-22T21:21:37Z) - Long Context Transfer from Language to Vision [74.78422371545716]
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos.
In this paper, we approach this problem from the perspective of the language model.
By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training.
arXiv Detail & Related papers (2024-06-24T17:58:06Z) - ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models [66.84478240757038]
A majority of video diffusion models (VDMs) generate long videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frames of the previous clip.
We introduce causal (i.e., unidirectional) generation into VDMs, and use past frames as a prompt to generate future frames.
Our ViD-GPT achieves state-of-the-art performance both quantitatively and qualitatively on long video generation.
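The frame-wise causal conditioning described above can be illustrated with a block-causal attention mask: tokens within a frame attend to each other freely, while attention across frames only flows from past to present. This is a generic sketch of causal frame conditioning, not ViD-GPT's implementation; frame_causal_mask and its arguments are illustrative.

```python
import torch

def frame_causal_mask(num_frames, tokens_per_frame):
    """Boolean [N, N] mask where True = attention allowed (sketch).

    Tokens of a frame attend bidirectionally within the frame, but only
    causally to tokens of earlier frames ("past frames as prompt").
    """
    frame_ids = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    # query token i may attend to key token j iff frame(j) <= frame(i)
    return frame_ids.unsqueeze(1) >= frame_ids.unsqueeze(0)

mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.int())
```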
arXiv Detail & Related papers (2024-06-16T15:37:22Z) - Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities [67.89368528234394]
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities.
Video and audio are obtained at much higher rates than text and are roughly aligned in time.
Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models.
arXiv Detail & Related papers (2023-11-09T19:15:12Z) - Long-Form Video-Language Pre-Training with Multimodal Temporal
Contrastive Learning [39.80936685227549]
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
We introduce a Long-Form VIdeo-LAnguage pre-training model (VILA) and train it on a large-scale long-form video and paragraph dataset.
We fine-tune the model on seven downstream long-form video-language understanding tasks, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2022-10-12T09:08:27Z) - Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate information from a limited number of adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z)