Related papers: Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations

URL: http://arxiv.org/abs/2512.21004v1
Date: Wed, 24 Dec 2025 07:07:08 GMT
Title: Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations
Authors: Jinghan Li, Yang Jin, Hao Jiang, Yadong Mu, Yang Song, Kun Xu
Abstract summary: We propose NExT-Vid, a novel autoregressive visual generative pretraining framework. We introduce a context-isolated autoregressive predictor to decouple semantic representation from target decoding. Through context-isolated flow-matching pretraining, our approach achieves strong representations.
Score: 53.91818843831925
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in pretraining general foundation models have significantly improved performance across diverse downstream tasks. While autoregressive (AR) generative models like GPT have revolutionized NLP, most visual generative pretraining methods still rely on BERT-style masked modeling, which often disregards the temporal information essential for video analysis. The few existing autoregressive visual pretraining methods suffer from issues such as inaccurate semantic localization and poor generation quality, leading to poor semantics. In this work, we propose NExT-Vid, a novel autoregressive visual generative pretraining framework that utilizes masked next-frame prediction to jointly model images and videos. NExT-Vid introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and a conditioned flow-matching decoder to enhance generation quality and diversity. Through context-isolated flow-matching pretraining, our approach achieves strong representations. Extensive experiments on large-scale pretrained models demonstrate that our proposed method consistently outperforms previous generative pretraining methods for visual representation learning via attentive probing in downstream classification.

Related papers

DreamVAR: Taming Reinforced Visual Autoregressive Model for High-Fidelity Subject-Driven Image Generation [108.71044040025374]
We present a novel framework for subject-driven image synthesis built upon a Visual Autoregressive model that employs next-scale prediction. We show that DreamVAR achieves superior appearance preservation compared to leading diffusion-based methods.
arXiv Detail & Related papers (2026-01-30T03:32:29Z)
Generative Pre-trained Autoregressive Diffusion Transformer [74.25668109048418]
GPDiT is a Generative Pre-trained Autoregressive Diffusion Transformer. It unifies the strengths of diffusion and autoregressive modeling for long-range video synthesis. It autoregressively predicts future latent frames using a diffusion loss, enabling natural modeling of motion dynamics.
arXiv Detail & Related papers (2025-05-12T08:32:39Z)
Autoregressive Video Generation without Vector Quantization [90.87907377618747]
We reformulate the video generation problem as a non-quantized autoregressive modeling of temporal frame-by-frame prediction. With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity.
arXiv Detail & Related papers (2024-12-18T18:59:53Z)
DepthART: Monocular Depth Estimation as Autoregressive Refinement Task [2.3884184860468136]
We introduce DepthART - a novel training method formulated as a Depth Autoregressive Refinement Task. By utilizing the model's own predictions as inputs, we frame the objective as residual minimization, effectively reducing the discrepancy between training and inference procedures. When trained on Hypersim dataset using our approach, the model achieves superior results across multiple unseen benchmarks compared to existing generative and discriminative baselines.
arXiv Detail & Related papers (2024-09-23T13:36:34Z)
Denoising Autoregressive Representation Learning [13.185567468951628]
Our method, DARL, employs a decoder-only Transformer to predict image patches autoregressively. We show that the learned representation can be improved by using tailored noise schedules and longer training in larger models.
arXiv Detail & Related papers (2024-03-08T10:19:00Z)
Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_{\text{Gen}}$ is a framework designed for the pre-training of visually conditioned language generation models. We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
Iterative autoregression: a novel trick to improve your low-latency speech enhancement model [2.2999148299770047]
Streaming models are an essential component of real-time speech enhancement tools. We propose a straightforward yet effective alternative technique for training autoregressive low-latency speech enhancement models.
arXiv Detail & Related papers (2022-11-03T12:32:33Z)
Improving Non-autoregressive Generation with Mixup Training [51.61038444990301]
We present a non-autoregressive generation model based on pre-trained transformer models. We propose a simple and effective iterative training method called MIx Source and pseudo Target. Our experiments on three generation benchmarks, including question generation, summarization, and paraphrase generation, show that the proposed framework achieves new state-of-the-art results.
arXiv Detail & Related papers (2021-10-21T13:04:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.