End-to-End Training for Autoregressive Video Diffusion via Self-Resampling
- URL: http://arxiv.org/abs/2512.15702v1
- Date: Wed, 17 Dec 2025 18:53:29 GMT
- Title: End-to-End Training for Autoregressive Video Diffusion via Self-Resampling
- Authors: Yuwei Guo, Ceyuan Yang, Hao He, Yang Zhao, Meng Wei, Zhenheng Yang, Weilin Huang, Dahua Lin
- Abstract summary: Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. We introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale.
- Score: 63.84672807009907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.
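The abstract names two mechanisms: self-resampling (simulating inference-time model errors on history frames during training) and history routing (parameter-free top-k retrieval of relevant history frames). The sketch below illustrates both ideas in minimal form; `toy_denoiser`, `resample_history`, `route_history`, and all tensor shapes are hypothetical stand-ins inferred from the abstract alone, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def toy_denoiser(x_t, t):
    """Stand-in for the causal video diffusion model (predicts clean frames)."""
    return x_t * (1.0 - t)  # placeholder dynamics, not a real network

def resample_history(history, t_mid=0.5):
    """Self-resampling: partially re-noise clean history frames, then let the
    model itself denoise them, so training conditions on model-degraded
    histories that mimic inference-time errors."""
    noise = torch.randn_like(history)
    noised = (1.0 - t_mid) * history + t_mid * noise  # forward corruption
    with torch.no_grad():  # degraded history is conditioning, not a target
        return toy_denoiser(noised, t_mid)

def route_history(query_feat, history_feats, k=4):
    """History routing: parameter-free top-k retrieval of the history frames
    most relevant to a query frame, by feature similarity."""
    sims = F.cosine_similarity(query_feat.unsqueeze(0), history_feats, dim=-1)
    return sims.topk(min(k, history_feats.shape[0])).indices

history = torch.randn(16, 64)               # 16 history frames as 64-d features (toy)
degraded = resample_history(history)        # conditioning used during training
query = torch.randn(64)
print(route_history(query, degraded, k=4))  # indices of the 4 routed frames
```

In the full method, these degraded histories are consumed under a sparse causal mask so that all frames train in parallel with a frame-level diffusion loss; that machinery is omitted from the toy above.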
Related papers
- LIVE: Long-horizon Interactive Video World Modeling [39.52605866460851]
The Long-horizon Interactive Video world modEl (LIVE) enforces bounded error accumulation via a novel cycle-consistency objective. LIVE achieves state-of-the-art performance on long-horizon benchmarks, generating stable, high-quality videos far beyond training rollout lengths.
arXiv Detail & Related papers (2026-02-03T17:10:03Z)
- BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models [50.986189632485285]
We introduce Backwards Aggregation (BAgger), a self-supervised scheme that constructs corrective trajectories from the model's own rollouts. Unlike prior approaches that rely on few-step distillation and distribution-matching losses, BAgger trains with standard score- or flow-matching objectives. We instantiate BAgger on causal diffusion transformers and evaluate it on text-to-video, video extension, and multi-prompt generation.
arXiv Detail & Related papers (2025-12-12T23:02:02Z)
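A minimal sketch of the corrective-pair idea as the summary describes it: the context comes from the model's own (drifted) rollout, while the regression target stays ground truth, trained with a standard flow-matching objective. `toy_velocity_model` and all shapes are hypothetical, not BAgger's actual architecture.

```python
import torch

def toy_velocity_model(x_t, t, context):
    """Hypothetical network: predicts flow velocity from noisy input + context."""
    return x_t - context.mean(dim=0, keepdim=True)

def flow_matching_loss(model, clean_frame, context):
    """Standard conditional flow matching: regress the velocity of the linear
    noise-to-data path, conditioned on (possibly drifted) context frames."""
    t = torch.rand(())                       # random time in [0, 1]
    noise = torch.randn_like(clean_frame)
    x_t = (1 - t) * noise + t * clean_frame  # point on the interpolation path
    target_v = clean_frame - noise           # velocity of that path
    return ((model(x_t, t, context) - target_v) ** 2).mean()

# Corrective pair: context is the model's own (drifted) rollout, but the
# regression target is the ground-truth continuation.
rollout_context = torch.randn(4, 8, 8)       # 4 self-generated frames (toy)
ground_truth_next = torch.randn(1, 8, 8)
print(float(flow_matching_loss(toy_velocity_model, ground_truth_next, rollout_context)))
```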
- EvDiff: High Quality Video with an Event Camera [77.07279880903009]
Reconstructing intensity images from events is a highly ill-posed task due to the inherent ambiguity of absolute brightness. We propose EvDiff, an event-based diffusion model that follows a surrogate training framework to produce high-quality videos.
arXiv Detail & Related papers (2025-11-21T18:49:18Z)
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [67.94300151774085]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must, at inference, generate sequences conditioned on their own imperfect outputs.
arXiv Detail & Related papers (2025-06-09T17:59:55Z)
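A toy sketch of the train-test gap being closed: the training rollout conditions each frame on the model's own previous outputs, exactly as at inference. The one-step `generate_frame` and the sequence-level loss are hypothetical placeholders for the paper's few-step diffusion rollout and training objective.

```python
import torch

def generate_frame(context):
    """Hypothetical one-step generator standing in for a diffusion rollout."""
    return context[-1] * 0.9 + 0.1 * torch.randn_like(context[-1])

def self_rollout(first_frame, num_frames):
    """Condition every new frame on the model's OWN outputs, as at inference."""
    frames = [first_frame]
    for _ in range(num_frames - 1):
        frames.append(generate_frame(torch.stack(frames)))
    return torch.stack(frames)

ground_truth = torch.randn(8, 16)              # 8 frames of toy features
rollout = self_rollout(ground_truth[0], 8)     # self-generated context
loss = ((rollout - ground_truth) ** 2).mean()  # placeholder sequence-level loss
print(float(loss))
```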
- ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL uses the answer accuracy of a downstream model as a reward signal to train a frame selector through trial and error, and it consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z)
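A hedged sketch of the trial-and-error signal described above, using a plain REINFORCE update: the only reward is whether the downstream model answers correctly given the selected frames. `downstream_is_correct` is a hypothetical stub, and the paper's rule-based RL recipe may differ.

```python
import torch

logits = torch.zeros(16, requires_grad=True)  # selection scores for 16 frames
opt = torch.optim.Adam([logits], lr=0.1)

def downstream_is_correct(frame_ids):
    """Hypothetical stub: pretend frame 5 is the one that answers the question."""
    return float(5 in frame_ids.tolist())

for step in range(200):
    dist = torch.distributions.Categorical(torch.softmax(logits, dim=0))
    picks = dist.sample((4,))                    # try a set of 4 frames
    reward = downstream_is_correct(picks)        # 0/1 answer-accuracy reward
    loss = -reward * dist.log_prob(picks).sum()  # REINFORCE estimator
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.softmax(logits, dim=0).argmax())  # mass should shift to frame 5
```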
- AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion [19.98565541640125]
We introduce Auto-Regressive Diffusion (AR-Diffusion), a novel model that combines the strengths of auto-regressive and diffusion models for flexible video generation. Inspired by auto-regressive generation, we impose a non-decreasing constraint on the corruption timesteps of individual frames. This setup, together with temporal causal attention, enables flexible generation of videos with varying lengths while preserving temporal coherence.
arXiv Detail & Related papers (2025-03-10T15:05:59Z)
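The non-decreasing timestep constraint is concrete enough to sketch: sample a corruption timestep per frame, then sort along the frame axis so history frames are never noisier than later frames. The linear noise schedule below is illustrative, not the paper's.

```python
import torch

def sample_nondecreasing_timesteps(num_frames, num_steps=1000):
    """Per-frame corruption timesteps, sorted so t_1 <= t_2 <= ... <= t_F."""
    t = torch.randint(0, num_steps, (num_frames,))
    return torch.sort(t).values

def corrupt_video(frames, timesteps, num_steps=1000):
    """Frame-wise forward diffusion: later frames receive more noise."""
    keep = 1.0 - timesteps.float() / num_steps         # toy linear schedule
    keep = keep.view(-1, *([1] * (frames.dim() - 1)))  # broadcast over C,H,W
    return keep * frames + (1.0 - keep) * torch.randn_like(frames)

video = torch.randn(8, 3, 32, 32)       # 8 frames (toy)
t = sample_nondecreasing_timesteps(8)
noisy = corrupt_video(video, t)         # history stays cleaner than the future
print(t)
```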
- Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
The new tokenizer, the Conditioned Diffusion-based Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model. It is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss. Even a scaled-down version of CDT (3× inference speedup) still performs comparably with top baselines.
arXiv Detail & Related papers (2025-03-05T17:59:19Z)
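A minimal sketch of the stated objective: a basic MSE diffusion loss plus a KL latent regularizer and a perceptual term. The loss weights are invented for illustration, and a simple MSE stands in for LPIPS (a real implementation would call the `lpips` package).

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ), the usual latent regularizer."""
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).mean()

def perceptual_loss(x, y):
    return ((x - y) ** 2).mean()  # stand-in; CDT uses an LPIPS loss here

# Toy tensors standing in for the diffusion decoder's training signals.
mu, logvar = torch.randn(4, 16), torch.randn(4, 16)
pred_noise, true_noise = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)
recon, target = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)

mse_diffusion = ((pred_noise - true_noise) ** 2).mean()  # basic diffusion loss
loss = (mse_diffusion
        + 1e-6 * kl_to_standard_normal(mu, logvar)  # illustrative weight
        + 0.1 * perceptual_loss(recon, target))     # illustrative weight
print(float(loss))
```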
- Joint Diffusion models in Continual Learning [4.109431356659216]
We introduce JDCL, a new method for continual learning with generative rehearsal based on joint diffusion models. Generative-replay-based continual learning methods mitigate catastrophic forgetting by retraining a model on a combination of new data and rehearsal data sampled from a generative model. We show that JDCL's shared parametrization, combined with a knowledge distillation technique, allows for stable adaptation to new tasks without catastrophic forgetting.
arXiv Detail & Related papers (2024-11-12T22:35:44Z)
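A toy sketch of generative rehearsal with distillation, as the summary describes: the training batch mixes real new-task data with samples drawn from the generative model, and a distillation term anchors the updated model to a frozen snapshot on the replayed samples. All functions and shapes here are hypothetical stubs.

```python
import torch
import torch.nn.functional as F

def sample_from_generator(n):
    return torch.randn(n, 16)        # stand-in for diffusion sampling

def model_logits(params, x):
    return x @ params                # toy shared classifier head

old_params = torch.randn(16, 5)      # frozen snapshot after previous tasks
new_params = old_params.clone().requires_grad_(True)
opt = torch.optim.SGD([new_params], lr=0.01)

new_data = torch.randn(32, 16)       # real data from the current task
new_labels = torch.randint(0, 5, (32,))

rehearsal = sample_from_generator(32)  # replayed pseudo-data
with torch.no_grad():
    teacher = model_logits(old_params, rehearsal)

task_loss = F.cross_entropy(model_logits(new_params, new_data), new_labels)
distill = ((model_logits(new_params, rehearsal) - teacher) ** 2).mean()
(task_loss + distill).backward()
opt.step()
```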
- Lifelong Learning of Video Diffusion Models From a Single Video Stream [21.20227667908252]
This work demonstrates that autoregressive video diffusion models can be trained from a single video stream. The accompanying Lifelong Bouncing Drive benchmark consists of one consecutive video stream totaling 3 million frames.
arXiv Detail & Related papers (2024-06-07T10:32:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.