Taming Teacher Forcing for Masked Autoregressive Video Generation
- URL: http://arxiv.org/abs/2501.12389v1
- Date: Tue, 21 Jan 2025 18:59:31 GMT
- Title: Taming Teacher Forcing for Masked Autoregressive Video Generation
- Authors: Deyu Zhou, Quan Sun, Yuang Peng, Kun Yan, Runpei Dong, Duomin Wang, Zheng Ge, Nan Duan, Xiangyu Zhang, Lionel M. Ni, Heung-Yeung Shum
- Abstract summary: We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than masked ones. CTF significantly outperforms MTF, achieving a +23% improvement in FVD scores on first-frame conditioned video prediction. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames, even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
- Score: 63.477471494341955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than masked ones (as in Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation. CTF significantly outperforms MTF, achieving a +23% improvement in FVD scores on first-frame conditioned video prediction. To address issues like exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames, even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
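As a concrete illustration of the CTF idea, here is a minimal PyTorch-style sketch of how the two conditioning schemes differ during training; the tensor shapes, the helper name, and the [MASK]-token handling are assumptions for illustration, not the authors' implementation.

```python
import torch

def ctf_training_inputs(frames, mask_token, mask_ratio=0.5):
    """Illustrative sketch of Complete Teacher Forcing (CTF) conditioning.

    frames:     (B, T, N, D) patch tokens for B clips, T frames, N patches each.
    mask_token: (D,) learned [MASK] embedding.
    Shapes and names here are assumptions, not the authors' implementation.
    """
    B, T, N, D = frames.shape
    mask = torch.rand(B, T, N) < mask_ratio  # True where a patch is masked out
    masked = torch.where(mask.unsqueeze(-1), mask_token.expand(B, T, N, D), frames)

    inputs = []
    for t in range(T):
        context = frames[:, :t]       # CTF: complete (unmasked) observation frames
        # context = masked[:, :t]     # MTF: masked context -- the weaker baseline
        query = masked[:, t : t + 1]  # frame t with some patches replaced by [MASK]
        inputs.append(torch.cat([context, query], dim=1))  # (B, t + 1, N, D)
    return inputs, mask
```

Swapping in the commented line recovers Masked Teacher Forcing (MTF), the baseline that CTF is reported to outperform.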
Related papers
- CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation [52.0601996237501]
Chain-of-Frame (CoF) reasoning enables frame-by-frame visual inference.
CoF-T2I integrates CoF reasoning into text-to-image (T2I) generation via progressive visual refinement.
Experiments show that CoF-T2I significantly outperforms the base video model.
arXiv Detail & Related papers (2026-01-15T04:33:06Z)
- MaskFlow: Discrete Flows For Flexible and Efficient Long Video Generation [25.721829124345106]
We introduce MaskFlow, a unified video generation framework that combines discrete representations with flow-matching.
By leveraging a frame-level masking strategy during training, MaskFlow conditions on previously generated unmasked frames to generate videos ten times longer than its training sequences.
We validate the quality of our method on the FaceForensics (FFS) and Deepmind Lab (DMLab) datasets and report Fréchet Video Distance (FVD) scores competitive with state-of-the-art approaches.
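To make the frame-level conditioning concrete, here is a minimal rollout loop in the spirit of the description above; the `model.generate` call, the window and context sizes, and the tensor layout are all hypothetical.

```python
import torch

@torch.no_grad()
def rollout(model, init_frames, total_len, ctx=8, window=16):
    """Extend a video far beyond the training length by repeatedly
    conditioning on the most recent unmasked (already generated) frames.
    init_frames: (B, ctx, C, H, W); returns (B, total_len, C, H, W)."""
    frames = list(init_frames.unbind(dim=1))
    while len(frames) < total_len:
        context = torch.stack(frames[-ctx:], dim=1)  # previously generated, unmasked
        # Hypothetical API: fill (window - ctx) fully masked frames given context.
        new = model.generate(context=context, num_masked_frames=window - ctx)
        frames.extend(new.unbind(dim=1))
    return torch.stack(frames[:total_len], dim=1)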
arXiv Detail & Related papers (2025-02-16T18:59:11Z)
- From Slow Bidirectional to Fast Autoregressive Video Diffusion Models [52.32078428442281]
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies.
We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly.
Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models.
arXiv Detail & Related papers (2024-12-10T18:59:50Z)
- Concatenated Masked Autoencoders as Spatial-Temporal Learner [6.475592804311682]
We introduce the Concatenated Masked Autoencoders (CatMAE) as a spatial-temporal learner for self-supervised video representation learning.
We propose a new data augmentation strategy, Video-Reverse (ViRe), which uses reversed video frames as the model's reconstruction targets.
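The ViRe idea is simple enough to sketch in a few lines, assuming a (B, T, C, H, W) clip layout; this is an illustration of the stated idea, not the authors' code.

```python
import torch

def vire_target(clip):
    """Video-Reverse (ViRe) sketch: the reconstruction target is the clip
    played backwards. Assumes time is dimension 1 of a (B, T, C, H, W) tensor."""
    return clip.flip(dims=[1])
```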
arXiv Detail & Related papers (2023-11-02T03:08:26Z)
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811]
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods.
Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
arXiv Detail & Related papers (2023-03-28T15:39:28Z)
- Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation [82.26026492545533]
We introduce a novel task, text-guided video completion (TVC), which asks the model to generate a video from partial frames guided by an instruction.
We then propose Multimodal Masked Video Generation (MMVG) to address this TVC task.
At inference time, a single MMVG model can address all three TVC cases, including video prediction, rewind, and infilling, by applying the corresponding masking conditions.
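A hedged sketch of how the three cases might reduce to different frame-visibility masks; the 1/4 boundary fractions are invented for illustration.

```python
import torch

def tvc_visibility(T, mode):
    """Boolean frame-visibility vector for the three TVC cases.
    True = frame is observed; the boundary fractions are assumptions."""
    keep = torch.zeros(T, dtype=torch.bool)
    k = max(T // 4, 1)
    if mode == "prediction":    # first frames given, generate the future
        keep[:k] = True
    elif mode == "rewind":      # last frames given, generate the past
        keep[-k:] = True
    elif mode == "infilling":   # both ends given, fill in the middle
        keep[:k] = True
        keep[-k:] = True
    return keep
```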
arXiv Detail & Related papers (2022-11-23T10:14:12Z)
- Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation [14.631523634811392]
Masked Conditional Video Diffusion (MCVD) is a general-purpose framework for video prediction, generation, and interpolation.
We train the model by randomly and independently masking all past frames or all future frames.
Our approach yields SOTA results across standard video prediction benchmarks, with model training times of 1-12 days.
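The masking scheme described above can be sketched as conditioning dropout over whole blocks of frames; the 0.5 drop probability is an assumption for illustration.

```python
import torch

def mcvd_conditioning_dropout(past, future, p=0.5):
    """Independently zero out the entire past block and/or the entire
    future block, so one model learns prediction (no future), generation
    (neither), and interpolation (both). p=0.5 is an assumption."""
    if torch.rand(()) < p:
        past = torch.zeros_like(past)
    if torch.rand(()) < p:
        future = torch.zeros_like(future)
    return past, future
```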
arXiv Detail & Related papers (2022-05-19T20:58:05Z)
- Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks [68.93429034530077]
We propose dynamics-aware implicit generative adversarial network (DIGAN) for video generation.
We show that DIGAN can be trained on 128-frame videos at 128x128 resolution, 80 frames longer than the 48 frames of the previous state-of-the-art method.
arXiv Detail & Related papers (2022-02-21T23:24:01Z)
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where the local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.