Related papers: Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation

URL: http://arxiv.org/abs/2508.03334v2
Date: Wed, 06 Aug 2025 08:21:19 GMT
Title: Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation
Authors: Xunzhi Xiang, Yabo Chen, Guiyu Zhang, Zhongyu Wang, Zhe Gao, Quanming Xiang, Gonghu Shang, Junqi Liu, Haibin Huang, Yang Gao, Chi Zhang, Qi Fan, Xuelong Li
Abstract summary: Current autoregressive diffusion models excel at video generation but are generally limited to short temporal durations. We propose a planning-then-populating framework centered on Macro-from-Micro Planning (MMPL) for long video generation. MMPL sketches a global storyline for the entire video through two hierarchical stages: Micro Planning and Macro Planning.
Score: 50.42977813298953
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current autoregressive diffusion models excel at video generation but are generally limited to short temporal durations. Our theoretical analysis indicates that autoregressive modeling typically suffers from temporal drift caused by error accumulation and hinders parallelization in long video synthesis. To address these limitations, we propose a novel planning-then-populating framework centered on Macro-from-Micro Planning (MMPL) for long video generation. MMPL sketches a global storyline for the entire video through two hierarchical stages: Micro Planning and Macro Planning. Specifically, Micro Planning predicts a sparse set of future keyframes within each short video segment, offering motion and appearance priors to guide high-quality video segment generation. Macro Planning extends the in-segment keyframe planning across the entire video through an autoregressive chain of micro plans, ensuring long-term consistency across video segments. Subsequently, MMPL-based Content Populating generates all intermediate frames in parallel across segments, enabling efficient parallelization of autoregressive generation. The parallelization is further optimized by Adaptive Workload Scheduling for balanced GPU execution and accelerated autoregressive video generation. Extensive experiments confirm that our method outperforms existing long video generation models in quality and stability. Generated videos and comparison results are on our project page.
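Illustrative sketch: the planning-then-populating flow described in the abstract can be pictured roughly as below. This is a toy illustration and not the authors' implementation. Frames are plain floats, the "models" are simple arithmetic rules, and every name (`micro_plan`, `macro_plan`, `populate_segment`, `generate`) and parameter is hypothetical. Adaptive Workload Scheduling is not modelled; a thread pool merely stands in for parallel execution across segments.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Frame = float                              # toy stand-in for a video frame (assumption)
Plan = Tuple[Frame, List[Tuple[int, Frame]]]  # (segment anchor, [(time offset, keyframe)])


def micro_plan(anchor: Frame, num_keyframes: int, stride: int) -> List[Tuple[int, Frame]]:
    """Predict a sparse set of future keyframes inside one segment.

    Toy rule (assumption): each keyframe drifts linearly away from the anchor.
    """
    return [(k * stride, anchor + 0.1 * k * stride) for k in range(1, num_keyframes + 1)]


def macro_plan(first_frame: Frame, num_segments: int,
               num_keyframes: int, stride: int) -> List[Plan]:
    """Chain micro plans autoregressively across the whole video.

    The last keyframe of one segment seeds the micro plan of the next,
    so only the sparse plan is sequential, not every frame.
    """
    plans: List[Plan] = []
    anchor = first_frame
    for _ in range(num_segments):
        keyframes = micro_plan(anchor, num_keyframes, stride)
        plans.append((anchor, keyframes))
        anchor = keyframes[-1][1]
    return plans


def populate_segment(plan: Plan) -> List[Frame]:
    """Fill in the intermediate frames of one segment from its planned keyframes.

    Toy rule (assumption): linear interpolation between consecutive keyframes.
    """
    anchor, keyframes = plan
    frames = [anchor]
    prev_t, prev_f = 0, anchor
    for t, f in keyframes:
        span = t - prev_t
        for step in range(1, span + 1):
            frames.append(prev_f + (f - prev_f) * step / span)
        prev_t, prev_f = t, f
    return frames


def generate(first_frame: Frame, num_segments: int = 3,
             num_keyframes: int = 2, stride: int = 4) -> List[Frame]:
    """Plan sequentially, then populate all segments in parallel."""
    plans = macro_plan(first_frame, num_segments, num_keyframes, stride)
    with ThreadPoolExecutor() as pool:
        segments = list(pool.map(populate_segment, plans))
    video = list(segments[0])
    for seg in segments[1:]:
        video.extend(seg[1:])  # skip the anchor, which repeats the previous segment's last frame
    return video


if __name__ == "__main__":
    print(len(generate(0.0)))
```

In this sketch, once the keyframes are fixed each segment depends only on its own plan, which is why the populating step can run independently per segment.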

Related papers

STAGE: A Stream-Centric Generative World Model for Long-Horizon Driving-Scene Simulation [24.86836673853292]
STAGE is an auto-regressive framework that pioneers hierarchical feature coordination and multiphase optimization for sustainable video synthesis. HTFT enhances temporal consistency between video frames throughout the video generation process. We generated 600 frames of high-quality driving videos on the nuScenes dataset, which far exceeds the maximum length achievable by existing methods.
arXiv Detail & Related papers (2025-06-16T06:53:05Z)
InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO [73.33751812982342]
InfLVG is an inference-time framework that enables coherent long video generation without requiring additional long-form video data. We show that InfLVG can extend video length by up to 9×, achieving strong consistency and semantic fidelity across scenes.
arXiv Detail & Related papers (2025-05-23T07:33:25Z)
SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Long Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content. We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context. Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z)
MarDini: Masked Autoregressive Diffusion for Video Generation at Scale [76.84820168294586]
MarDini is a new family of video diffusion models that integrate the advantages of masked auto-regression into a unified diffusion model (DM) framework. MarDini sets a new state-of-the-art for video interpolation; it efficiently generates videos on par with those of much more expensive advanced image-to-video models.
arXiv Detail & Related papers (2024-10-26T21:12:32Z)
Anchored Diffusion for Video Face Reenactment [17.343307538702238]
We introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos. We train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance. During inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame.
arXiv Detail & Related papers (2024-07-21T13:14:17Z)
Video Language Planning [137.06052217713054]
Video language planning is an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models. Our algorithm produces detailed multimodal (video and language) specifications that describe how to complete the final task. It substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots.
arXiv Detail & Related papers (2023-10-16T17:48:45Z)
Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding [57.917616284917756]
Real-world videos are often several minutes long with semantically consistent segments of variable length. A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length. This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative.
arXiv Detail & Related papers (2023-09-20T18:13:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.
