Next Block Prediction: Video Generation via Semi-Autoregressive Modeling
- URL: http://arxiv.org/abs/2502.07737v2
- Date: Wed, 12 Feb 2025 14:50:50 GMT
- Title: Next Block Prediction: Video Generation via Semi-Autoregressive Modeling
- Authors: Shuhuai Ren, Shuming Ma, Xu Sun, Furu Wei
- Abstract summary: Next-Block Prediction (NBP) is a semi-autoregressive (semi-AR) framework for video generation.
NBP employs bidirectional attention within each block, enabling tokens to capture more robust spatial dependencies.
Our model achieves FVD scores of 103.3 on UCF101 and 25.5 on K600, outperforming the vanilla NTP model by an average of 4.4.
- Abstract: Next-Token Prediction (NTP) is a de facto approach for autoregressive (AR) video generation, but it suffers from suboptimal unidirectional dependencies and slow inference speed. In this work, we propose a semi-autoregressive (semi-AR) framework, called Next-Block Prediction (NBP), for video generation. By uniformly decomposing video content into equal-sized blocks (e.g., rows or frames), we shift the generation unit from individual tokens to blocks, allowing each token in the current block to simultaneously predict the corresponding token in the next block. Unlike traditional AR modeling, our framework employs bidirectional attention within each block, enabling tokens to capture more robust spatial dependencies. By predicting multiple tokens in parallel, NBP models significantly reduce the number of generation steps, leading to faster and more efficient inference. Our model achieves FVD scores of 103.3 on UCF101 and 25.5 on K600, outperforming the vanilla NTP model by an average of 4.4. Furthermore, thanks to the reduced number of inference steps, the NBP model generates 8.89 frames (128x128 resolution) per second, achieving an 11x speedup. We also explored model scales ranging from 700M to 3B parameters, observing significant improvements in generation quality, with FVD scores dropping from 103.3 to 55.3 on UCF101 and from 25.5 to 19.5 on K600, demonstrating the scalability of our approach.
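To make the block-wise mechanism concrete, the sketch below illustrates the two ingredients the abstract describes: a semi-AR attention mask (bidirectional within each block, causal across blocks) and training targets shifted by one whole block rather than one token. This is a minimal illustration assuming a flattened sequence of num_blocks * block_size tokens; the function names are hypothetical, not the authors' implementation.

```python
import torch

def nbp_attention_mask(num_blocks: int, block_size: int) -> torch.Tensor:
    """Boolean mask (True = may attend). Tokens attend bidirectionally
    within their own block and causally to all earlier blocks, never
    to later blocks."""
    n = num_blocks * block_size
    block_id = torch.arange(n) // block_size  # block index of each token
    # token i attends to token j iff j's block is not later than i's block
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)

def nbp_targets(tokens: torch.Tensor, block_size: int) -> torch.Tensor:
    """Training targets: each token in block k predicts the token at the
    same position in block k+1, so labels are the input shifted by one
    whole block (block_size tokens) instead of the usual single token."""
    return tokens[:, block_size:]

# Example: 3 blocks of 4 tokens each (e.g. one row of a token grid per block)
mask = nbp_attention_mask(num_blocks=3, block_size=4)  # shape (12, 12)
tokens = torch.arange(24).reshape(2, 12)               # a batch of 2 sequences
targets = nbp_targets(tokens, block_size=4)            # labels for blocks 1..2
```

Because one forward pass then emits an entire block of tokens in parallel, the number of decoding steps shrinks by roughly a factor of the block size, which is the source of the reported 11x inference speedup.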
Related papers
- Parallelized Autoregressive Visual Generation [65.9579525736345]
We propose a simple yet effective approach for parallelized autoregressive visual generation.
Our method achieves a 3.6x speedup with comparable quality and up to 9.5x speedup with minimal quality degradation across both image and video generation tasks.
arXiv Detail & Related papers (2024-12-19T17:59:54Z)
- Autoregressive Video Generation without Vector Quantization [90.87907377618747]
We reformulate video generation as non-quantized autoregressive modeling with temporal frame-by-frame prediction.
With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA.
Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity.
arXiv Detail & Related papers (2024-12-18T18:59:53Z)
- From Slow Bidirectional to Fast Autoregressive Video Diffusion Models [52.32078428442281]
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies.
We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly.
Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models.
arXiv Detail & Related papers (2024-12-10T18:59:50Z)
- Efficient Continuous Video Flow Model for Video Prediction [43.16308241800144]
Multi-step prediction models, such as diffusion and rectified flow models, exhibit higher latency in sampling new frames compared to single-step methods.
We propose a novel approach to modeling the multi-step process, aimed at alleviating latency constraints and facilitating the adaptation of such processes for video prediction tasks.
arXiv Detail & Related papers (2024-12-07T12:11:25Z)
- ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler [53.98558445900626]
Current image-to-video diffusion models, while powerful in generating videos from a single frame, need adaptation for two-frame conditioned generation.
We introduce a novel, bidirectional sampling strategy to address these off-manifold issues without requiring extensive re-noising or fine-tuning.
Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent and on-manifold generation of intermediate frames.
arXiv Detail & Related papers (2024-10-08T03:01:54Z)
- Realizing Unaligned Block-wise Pruning for DNN Acceleration on Mobile Devices [1.6114012813668932]
Block-wise pruning is promising because it trades a small accuracy drop for substantial speedup gains, but conventional approaches restrict pruned blocks to aligned, fixed-grid positions.
Unaligned block pruning (UBP) addresses this by allowing blocks to be selected at arbitrary positions.
We propose a pseudo-optimal yet fast block selection algorithm called Block Expansion and Division.
arXiv Detail & Related papers (2024-07-29T01:59:06Z)
- Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks [68.93429034530077]
We propose dynamics-aware implicit generative adversarial network (DIGAN) for video generation.
We show that DIGAN can be trained on 128-frame videos at 128x128 resolution, 80 frames longer than the 48 frames of the previous state-of-the-art method.
arXiv Detail & Related papers (2022-02-21T23:24:01Z)
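As a rough, hypothetical illustration of the implicit (coordinate-based) generation idea behind DIGAN, the sketch below maps an (x, y, t) coordinate plus a latent code to an RGB value, so a whole video is a continuous function that can be queried at arbitrary positions and time steps. The architecture shown is a generic coordinate MLP, not DIGAN's actual generator.

```python
import torch
import torch.nn as nn

class ImplicitVideoGenerator(nn.Module):
    """Generic coordinate-MLP video generator (illustrative only):
    maps an (x, y, t) coordinate and a latent code to an RGB value."""
    def __init__(self, latent_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Tanh(),  # RGB in [-1, 1]
        )

    def forward(self, coords: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # coords: (N, 3) with (x, y, t) in [-1, 1]; z: (latent_dim,)
        z = z.unsqueeze(0).expand(coords.shape[0], -1)
        return self.net(torch.cat([coords, z], dim=-1))

# Sample an 8x8 frame at an arbitrary time step from one latent code
g = ImplicitVideoGenerator()
xs, ys = torch.meshgrid(torch.linspace(-1, 1, 8), torch.linspace(-1, 1, 8), indexing="ij")
t = torch.full_like(xs, 0.25)  # query time step
coords = torch.stack([xs, ys, t], dim=-1).reshape(-1, 3)
frame = g(coords, torch.randn(64)).reshape(8, 8, 3)
```

Because time is just another input coordinate, video length is not hard-wired into the architecture, one reason implicit models can scale to longer clips.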
- YOLO-ReT: Towards High Accuracy Real-time Object Detection on Edge GPUs [14.85882314822983]
In order to map deep neural network (DNN) based object detection models to edge devices, one typically needs to compress such models significantly.
In this paper, we propose a novel edge GPU friendly module for multi-scale feature interaction.
We also propose a novel transfer learning based backbone truncation inspired by the changing translational information flow across various tasks.
arXiv Detail & Related papers (2021-10-26T14:02:59Z)
- Gradient Forward-Propagation for Large-Scale Temporal Video Modelling [13.665160620951777]
Backpropagation blocks computations until the forward and backward passes are completed.
For temporal signals, this introduces high latency and hinders real-time learning.
In this paper, we build upon Sideways, which avoids blocking by propagating approximate gradients forward in time.
We show how to decouple computation and delegate individual neural modules to different devices, allowing distributed and parallel training.
arXiv Detail & Related papers (2021-06-15T17:50:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.