FastInit: Fast Noise Initialization for Temporally Consistent Video Generation
- URL: http://arxiv.org/abs/2506.16119v1
- Date: Thu, 19 Jun 2025 08:11:45 GMT
- Title: FastInit: Fast Noise Initialization for Temporally Consistent Video Generation
- Authors: Chengyu Bai, Yuming Li, Zhongyu Zhao, Jintao Chen, Peidong Jia, Qi She, Ming Lu, Shanghang Zhang
- Abstract summary: We introduce FastInit, a method that eliminates the need for iterative refinement during inference. FastInit greatly enhances the efficiency of video generation while achieving high temporal consistency across frames.
- Score: 27.825641236811887
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video generation has made significant strides with the development of diffusion models; however, achieving high temporal consistency remains a challenging task. Recently, FreeInit identified a training-inference gap and introduced a method to iteratively refine the initial noise during inference. However, iterative refinement significantly increases the computational cost associated with video generation. In this paper, we introduce FastInit, a fast noise initialization method that eliminates the need for iterative refinement. FastInit learns a Video Noise Prediction Network (VNPNet) that takes random noise and a text prompt as input, generating refined noise in a single forward pass. Therefore, FastInit greatly enhances the efficiency of video generation while achieving high temporal consistency across frames. To train the VNPNet, we create a large-scale dataset consisting of pairs of text prompts, random noise, and refined noise. Extensive experiments with various text-to-video models show that our method consistently improves the quality and temporal consistency of the generated videos. FastInit not only provides a substantial improvement in video generation but also offers a practical solution that can be applied directly during inference. The code and dataset will be released.
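The abstract describes replacing FreeInit's iterative noise refinement with a single learned forward pass. Below is a minimal NumPy sketch of that inference-time idea, where `vnpnet` is a hypothetical stand-in for the learned Video Noise Prediction Network; the real architecture, prompt encoder, and training details are not specified in the abstract and are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def vnpnet(noise, prompt_emb, weights):
    """Hypothetical stand-in for the learned VNPNet: a single forward
    pass maps (random noise, prompt embedding) -> refined initial noise."""
    cond = np.tanh(prompt_emb @ weights)   # prompt-conditioned signal
    refined = noise + 0.1 * cond           # nudge the noise toward it
    return refined / refined.std()         # keep roughly unit variance

frames, dim = 8, 16
noise = rng.standard_normal((frames, dim))       # standard Gaussian init
prompt_emb = rng.standard_normal((frames, dim))  # assumed text embedding
weights = rng.standard_normal((dim, dim)) / np.sqrt(dim)

# FastInit-style inference: one forward pass, no per-prompt iteration
# (FreeInit would instead loop: denoise, re-noise, and repeat).
refined = vnpnet(noise, prompt_emb, weights)
print(refined.shape)  # refined noise has the same shape as the input
```

The point of the sketch is the interface, not the model: the refined noise is a drop-in replacement for the random initialization, so it adds one forward pass to inference rather than several full sampling rounds.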
Related papers
- READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns a highly compressed video latent space via a VAE, significantly reducing the token count for generation. We show that READ outperforms state-of-the-art methods, generating competitive talking head videos with significantly reduced runtime.
arXiv Detail & Related papers (2025-08-05T13:57:03Z)
- Video-T1: Test-Time Scaling for Video Generation [19.089876374170167]
Researchers in Large Language Models (LLMs) have expanded scaling to test-time computation. We aim to answer the question: if a video generation model is allowed to use a non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt? Experiments on text-conditioned video generation benchmarks demonstrate that increasing test-time compute consistently leads to significant improvements in video quality.
arXiv Detail & Related papers (2025-03-24T17:59:04Z)
- ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos [41.45750971432533]
Video diffusion models (VDMs) facilitate the generation of high-quality videos. Recent studies have uncovered the existence of "golden noises" that can enhance video quality during generation. We propose ScalingNoise, a plug-and-play inference-time search strategy that identifies golden initial noises for the diffusion sampling process.
arXiv Detail & Related papers (2025-03-20T17:54:37Z)
- Generative Video Bi-flow [14.053608981988793]
We propose a novel generative video model that robustly learns temporal change as a neural Ordinary Differential Equation (ODE) flow. We demonstrate unconditional video generation in a streaming manner on various video datasets.
arXiv Detail & Related papers (2025-03-09T00:03:59Z)
- Enhancing Multi-Text Long Video Generation Consistency without Tuning: Time-Frequency Analysis, Prompt Alignment, and Theory [92.1714656167712]
We propose a temporal Attention Reweighting Algorithm (TiARA) to enhance the consistency and coherence of videos generated from either a single prompt or multiple prompts. Our method is supported by a theoretical guarantee, the first of its kind for frequency-based methods in diffusion models. For videos generated from multiple prompts, we further investigate key factors affecting prompt quality and propose PromptBlend, an advanced video prompt pipeline.
arXiv Detail & Related papers (2024-12-23T03:56:27Z)
- Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding [61.89781979702939]
This study quantitatively reveals an "impossible trinity" among data quantity, diversity, and quality in pre-training datasets.
Recent efforts seek to refine large-scale, diverse ASR datasets that are compromised by low quality through synthetic annotations.
We introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods.
arXiv Detail & Related papers (2024-09-29T03:33:35Z)
- Denoising Reuse: Exploiting Inter-frame Motion Consistency for Efficient Video Latent Generation [36.098738197088124]
This work presents a Diffusion Reuse MOtion network to accelerate latent video generation.
Coarse-grained noises in earlier denoising steps have demonstrated high motion consistency across consecutive video frames.
Dr. Mo propagates those coarse-grained noises onto the next frame by incorporating carefully designed, lightweight inter-frame motions.
arXiv Detail & Related papers (2024-09-19T07:50:34Z)
- SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models.
Experiments demonstrate that our method achieves competitive generation quality for synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z)
- POS: A Prompts Optimization Suite for Augmenting Text-to-Video Generation [11.556147036111222]
This paper aims to enhance diffusion-based text-to-video generation by improving its two input prompts: the noise and the text.
We propose POS, a training-free Prompt Optimization Suite to boost text-to-video models.
arXiv Detail & Related papers (2023-11-02T02:33:09Z)
- FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling [85.60543452539076]
Existing video generation models are typically trained on a limited number of frames, so they cannot generate high-fidelity long videos at inference time.
This study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts.
We propose FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pretrained video diffusion models.
arXiv Detail & Related papers (2023-10-23T17:59:58Z)
- VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation [88.49030739715701]
This work presents a decomposed diffusion process that resolves the per-frame noise into a base noise shared among all frames and a residual noise that varies along the time axis. Experiments on various datasets confirm that our approach, termed VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
arXiv Detail & Related papers (2023-03-15T02:16:39Z)
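VideoFusion's decomposition above (shared base noise plus per-frame residual) can be sketched in a few lines. This is a toy illustration under an assumed fixed mixing weight `alpha`; the paper's actual mixing schedule and noise predictors are not reproduced. With independent standard-normal components, each frame's noise stays standard normal, and any two frames are correlated through the shared base:

```python
import numpy as np

rng = np.random.default_rng(1)

def decomposed_noise(frames, dim, alpha=0.5):
    """Resolve per-frame noise into a shared base plus residuals.

    With b, r_i ~ N(0, 1) independent and eps_i = sqrt(alpha)*b
    + sqrt(1-alpha)*r_i, each eps_i is still N(0, 1), and any two
    frames share correlation alpha through the common base b.
    """
    base = rng.standard_normal(dim)                # shared by all frames
    residual = rng.standard_normal((frames, dim))  # varies along time axis
    return np.sqrt(alpha) * base + np.sqrt(1 - alpha) * residual

eps = decomposed_noise(frames=16, dim=4096, alpha=0.5)
print(eps.shape)  # one noise vector per frame; std stays near 1.0
```

The design choice this mirrors is that the shared base gives consecutive frames correlated initializations (helping temporal consistency), while the residual preserves per-frame variation.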
This list is automatically generated from the titles and abstracts of the papers on this site.