FastInit: Fast Noise Initialization for Temporally Consistent Video Generation
- URL: http://arxiv.org/abs/2506.16119v1
- Date: Thu, 19 Jun 2025 08:11:45 GMT
- Title: FastInit: Fast Noise Initialization for Temporally Consistent Video Generation
- Authors: Chengyu Bai, Yuming Li, Zhongyu Zhao, Jintao Chen, Peidong Jia, Qi She, Ming Lu, Shanghang Zhang
- Abstract summary: We introduce FastInit, a method that eliminates the need for iterative refinement during inference. FastInit greatly enhances the efficiency of video generation while achieving high temporal consistency across frames.
- Score: 27.825641236811887
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video generation has made significant strides with the development of diffusion models; however, achieving high temporal consistency remains a challenging task. Recently, FreeInit identified a training-inference gap and introduced a method to iteratively refine the initial noise during inference. However, iterative refinement significantly increases the computational cost associated with video generation. In this paper, we introduce FastInit, a fast noise initialization method that eliminates the need for iterative refinement. FastInit learns a Video Noise Prediction Network (VNPNet) that takes random noise and a text prompt as input, generating refined noise in a single forward pass. Therefore, FastInit greatly enhances the efficiency of video generation while achieving high temporal consistency across frames. To train the VNPNet, we create a large-scale dataset consisting of pairs of text prompts, random noise, and refined noise. Extensive experiments with various text-to-video models show that our method consistently improves the quality and temporal consistency of the generated videos. FastInit not only provides a substantial improvement in video generation but also offers a practical solution that can be applied directly during inference. The code and dataset will be released.
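The abstract describes replacing FreeInit's iterative noise refinement with a single learned forward pass. Below is a minimal NumPy sketch of that inference-time idea, where `vnpnet` is a hypothetical stand-in for the learned Video Noise Prediction Network; the real architecture, prompt encoder, and training details are not specified in the abstract and are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def vnpnet(noise, prompt_emb, weights):
    """Hypothetical stand-in for the learned VNPNet: a single forward
    pass maps (random noise, prompt embedding) -> refined initial noise."""
    cond = np.tanh(prompt_emb @ weights)   # prompt-conditioned signal
    refined = noise + 0.1 * cond           # nudge the noise toward it
    return refined / refined.std()         # keep roughly unit variance

frames, dim = 8, 16
noise = rng.standard_normal((frames, dim))       # standard Gaussian init
prompt_emb = rng.standard_normal((frames, dim))  # assumed text embedding
weights = rng.standard_normal((dim, dim)) / np.sqrt(dim)

# FastInit-style inference: one forward pass, no per-prompt iteration
# (FreeInit would instead loop: denoise, re-noise, and repeat).
refined = vnpnet(noise, prompt_emb, weights)
print(refined.shape)  # refined noise has the same shape as the input
```

The point of the sketch is the interface, not the model: the refined noise is a drop-in replacement for the random initialization, so it adds one forward pass to inference rather than several full sampling rounds.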
Related papers
- READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns a highly compressed video latent space via a VAE, significantly reducing the token count for generation. We show that READ outperforms state-of-the-art methods, generating competitive talking head videos with significantly reduced runtime.
arXiv Detail & Related papers (2025-08-05T13:57:03Z)
- Video-T1: Test-Time Scaling for Video Generation [19.089876374170167]
Researchers in Large Language Models (LLMs) have expanded scaling to test-time computation. We aim to answer the question: if a video generation model is allowed to use a non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt? Experiments on text-conditioned video generation benchmarks demonstrate that increasing test-time compute consistently leads to significant improvements in video quality.
arXiv Detail & Related papers (2025-03-24T17:59:04Z)
- ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos [41.45750971432533]
Video diffusion models (VDMs) facilitate the generation of high-quality videos. Recent studies have uncovered the existence of "golden noises" that can enhance video quality during generation. We propose ScalingNoise, a plug-and-play inference-time search strategy that identifies golden initial noises for the diffusion sampling process.
arXiv Detail & Related papers (2025-03-20T17:54:37Z)
- Generative Video Bi-flow [14.053608981988793]
We propose a novel generative video model that robustly learns temporal change as a neural Ordinary Differential Equation (ODE) flow. We demonstrate unconditional video generation in a streaming manner on various video datasets.
arXiv Detail & Related papers (2025-03-09T00:03:59Z)
- Enhancing Multi-Text Long Video Generation Consistency without Tuning: Time-Frequency Analysis, Prompt Alignment, and Theory [92.1714656167712]
We propose a temporal Attention Reweighting Algorithm (TiARA) to enhance the consistency and coherence of videos generated from either a single prompt or multiple prompts. Our method is supported by a theoretical guarantee, the first of its kind for frequency-based methods in diffusion models. For videos generated from multiple prompts, we further investigate key factors affecting prompt quality and propose PromptBlend, an advanced video prompt pipeline.
arXiv Detail & Related papers (2024-12-23T03:56:27Z)
- Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding [61.89781979702939]
This study quantitatively reveals an "impossible trinity" among data quantity, diversity, and quality in pre-training datasets.
Recent efforts seek to refine large-scale, diverse ASR datasets that are compromised by low quality through synthetic annotations.
We introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods.
arXiv Detail & Related papers (2024-09-29T03:33:35Z)
- Denoising Reuse: Exploiting Inter-frame Motion Consistency for Efficient Video Latent Generation [36.098738197088124]
This work presents a Diffusion Reuse MOtion network to accelerate latent video generation.
Coarse-grained noises in earlier denoising steps have demonstrated high motion consistency across consecutive video frames.
Dr. Mo propagates those coarse-grained noises onto the next frame by incorporating carefully designed, lightweight inter-frame motions.
arXiv Detail & Related papers (2024-09-19T07:50:34Z)
- SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models.
Experiments demonstrate that our method achieves competitive generation quality for synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z)
- POS: A Prompts Optimization Suite for Augmenting Text-to-Video Generation [11.556147036111222]
This paper aims to enhance diffusion-based text-to-video generation by improving its two input prompts: the noise and the text.
We propose POS, a training-free Prompt Optimization Suite to boost text-to-video models.
arXiv Detail & Related papers (2023-11-02T02:33:09Z)
- FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling [85.60543452539076]
Existing video generation models are typically trained on a limited number of frames, so they cannot generate high-fidelity long videos at inference time.
This study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts.
We propose FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pretrained video diffusion models.
arXiv Detail & Related papers (2023-10-23T17:59:58Z)
- VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation [88.49030739715701]
This work presents a decomposed diffusion process that resolves the per-frame noise into a base noise shared among all frames and a residual noise that varies along the time axis. Experiments on various datasets confirm that our approach, termed VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
arXiv Detail & Related papers (2023-03-15T02:16:39Z)
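VideoFusion's decomposition above (shared base noise plus per-frame residual) can be sketched in a few lines. This is a toy illustration under an assumed fixed mixing weight `alpha`; the paper's actual mixing schedule and noise predictors are not reproduced. With independent standard-normal components, each frame's noise stays standard normal, and any two frames are correlated through the shared base:

```python
import numpy as np

rng = np.random.default_rng(1)

def decomposed_noise(frames, dim, alpha=0.5):
    """Resolve per-frame noise into a shared base plus residuals.

    With b, r_i ~ N(0, 1) independent and eps_i = sqrt(alpha)*b
    + sqrt(1-alpha)*r_i, each eps_i is still N(0, 1), and any two
    frames share correlation alpha through the common base b.
    """
    base = rng.standard_normal(dim)                # shared by all frames
    residual = rng.standard_normal((frames, dim))  # varies along time axis
    return np.sqrt(alpha) * base + np.sqrt(1 - alpha) * residual

eps = decomposed_noise(frames=16, dim=4096, alpha=0.5)
print(eps.shape)  # one noise vector per frame; std stays near 1.0
```

The design choice this mirrors is that the shared base gives consecutive frames correlated initializations (helping temporal consistency), while the residual preserves per-frame variation.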
This list is automatically generated from the titles and abstracts of the papers on this site.