Related papers: Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models

Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models

URL: http://arxiv.org/abs/2305.10474v3
Date: Tue, 26 Mar 2024 01:11:52 GMT
Title: Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
Authors: Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, Yogesh Balaji,
Abstract summary: Off-the-shelf billion-scale datasets for image generation are available, but collecting similar video data of the same scale is still challenging. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. Our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks.
Score: 52.93036326078229
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting similar video data of the same scale is still challenging. Also, training a video diffusion model is computationally much more expensive than its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. We find that naively extending the image noise prior to video noise prior in video diffusion leads to sub-optimal performance. Our carefully designed video noise prior leads to substantially better performance. Extensive experimental validation shows that our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves SOTA video generation quality on the small-scale UCF-101 benchmark with a $10\times$ smaller model using significantly less computation than the prior art.

Related papers

AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset [55.82208863521353]
We propose AccVideo to reduce the inference steps for accelerating video diffusion models with synthetic dataset. Our model achieves 8.5x improvements in generation speed compared to the teacher model. Compared to previous accelerating methods, our approach is capable of generating videos with higher quality and resolution.
arXiv Detail & Related papers (2025-03-25T08:52:07Z)
FrameBridge: Improving Image-to-Video Generation with Bridge Models [23.19370431940568]
Image-to-video (I2V) generation is gaining increasing attention with its wide application in video synthesis. We present FrameBridge, taking the given static image as the prior of video target and establishing a tractable bridge model between them. We propose two techniques, SNR- Fine-tuning (SAF) and neural prior, which improve the fine-tuning efficiency of diffusion-based T2V models to FrameBridge and the synthesis quality of bridge-based I2V models respectively.
arXiv Detail & Related papers (2024-10-20T12:10:24Z)
Fine-gained Zero-shot Video Sampling [21.42513407755273]
We propose a novel Zero-Shot video sampling algorithm, denoted as $mathcalZS2$. $mathcalZS2$ is capable of directly sampling high-quality video clips without any training or optimization. It achieves state-of-the-art performance in zero-shot video generation, occasionally outperforming recent supervised methods.
arXiv Detail & Related papers (2024-07-31T09:36:58Z)
Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model [31.70050311326183]
Diffusion models tend to generate videos with less motion than expected. We address this issue from both inference and training aspects. Our methods outperform baselines by producing higher motion scores with lower errors.
arXiv Detail & Related papers (2024-06-22T04:56:16Z)
SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models. Experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z)
Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition [124.41196697408627]
We propose content-motion latent diffusion model (CMD), a novel efficient extension of pretrained image diffusion models for video generation. CMD encodes a video as a combination of a content frame (like an image) and a low-dimensional motion latent representation. We generate the content frame by fine-tuning a pretrained image diffusion model, and we generate the motion latent representation by training a new lightweight diffusion model.
arXiv Detail & Related papers (2024-03-21T05:48:48Z)
DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance [69.0740091741732]
We propose a high-fidelity image-to-video generation method by devising a frame retention branch based on a pre-trained video diffusion model, named DreamVideo. Our model has a powerful image retention ability and delivers the best results in UCF101 compared to other image-to-video models to our best knowledge.
arXiv Detail & Related papers (2023-12-05T03:16:31Z)
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models [71.11425812806431]
Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands. Here, we apply the LDM paradigm to high-resolution generation, a particularly resource-intensive task. We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling.
arXiv Detail & Related papers (2023-04-18T08:30:32Z)
Video Diffusion Models [47.99413440461512]
Generating temporally coherent high fidelity video is an important milestone in generative modeling research. We propose a diffusion model for video generation that shows very promising initial results. We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
arXiv Detail & Related papers (2022-04-07T14:08:02Z)
Unified Image and Video Saliency Modeling [21.701431656717112]
We ask: Can image and video saliency modeling be approached via a unified model? We propose four novel domain adaptation techniques and an improved formulation of learned Gaussian priors. We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data. We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and MIT300.
arXiv Detail & Related papers (2020-03-11T18:28:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.