APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency
- URL: http://arxiv.org/abs/2308.12605v2
- Date: Thu, 2 May 2024 01:07:49 GMT
- Title: APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency
- Authors: Yupu Yao, Shangqi Deng, Zihan Cao, Harry Zhang, Liang-Jian Deng
- Abstract summary: We propose a novel text-to-video (T2V) generation network structure based on diffusion models.
Our approach only necessitates a single video as input and builds upon pre-trained stable diffusion networks.
We leverage a hybrid architecture of transformers and convolutions to compensate for temporal intricacies, enhancing consistency between different frames within the video.
- Score: 9.07931905323022
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models have exhibited promising progress in video generation. However, they often struggle to retain consistent details within local regions across frames. One underlying cause is that traditional diffusion models approximate Gaussian noise distribution by utilizing predictive noise, without fully accounting for the impact of inherent information within the input itself. Additionally, these models emphasize the distinction between predictions and references, neglecting information intrinsic to the videos. To address this limitation, inspired by the self-attention mechanism, we propose a novel text-to-video (T2V) generation network structure based on diffusion models, dubbed Additional Perturbation for Latent noise with Adversarial training (APLA). Our approach only necessitates a single video as input and builds upon pre-trained stable diffusion networks. Notably, we introduce an additional compact network, known as the Video Generation Transformer (VGT). This auxiliary component is designed to extract perturbations from the inherent information contained within the input, thereby refining inconsistent pixels during temporal predictions. We leverage a hybrid architecture of transformers and convolutions to compensate for temporal intricacies, enhancing consistency between different frames within the video. Experiments demonstrate a noticeable improvement in the consistency of the generated videos both qualitatively and quantitatively.
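As a reading aid, here is a minimal, hypothetical sketch of the mechanism the abstract describes: a compact transformer-plus-convolution module (standing in for the VGT) predicts an input-dependent perturbation that is added to the sampled latent noise. All class, variable, and scale names below are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of an APLA-style auxiliary network (names are illustrative,
# not the authors' code): a compact transformer + convolution module predicts a
# perturbation from the input latents, which is added to the sampled diffusion noise.
import torch
import torch.nn as nn

class PerturbationNet(nn.Module):
    """Toy stand-in for the Video Generation Transformer (VGT)."""
    def __init__(self, channels=4, dim=128, heads=4, layers=2):
        super().__init__()
        self.proj_in = nn.Conv2d(channels, dim, kernel_size=3, padding=1)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.proj_out = nn.Conv2d(dim, channels, kernel_size=3, padding=1)

    def forward(self, latents):                    # latents: (frames, C, H, W)
        f, c, h, w = latents.shape
        x = self.proj_in(latents)                  # (f, dim, H, W)
        tokens = x.flatten(2).permute(2, 0, 1)     # (H*W, f, dim): attend across frames per pixel
        tokens = self.temporal(tokens)
        x = tokens.permute(1, 2, 0).reshape(f, -1, h, w)
        return self.proj_out(x)                    # per-frame perturbation, same shape as latents

# Usage sketch: perturb the Gaussian latent noise with input-dependent information.
frames, C, H, W = 8, 4, 32, 32
latents = torch.randn(frames, C, H, W)             # latents of the single input video
noise = torch.randn_like(latents)                  # standard diffusion noise
vgt = PerturbationNet()
perturbed_noise = noise + 0.1 * vgt(latents)       # 0.1 is an arbitrary scale for illustration
```

Attending across frames at each spatial location mirrors the stated goal of refining inconsistent pixels during temporal predictions.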
Related papers
- Edge-preserving noise for diffusion models [4.435514696080208]
We present a novel edge-preserving diffusion model that is a generalization of denoising diffusion probabilistic models (DDPM).
In particular, we introduce an edge-aware noise scheduler that varies between edge-preserving and isotropic Gaussian noise.
We show that our model's generative process converges faster to results that more closely match the target distribution.
arXiv Detail & Related papers (2024-10-02T13:29:52Z)
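The edge-preserving entry above hinges on a noise scheduler that varies between edge-preserving and isotropic Gaussian noise. The sketch below is one plausible reading, assuming a Sobel edge map simply attenuates noise near edges and a linear schedule blends toward isotropic noise; the paper's exact formulation may differ.

```python
# Hedged sketch: interpolate between isotropic Gaussian noise and noise attenuated
# on image edges, controlled by a time-dependent weight. Details are illustrative only.
import torch
import torch.nn.functional as F

def sobel_edge_map(img):
    """Per-pixel edge magnitude in [0, 1] for a grayscale image of shape (1, 1, H, W)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    mag = (gx ** 2 + gy ** 2).sqrt()
    return mag / (mag.max() + 1e-8)

def edge_aware_noise(img, t, T):
    """Blend edge-preserving and isotropic noise; an example linear schedule in t."""
    iso = torch.randn_like(img)             # isotropic Gaussian noise
    edges = sobel_edge_map(img)
    edge_preserving = iso * (1.0 - edges)   # attenuate noise where edges are strong
    gamma = t / T                           # illustrative schedule: isotropic as t -> T
    return (1.0 - gamma) * edge_preserving + gamma * iso

img = torch.rand(1, 1, 64, 64)
noisy_component = edge_aware_noise(img, t=100, T=1000)
```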
- Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z)
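The uni-directional temporal attention highlighted in the Live2Diff entry above can be pictured as standard attention with a causal mask over the frame axis, so each frame attends only to itself and to earlier frames, which is what makes streaming use possible. A toy sketch, not the paper's implementation:

```python
# Hedged sketch of uni-directional (causal) temporal attention over frame features:
# each frame may attend only to itself and to past frames, which suits streaming input.
import torch
import torch.nn.functional as F

def unidirectional_temporal_attention(frame_feats):
    """frame_feats: (num_frames, dim) per-frame feature vectors (toy setting)."""
    f, d = frame_feats.shape
    q = k = v = frame_feats                            # shared projection omitted for brevity
    scores = q @ k.t() / d ** 0.5                      # (f, f) attention logits
    causal_mask = torch.ones(f, f).triu(1).bool()      # True above the diagonal = future frames
    scores = scores.masked_fill(causal_mask, float("-inf"))  # block attention to the future
    weights = F.softmax(scores, dim=-1)
    return weights @ v                                 # causally mixed frame features

feats = torch.randn(6, 32)                             # 6 frames, 32-dim features
out = unidirectional_temporal_attention(feats)
```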
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced by a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with purpose-designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- Training-Free Semantic Video Composition via Pre-trained Diffusion Model [96.0168609879295]
Current approaches, predominantly trained on videos with adjusted foreground color and lighting, struggle to address deep semantic disparities beyond superficial adjustments.
We propose a training-free pipeline employing a pre-trained diffusion model imbued with semantic prior knowledge.
Experimental results reveal that our pipeline successfully ensures the visual harmony and inter-frame coherence of the outputs.
arXiv Detail & Related papers (2024-01-17T13:07:22Z)
- FreeInit: Bridging Initialization Gap in Video Diffusion Models [42.38240625514987]
FreeInit is able to compensate for the gap between training and inference, thus effectively improving the subject appearance and temporal consistency of generation results.
Experiments demonstrate that FreeInit consistently enhances the generation quality of various text-to-video diffusion models without additional training or fine-tuning.
arXiv Detail & Related papers (2023-12-12T18:59:16Z)
- Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence locally by integrating temporal layers into the U-Net and VAE-Decoder, maintaining consistency within short sequences.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z)
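The adjustable noise level mentioned in the Upscale-A-Video entry above can be read as controlling how much forward-diffusion noise is injected into the low-resolution latent: less noise favors faithful restoration, more noise lets the model regenerate texture. A hedged sketch under that assumption, using a standard DDPM-style linear beta schedule rather than the released code:

```python
# Hedged sketch: inject a controllable amount of diffusion noise into a latent.
# A small noise_level favors faithful restoration; a large one favors generation.
import torch

def inject_noise(latent, noise_level, num_train_steps=1000):
    """Standard DDPM-style forward noising at a chosen step (linear beta schedule)."""
    betas = torch.linspace(1e-4, 0.02, num_train_steps)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[noise_level]
    noise = torch.randn_like(latent)
    return alpha_bar.sqrt() * latent + (1.0 - alpha_bar).sqrt() * noise

lr_latent = torch.randn(1, 4, 32, 32)               # latent of a low-resolution video frame
gentle = inject_noise(lr_latent, noise_level=100)   # restoration-leaning
strong = inject_noise(lr_latent, noise_level=700)   # generation-leaning
```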
- Diffusion Models as Masked Autoencoders [52.442717717898056]
We revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models.
While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate them as masked autoencoders (DiffMAE).
We perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders.
arXiv Detail & Related papers (2023-04-06T17:59:56Z)
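Conditioning a diffusion model on masked input, as in the DiffMAE entry above, amounts to keeping a random subset of patches clean as context and applying the forward noising process only to the masked patches the model must reconstruct. A toy sketch of that input preparation; the mask ratio and noise level are illustrative assumptions:

```python
# Hedged sketch: prepare a DiffMAE-style input by keeping a random subset of patches
# visible and noising the rest, which the model reconstructs conditioned on the visible ones.
import torch

def mask_and_noise_patches(patches, alpha_bar=0.3, mask_ratio=0.75):
    """patches: (num_patches, patch_dim). Noise the masked patches at one diffusion level."""
    n = patches.shape[0]
    mask = torch.zeros(n, dtype=torch.bool)
    mask[torch.randperm(n)[:int(n * mask_ratio)]] = True
    noised = patches.clone()
    noise = torch.randn_like(patches[mask])
    noised[mask] = alpha_bar ** 0.5 * patches[mask] + (1 - alpha_bar) ** 0.5 * noise
    return noised, mask                 # visible patches stay clean as conditioning

patches = torch.randn(196, 768)         # e.g. 14x14 patches from a ViT-style tokenizer
model_input, mask = mask_and_noise_patches(patches)
```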
- VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation [88.49030739715701]
This work presents a decomposed diffusion process that resolves the per-frame noise into a base noise shared among all frames and a residual noise that varies along the time axis.
Experiments on various datasets confirm that our approach, termed VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
arXiv Detail & Related papers (2023-03-15T02:16:39Z)
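The VideoFusion entry above resolves per-frame noise into a base component shared by all frames plus a per-frame residual. A minimal sketch of that decomposition; the mixing weight here is a fixed illustrative constant, whereas the paper parameterizes this weighting more carefully:

```python
# Hedged sketch of VideoFusion-style noise decomposition: every frame shares one base
# noise tensor, and a per-frame residual carries the variation along the time axis.
import torch

def decomposed_video_noise(num_frames, shape, sqrt_lambda=0.8):
    """Mix a shared base noise with per-frame residual noise.

    sqrt_lambda controls how much noise is shared across frames; the weights are
    chosen so each frame's noise is still unit-variance Gaussian.
    """
    base = torch.randn(shape)                                  # shared across all frames
    residual = torch.randn(num_frames, *shape)                 # varies along the time axis
    return sqrt_lambda * base + (1 - sqrt_lambda ** 2) ** 0.5 * residual

noise = decomposed_video_noise(num_frames=16, shape=(4, 32, 32))  # (16, 4, 32, 32)
```

Because the two weights are chosen so the variances sum to one, each frame's noise remains a standard Gaussian while most of it is shared across frames, which is what ties the frames together.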
This list is automatically generated from the titles and abstracts of the papers on this site; the site does not guarantee the accuracy of the information and is not responsible for any consequences of its use.