Related papers: VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide

VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide

URL: http://arxiv.org/abs/2410.04364v2
Date: Tue, 8 Oct 2024 08:56:24 GMT
Title: VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide
Authors: Dohun Lee, Bryan S Kim, Geon Yeong Park, Jong Chul Ye,
Abstract summary: VideoGuide is a novel framework that enhances the temporal consistency of pretrained text-to-video (T2V) models. It improves temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process. The proposed method brings about significant improvement in temporal consistency and image fidelity.
Score: 48.22321420680046
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-to-image (T2I) diffusion models have revolutionized visual content creation, but extending these capabilities to text-to-video (T2V) generation remains a challenge, particularly in preserving temporal consistency. Existing methods that aim to improve consistency often cause trade-offs such as reduced imaging quality and impractical computational time. To address these issues we introduce VideoGuide, a novel framework that enhances the temporal consistency of pretrained T2V models without the need for additional training or fine-tuning. Instead, VideoGuide leverages any pretrained video diffusion model (VDM) or itself as a guide during the early stages of inference, improving temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process. The proposed method brings about significant improvement in temporal consistency and image fidelity, providing a cost-effective and practical solution that synergizes the strengths of various video diffusion models. Furthermore, we demonstrate prior distillation, revealing that base models can achieve enhanced text coherence by utilizing the superior data prior of the guiding model through the proposed method. Project Page: https://dohunlee1.github.io/videoguide.github.io/

Related papers

Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis [14.980220974022982]
We introduce EVS, a training-free Encapsulated Video Synthesizer that composes T2I and T2V models to enhance both visual fidelity and motion smoothness.<n>Our approach utilizes a well-trained diffusion-based T2I model to refine low-quality video frames.<n>We also employ T2V backbones to ensure consistent motion dynamics.
arXiv Detail & Related papers (2025-07-18T08:59:02Z)
Dual-Expert Consistency Model for Efficient and High-Quality Video Generation [57.33788820909211]
We propose a parameter-efficient textbfDual-Expert Consistency Model(DCM), where a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine detail refinement.<n>Our approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation.
arXiv Detail & Related papers (2025-06-03T17:55:04Z)
Diffusion Adversarial Post-Training for One-Step Video Generation [26.14991703029242]
We propose Adversarial Post-Training (APT) against real data following diffusion pre-training for one-step video generation. Our model is capable of generating 1024px images in a single step, achieving quality comparable to state-of-the-art methods.
arXiv Detail & Related papers (2025-01-14T18:51:48Z)
FrameBridge: Improving Image-to-Video Generation with Bridge Models [23.19370431940568]
Image-to-video (I2V) generation is gaining increasing attention with its wide application in video synthesis. We present FrameBridge, taking the given static image as the prior of video target and establishing a tractable bridge model between them. We propose two techniques, SNR- Fine-tuning (SAF) and neural prior, which improve the fine-tuning efficiency of diffusion-based T2V models to FrameBridge and the synthesis quality of bridge-based I2V models respectively.
arXiv Detail & Related papers (2024-10-20T12:10:24Z)
HARIVO: Harnessing Text-to-Image Models for Video Generation [45.63338167699105]
We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique. Our method, built on the frozen StableDiffusion model, simplifies training processes and allows for seamless integration with off-the-shelf models like ControlNet and DreamBooth.
arXiv Detail & Related papers (2024-10-10T09:47:39Z)
SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models. Experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z)
Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation [59.184980778643464]
Fine-tuning Diffusion Models remains an underexplored frontier in generative artificial intelligence (GenAI) In this paper, we introduce an innovative technique called self-play fine-tuning for diffusion models (SPIN-Diffusion) Our approach offers an alternative to conventional supervised fine-tuning and RL strategies, significantly improving both model performance and alignment.
arXiv Detail & Related papers (2024-02-15T18:59:18Z)
InstructVideo: Instructing Video Diffusion Models with Human Feedback [65.9590462317474]
We propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning. InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing.
arXiv Detail & Related papers (2023-12-19T17:55:16Z)
Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling. It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences. It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z)
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z)
LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models [133.088893990272]
We learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. We propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models.
arXiv Detail & Related papers (2023-09-26T17:52:03Z)
Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps. We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process. Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.