Related papers: FrameBridge: Improving Image-to-Video Generation with Bridge Models

FrameBridge: Improving Image-to-Video Generation with Bridge Models

URL: http://arxiv.org/abs/2410.15371v2
Date: Mon, 16 Jun 2025 07:22:26 GMT
Title: FrameBridge: Improving Image-to-Video Generation with Bridge Models
Authors: Yuji Wang, Zehua Chen, Xiaoyu Chen, Yixiang Wei, Jun Zhu, Jianfei Chen,
Abstract summary: Diffusion models have achieved remarkable progress on image-to-video (I2V) generation.<n>Their noise-to-data generation process is inherently mismatched with this task, which may lead to suboptimal synthesis quality.<n>By modeling the frame-to-frames generation process with a bridge model based data-to-data generative process, we are able to fully exploit the information contained in the given image.
Score: 21.888786343816875
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion models have achieved remarkable progress on image-to-video (I2V) generation, while their noise-to-data generation process is inherently mismatched with this task, which may lead to suboptimal synthesis quality. In this work, we present FrameBridge. By modeling the frame-to-frames generation process with a bridge model based data-to-data generative process, we are able to fully exploit the information contained in the given image and improve the consistency between the generation process and I2V task. Moreover, we propose two novel techniques toward the two popular settings of training I2V models, respectively. Firstly, we propose SNR-Aligned Fine-tuning (SAF), making the first attempt to fine-tune a diffusion model to a bridge model and, therefore, allowing us to utilize the pre-trained diffusion-based text-to-video (T2V) models. Secondly, we propose neural prior, further improving the synthesis quality of FrameBridge when training from scratch. Experiments conducted on WebVid-2M and UCF-101 demonstrate the superior quality of FrameBridge in comparison with the diffusion counterpart (zero-shot FVD 95 vs. 192 on MSR-VTT and non-zero-shot FVD 122 vs. 171 on UCF-101), and the advantages of our proposed SAF and neural prior for bridge-based I2V models. The project page: https://framebridge-icml.github.io/.

Related papers

Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis [14.980220974022982]
We introduce EVS, a training-free Encapsulated Video Synthesizer that composes T2I and T2V models to enhance both visual fidelity and motion smoothness.<n>Our approach utilizes a well-trained diffusion-based T2I model to refine low-quality video frames.<n>We also employ T2V backbones to ensure consistent motion dynamics.
arXiv Detail & Related papers (2025-07-18T08:59:02Z)
Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video Prediction [36.82594554832902]
Text-video prediction (TVP) is a downstream video generation task that requires a model to produce subsequent video frames. We propose an adaptation-based strategy we label Frame-wise Conditioning Adaptation (FCA) We use FCA to fine-tune the T2V model, which incorporates the initial frame(s) as an extra condition.
arXiv Detail & Related papers (2025-03-17T09:06:21Z)
From Slow Bidirectional to Fast Autoregressive Video Diffusion Models [52.32078428442281]
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies.<n>We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly.<n>Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models.
arXiv Detail & Related papers (2024-12-10T18:59:50Z)
STIV: Scalable Text and Image Conditioned Video Generation [84.2574247093223]
We present a simple and scalable text-image-conditioned video generation method, named STIV. Our framework integrates image condition into a Diffusion Transformer (DiT) through frame replacement, while incorporating text conditioning. STIV can be easily extended to various applications, such as video prediction, frame, multi-view generation, and long video generation.
arXiv Detail & Related papers (2024-12-10T18:27:06Z)
ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler [53.98558445900626]
Current image-to-video diffusion models, while powerful in generating videos from a single frame, need adaptation for two-frame conditioned generation.<n>We introduce a novel, bidirectional sampling strategy to address these off-manifold issues without requiring extensive re-noising or fine-tuning.<n>Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent and on-manifold generation of intermediate frames.
arXiv Detail & Related papers (2024-10-08T03:01:54Z)
VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide [48.22321420680046]
VideoGuide is a novel framework that enhances the temporal consistency of pretrained text-to-video (T2V) models. It improves temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process. The proposed method brings about significant improvement in temporal consistency and image fidelity.
arXiv Detail & Related papers (2024-10-06T05:46:17Z)
Let Video Teaches You More: Video-to-Image Knowledge Distillation using DEtection TRansformer for Medical Video Lesion Detection [91.97935118185]
We propose Video-to-Image knowledge distillation for the task of medical video lesion detection. By distilling multi-frame contexts into a single frame, the proposed V2I-DETR combines the advantages of utilizing temporal contexts from video-based models and the inference speed of image-based models. V2I-DETR outperforms previous state-of-the-art methods by a large margin while achieving the real-time inference speed (30 FPS) as the image-based model.
arXiv Detail & Related papers (2024-08-26T07:17:05Z)
AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction. Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task. We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models [40.38379402600541]
TI2V-Zero is a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model.
arXiv Detail & Related papers (2024-04-25T03:21:11Z)
I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models [80.32562822058924]
Text-guided image-to-video (I2V) generation aims to generate a coherent video that preserves the identity of the input image. I2V-Adapter adeptly propagates the unnoised input image to subsequent noised frames through a cross-frame attention mechanism. Our experimental results demonstrate that I2V-Adapter is capable of producing high-quality videos.
arXiv Detail & Related papers (2023-12-27T19:11:50Z)
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models [54.99771394322512]
Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. It still challenges encounters in terms of semantic accuracy, clarity, and continuity-temporal continuity. We propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors. I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos.
arXiv Detail & Related papers (2023-11-07T17:16:06Z)
Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models [52.93036326078229]
Off-the-shelf billion-scale datasets for image generation are available, but collecting similar video data of the same scale is still challenging. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. Our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks.
arXiv Detail & Related papers (2023-05-17T17:59:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.