FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation
- URL: http://arxiv.org/abs/2509.25187v1
- Date: Mon, 29 Sep 2025 17:59:53 GMT
- Title: FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation
- Authors: Yunyang Ge, Xinhua Cheng, Chengshu Zhao, Xianyi He, Shenghai Yuan, Bin Lin, Bin Zhu, Li Yuan,
- Abstract summary: In Image-to-Video (I2V) generation, a video is created using an input image as the first-frame condition.<n> conditional image leakage leads to performance degradation issues such as slow motion and color inconsistency.<n>We introduce Fourier-Guided Latent Shifting I2V, named FlashI2V, to prevent conditional image leakage.<n>With only 1.3B parameters, FlashI2V achieves a dynamic degree score of 53.01 on Vbench-I2V, surpassing CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P.
- Score: 41.86273464335121
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In Image-to-Video (I2V) generation, a video is created using an input image as the first-frame condition. Existing I2V methods concatenate the full information of the conditional image with noisy latents to achieve high fidelity. However, the denoisers in these methods tend to shortcut the conditional image, which is known as conditional image leakage, leading to performance degradation issues such as slow motion and color inconsistency. In this work, we further clarify that conditional image leakage leads to overfitting to in-domain data and decreases the performance in out-of-domain scenarios. Moreover, we introduce Fourier-Guided Latent Shifting I2V, named FlashI2V, to prevent conditional image leakage. Concretely, FlashI2V consists of: (1) Latent Shifting. We modify the source and target distributions of flow matching by subtracting the conditional image information from the noisy latents, thereby incorporating the condition implicitly. (2) Fourier Guidance. We use high-frequency magnitude features obtained by the Fourier Transform to accelerate convergence and enable the adjustment of detail levels in the generated video. Experimental results show that our method effectively overcomes conditional image leakage and achieves the best generalization and performance on out-of-domain data among various I2V paradigms. With only 1.3B parameters, FlashI2V achieves a dynamic degree score of 53.01 on Vbench-I2V, surpassing CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P. Github page: https://pku-yuangroup.github.io/FlashI2V/
Related papers
- Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance [70.12690940725092]
adaptive low-pass guidance (ALG) is a simple fix to the I2V model sampling procedure to generate more dynamic videos.<n>Under VBench-I2V test suite, ALG achieves an average improvement of 36% in dynamic degree without a significant drop in video quality or image fidelity.
arXiv Detail & Related papers (2025-06-10T05:23:46Z) - NOFT: Test-Time Noise Finetune via Information Bottleneck for Highly Correlated Asset Creation [70.96827354717459]
diffusion model has provided a strong tool for implementing text-to-image (T2I) and image-to-image (I2I) generation.<n>We propose a noise finetune NOFT module employed by Stable Diffusion to generate highly correlated and diverse images.
arXiv Detail & Related papers (2025-05-18T05:09:47Z) - FrameBridge: Improving Image-to-Video Generation with Bridge Models [21.888786343816875]
Diffusion models have achieved remarkable progress on image-to-video (I2V) generation.<n>Their noise-to-data generation process is inherently mismatched with this task, which may lead to suboptimal synthesis quality.<n>By modeling the frame-to-frames generation process with a bridge model based data-to-data generative process, we are able to fully exploit the information contained in the given image.
arXiv Detail & Related papers (2024-10-20T12:10:24Z) - Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model [31.70050311326183]
Diffusion models tend to generate videos with less motion than expected.
We address this issue from both inference and training aspects.
Our methods outperform baselines by producing higher motion scores with lower errors.
arXiv Detail & Related papers (2024-06-22T04:56:16Z) - TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models [40.38379402600541]
TI2V-Zero is a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image.
To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process.
We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model.
arXiv Detail & Related papers (2024-04-25T03:21:11Z) - VideoElevator: Elevating Video Generation Quality with Versatile
Text-to-Image Diffusion Models [94.25084162939488]
Text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment.
We introduce VideoElevator, a training-free and plug-and-play method, which elevates the performance of T2V using superior capabilities of T2I.
arXiv Detail & Related papers (2024-03-08T16:44:54Z) - Tuning-Free Noise Rectification for High Fidelity Image-to-Video
Generation [23.81997037880116]
Image-to-video (I2V) generation tasks always suffer from keeping high fidelity in the open domains.
Several recent I2V frameworks can generate dynamic content for open domain images but fail to maintain fidelity.
We propose an effective method that can be applied to mainstream video diffusion models.
arXiv Detail & Related papers (2024-03-05T09:57:47Z) - FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video
Synthesis [66.2611385251157]
Diffusion models have transformed the image-to-image (I2I) synthesis and are now permeating into videos.
This paper proposes a consistent V2V synthesis framework by jointly leveraging spatial conditions and temporal optical flow clues within the source video.
arXiv Detail & Related papers (2023-12-29T16:57:12Z) - I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models [80.32562822058924]
Text-guided image-to-video (I2V) generation aims to generate a coherent video that preserves the identity of the input image.
I2V-Adapter adeptly propagates the unnoised input image to subsequent noised frames through a cross-frame attention mechanism.
Our experimental results demonstrate that I2V-Adapter is capable of producing high-quality videos.
arXiv Detail & Related papers (2023-12-27T19:11:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.