ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
- URL: http://arxiv.org/abs/2402.04324v2
- Date: Mon, 1 Jul 2024 03:57:55 GMT
- Title: ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
- Authors: Weiming Ren, Huan Yang, Ge Zhang, Cong Wei, Xinrun Du, Wenhao Huang, Wenhu Chen,
- Abstract summary: Image-to-video (I2V) generation aims to use the initial frame (alongside a text prompt) to create a video sequence.
Existing methods often struggle to preserve the integrity of the subject, background, and style from the first frame.
We propose ConsistI2V, a diffusion-based method to enhance visual consistency for I2V generation.
- Score: 37.05422543076405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image-to-video (I2V) generation aims to use the initial frame (alongside a text prompt) to create a video sequence. A grand challenge in I2V generation is to maintain visual consistency throughout the video: existing methods often struggle to preserve the integrity of the subject, background, and style from the first frame, as well as ensure a fluid and logical progression within the video narrative. To mitigate these issues, we propose ConsistI2V, a diffusion-based method to enhance visual consistency for I2V generation. Specifically, we introduce (1) spatiotemporal attention over the first frame to maintain spatial and motion consistency, (2) noise initialization from the low-frequency band of the first frame to enhance layout consistency. These two approaches enable ConsistI2V to generate highly consistent videos. We also extend the proposed approaches to show their potential to improve consistency in auto-regressive long video generation and camera motion control. To verify the effectiveness of our method, we propose I2V-Bench, a comprehensive evaluation benchmark for I2V generation. Our automatic and human evaluation results demonstrate the superiority of ConsistI2V over existing methods.
Related papers
- StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text [58.49820807662246]
We introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200 or more frames with smooth transitions.
Our code will be available at: https://github.com/Picsart-AI-Research/StreamingT2V.
arXiv Detail & Related papers (2024-03-21T18:27:29Z) - VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis [18.806249040835624]
We introduce the concept of Generative Temporal Nursing (GTN), where we aim to alter the generative process on the fly during inference to improve control over the temporal dynamics.
We experimentally showcase the superiority of the proposed approach in generating longer, visually appealing videos over existing open-sourced T2V models.
arXiv Detail & Related papers (2024-03-20T10:58:58Z) - Motion-I2V: Consistent and Controllable Image-to-Video Generation with
Explicit Motion Modeling [62.19142543520805]
Motion-I2V is a framework for consistent and controllable image-to-video generation.
It factorizes I2V into two stages with explicit motion modeling.
Motion-I2V's second stage naturally supports zero-shot video-to-video translation.
arXiv Detail & Related papers (2024-01-29T09:06:43Z) - E2HQV: High-Quality Video Generation from Event Camera via
Theory-Inspired Model-Aided Deep Learning [53.63364311738552]
Bio-inspired event cameras or dynamic vision sensors are capable of capturing per-pixel brightness changes (called event-streams) in high temporal resolution and high dynamic range.
It calls for events-to-video (E2V) solutions which take event-streams as input and generate high quality video frames for intuitive visualization.
We propose textbfE2HQV, a novel E2V paradigm designed to produce high-quality video frames from events.
arXiv Detail & Related papers (2024-01-16T05:10:50Z) - I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models [80.32562822058924]
Text-guided image-to-video (I2V) generation aims to generate a coherent video that preserves the identity of the input image.
I2V-Adapter adeptly propagates the unnoised input image to subsequent noised frames through a cross-frame attention mechanism.
Our experimental results demonstrate that I2V-Adapter is capable of producing high-quality videos.
arXiv Detail & Related papers (2023-12-27T19:11:50Z) - ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation [33.37279673304]
We introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text.
ConditionVideo generates realistic dynamic videos from random noise or given scene videos.
Our method exhibits superior performance in terms of frame consistency, clip score, and conditional accuracy, outperforming other compared methods.
arXiv Detail & Related papers (2023-10-11T17:46:28Z) - Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.