ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
- URL: http://arxiv.org/abs/2402.04324v2
- Date: Mon, 1 Jul 2024 03:57:55 GMT
- Title: ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
- Authors: Weiming Ren, Huan Yang, Ge Zhang, Cong Wei, Xinrun Du, Wenhao Huang, Wenhu Chen
- Abstract summary: Image-to-video (I2V) generation aims to use the initial frame (alongside a text prompt) to create a video sequence.
Existing methods often struggle to preserve the integrity of the subject, background, and style from the first frame.
We propose ConsistI2V, a diffusion-based method to enhance visual consistency for I2V generation.
- Score: 37.05422543076405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image-to-video (I2V) generation aims to use the initial frame (alongside a text prompt) to create a video sequence. A grand challenge in I2V generation is to maintain visual consistency throughout the video: existing methods often struggle to preserve the integrity of the subject, background, and style from the first frame, as well as ensure a fluid and logical progression within the video narrative. To mitigate these issues, we propose ConsistI2V, a diffusion-based method to enhance visual consistency for I2V generation. Specifically, we introduce (1) spatiotemporal attention over the first frame to maintain spatial and motion consistency, (2) noise initialization from the low-frequency band of the first frame to enhance layout consistency. These two approaches enable ConsistI2V to generate highly consistent videos. We also extend the proposed approaches to show their potential to improve consistency in auto-regressive long video generation and camera motion control. To verify the effectiveness of our method, we propose I2V-Bench, a comprehensive evaluation benchmark for I2V generation. Our automatic and human evaluation results demonstrate the superiority of ConsistI2V over existing methods.
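The second mechanism in the abstract, noise initialization from the low-frequency band of the first frame, can be illustrated with a minimal sketch: low-pass the first frame in the frequency domain and let it supply the low-frequency content of the initial noise, while high frequencies remain random. This is an assumption-laden illustration using a numpy FFT and an illustrative `cutoff` parameter, not the authors' implementation (which operates on latents inside a diffusion sampler).

```python
import numpy as np

def low_freq_noise_init(first_frame, cutoff=0.25, seed=0):
    """Sketch: mix the low-frequency band of the first frame into Gaussian noise.

    first_frame: 2D array (H, W), e.g. one channel of the conditioning image.
    cutoff: illustrative fraction of the spectrum (per axis) kept as "low frequency".
    Returns noise whose low-frequency content follows the first frame,
    so the overall layout is anchored while fine detail stays random.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(first_frame.shape)

    # Move both signals to the frequency domain, DC component centered.
    frame_fft = np.fft.fftshift(np.fft.fft2(first_frame))
    noise_fft = np.fft.fftshift(np.fft.fft2(noise))

    # Centered box low-pass mask covering `cutoff` of each axis.
    h, w = first_frame.shape
    mask = np.zeros((h, w))
    ch, cw = int(h * cutoff / 2), int(w * cutoff / 2)
    mask[h // 2 - ch : h // 2 + ch, w // 2 - cw : w // 2 + cw] = 1.0

    # Low frequencies from the frame, high frequencies from the noise.
    mixed_fft = frame_fft * mask + noise_fft * (1.0 - mask)
    return np.fft.ifft2(np.fft.ifftshift(mixed_fft)).real
```

Because the mask keeps the frame's DC and low-frequency terms, the result preserves the frame's coarse layout (e.g. its mean intensity) while everything above the cutoff is pure noise.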
Related papers
- BachVid: Training-Free Video Generation with Consistent Background and Character [62.46376250180513]
Diffusion Transformers (DiTs) have recently driven significant progress in text-to-video (T2V) generation.
Existing methods typically rely on reference images or extensive training, and often only address character consistency.
We introduce BachVid, the first training-free method that achieves consistent video generation without needing any reference images.
arXiv Detail & Related papers (2025-10-24T17:56:37Z)
- VidSplice: Towards Coherent Video Inpainting via Explicit Spaced Frame Guidance [57.57195766748601]
VidSplice is a novel framework that guides the inpainting process with temporal cues.
We show that VidSplice achieves competitive performance across diverse video inpainting scenarios.
arXiv Detail & Related papers (2025-10-24T13:44:09Z)
- Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization [38.70220886362519]
We propose Identity-Preserving Reward-guided Optimization (IPRO) for image-to-video (I2V) generation.
IPRO is a novel video diffusion framework based on reinforcement learning to enhance identity preservation.
Our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer feedback.
arXiv Detail & Related papers (2025-10-16T03:13:47Z)
- Generating Human Motion Videos using a Cascaded Text-to-Video Framework [27.77921324288557]
We propose CAMEO, a cascaded framework for general human motion video generation.
It seamlessly bridges Text-to-Motion (T2M) models and conditional video diffusion models (VDMs).
We demonstrate the effectiveness of our approach on both the MovieGen benchmark and a newly introduced benchmark tailored to the T2M-VDM combination.
arXiv Detail & Related papers (2025-10-04T19:16:28Z)
- Identity-Preserving Text-to-Video Generation via Training-Free Prompt, Image, and Guidance Enhancement [58.85593321752693]
Identity-preserving text-to-video (IPT2V) generation creates videos faithful to both a reference subject image and a text prompt.
We introduce a Training-Free Prompt, Image, and Guidance Enhancement framework that bridges the semantic gap between the video description and the reference image.
We win first place in the ACM Multimedia 2025 Identity-Preserving Video Generation Challenge.
arXiv Detail & Related papers (2025-09-01T11:03:13Z)
- Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis [14.980220974022982]
We introduce EVS, a training-free Encapsulated Video Synthesizer that composes T2I and T2V models to enhance both visual fidelity and motion smoothness.
Our approach utilizes a well-trained diffusion-based T2I model to refine low-quality video frames.
We also employ T2V backbones to ensure consistent motion dynamics.
arXiv Detail & Related papers (2025-07-18T08:59:02Z)
- Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance [70.12690940725092]
Adaptive low-pass guidance (ALG) is a simple fix to the I2V model sampling procedure that generates more dynamic videos.
On the VBench-I2V test suite, ALG achieves an average improvement of 36% in dynamic degree without a significant drop in video quality or image fidelity.
arXiv Detail & Related papers (2025-06-10T05:23:46Z)
- SkyReels-A2: Compose Anything in Video Diffusion Transformers [27.324119455991926]
This paper presents SkyReels-A2, a controllable video generation framework capable of assembling arbitrary visual elements into synthesized videos.
We term this task elements-to-video (E2V); its primary challenges lie in preserving the fidelity of each reference element, ensuring coherent composition of the scene, and achieving natural outputs.
We propose a novel image-text joint embedding model to inject multi-element representations into the generative process, balancing element-specific consistency with global coherence and text alignment.
arXiv Detail & Related papers (2025-04-03T09:50:50Z)
- Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models [89.79067761383855]
Vchitect-2.0 is a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation.
By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames.
To overcome memory and computational bottlenecks, we propose a Memory-efficient Training framework.
arXiv Detail & Related papers (2025-01-14T21:53:11Z)
- FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance [3.6519202494141125]
We introduce FancyVideo, an innovative video generator that improves the existing text-control mechanism with a Cross-frame Textual Guidance Module (CTGM).
CTGM incorporates a Temporal Information Injector (TII) and a Temporal Affinity Refiner (TAR) at the beginning and end of cross-attention, respectively.
Our approach achieves state-of-the-art T2V generation results on the EvalCrafter benchmark.
arXiv Detail & Related papers (2024-08-15T14:47:44Z)
- StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text [58.49820807662246]
We introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200 or more frames with smooth transitions.
Our code will be available at: https://github.com/Picsart-AI-Research/StreamingT2V.
arXiv Detail & Related papers (2024-03-21T18:27:29Z)
- VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis [18.806249040835624]
We introduce the concept of Generative Temporal Nursing (GTN), where we aim to alter the generative process on the fly during inference to improve control over the temporal dynamics.
We experimentally showcase the superiority of the proposed approach in generating longer, visually appealing videos over existing open-sourced T2V models.
arXiv Detail & Related papers (2024-03-20T10:58:58Z)
- Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling [62.19142543520805]
Motion-I2V is a framework for consistent and controllable image-to-video generation.
It factorizes I2V into two stages with explicit motion modeling.
Motion-I2V's second stage naturally supports zero-shot video-to-video translation.
arXiv Detail & Related papers (2024-01-29T09:06:43Z)
- E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning [53.63364311738552]
Bio-inspired event cameras or dynamic vision sensors are capable of capturing per-pixel brightness changes (called event-streams) in high temporal resolution and high dynamic range.
It calls for events-to-video (E2V) solutions which take event-streams as input and generate high quality video frames for intuitive visualization.
We propose E2HQV, a novel E2V paradigm designed to produce high-quality video frames from events.
arXiv Detail & Related papers (2024-01-16T05:10:50Z)
- I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models [80.32562822058924]
Text-guided image-to-video (I2V) generation aims to generate a coherent video that preserves the identity of the input image.
I2V-Adapter adeptly propagates the unnoised input image to subsequent noised frames through a cross-frame attention mechanism.
Our experimental results demonstrate that I2V-Adapter is capable of producing high-quality videos.
arXiv Detail & Related papers (2023-12-27T19:11:50Z)
- ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation [33.37279673304]
We introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text.
ConditionVideo generates realistic dynamic videos from random noise or given scene videos.
Our method exhibits superior performance in terms of frame consistency, clip score, and conditional accuracy, outperforming other compared methods.
arXiv Detail & Related papers (2023-10-11T17:46:28Z)
- Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.