AnyI2V: Animating Any Conditional Image with Motion Control
- URL: http://arxiv.org/abs/2507.02857v1
- Date: Thu, 03 Jul 2025 17:59:02 GMT
- Title: AnyI2V: Animating Any Conditional Image with Motion Control
- Authors: Ziye Li, Hao Luo, Xincheng Shuai, Henghui Ding
- Abstract summary: We propose AnyI2V, a training-free framework that animates any conditional image with user-defined motion trajectories. Experiments demonstrate that the proposed AnyI2V achieves superior performance and provides a new perspective in spatial- and motion-controlled video generation.
- Score: 25.49332963076066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in video generation, particularly in diffusion models, have driven notable progress in text-to-video (T2V) and image-to-video (I2V) synthesis. However, challenges remain in effectively integrating dynamic motion signals and flexible spatial constraints. Existing T2V methods typically rely on text prompts, which inherently lack precise control over the spatial layout of generated content. In contrast, I2V methods are limited by their dependence on real images, which restricts the editability of the synthesized content. Although some methods incorporate ControlNet to introduce image-based conditioning, they often lack explicit motion control and require computationally expensive training. To address these limitations, we propose AnyI2V, a training-free framework that animates any conditional image with user-defined motion trajectories. AnyI2V supports a broader range of modalities as the conditional image, including data types such as meshes and point clouds that are not supported by ControlNet, enabling more flexible and versatile video generation. Additionally, it supports mixed conditional inputs and enables style transfer and editing via LoRA and text prompts. Extensive experiments demonstrate that the proposed AnyI2V achieves superior performance and provides a new perspective in spatial- and motion-controlled video generation. Code is available at https://henghuiding.com/AnyI2V/.
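The abstract describes the framework only at a high level. As a loose, hypothetical illustration of what user-defined trajectory control over a latent feature can mean (a toy translation-based sketch, not AnyI2V's actual algorithm; all names and shapes here are assumptions):

```python
import torch
import torch.nn.functional as F

# Toy sketch (not AnyI2V's actual algorithm): move a latent feature
# map along a user-defined 2D trajectory by translating it per frame.
def animate_along_trajectory(latent, trajectory):
    """latent: (C, H, W) feature of the conditional image.
    trajectory: list of (dx, dy) pixel offsets, one per output frame."""
    c, h, w = latent.shape
    frames = []
    for dx, dy in trajectory:
        # Affine grid that shifts the feature content by (dx, dy) pixels.
        theta = torch.tensor([[1.0, 0.0, -2.0 * dx / w],
                              [0.0, 1.0, -2.0 * dy / h]]).unsqueeze(0)
        grid = F.affine_grid(theta, (1, c, h, w), align_corners=False)
        frames.append(F.grid_sample(latent.unsqueeze(0), grid,
                                    align_corners=False))
    return torch.cat(frames, dim=0)  # (T, C, H, W)

frames = animate_along_trajectory(torch.randn(4, 32, 32),
                                  [(0, 0), (2, 1), (4, 2), (6, 3)])
print(frames.shape)  # torch.Size([4, 4, 32, 32])
```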
Related papers
- Incorporating Flexible Image Conditioning into Text-to-Video Diffusion Models without Training [27.794381157153776]
We introduce a unified formulation for TI2V generation with flexible visual conditioning. We propose an innovative training-free approach, dubbed FlexTI2V, that can condition T2V foundation models on an arbitrary number of images. Our method surpasses previous training-free image conditioning methods by a notable margin.
arXiv Detail & Related papers (2025-05-27T02:16:06Z)
- Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think [24.308538128761985]
Image-to-Video (I2V) generation aims to synthesize a video clip according to a given image and condition (e.g., text). The key challenge of this task lies in simultaneously generating natural motions while preserving the original appearance of the images. We propose a novel Extrapolating and Decoupling framework, which introduces model merging techniques to the I2V domain for the first time.
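Model merging here refers to the general weight-space technique; a minimal sketch of its basic primitive, interpolating or extrapolating between two checkpoints with matching parameter keys (not the paper's specific Extrapolating and Decoupling scheme), could look like:

```python
import torch
import torch.nn as nn

# Generic weight-space model merging: interpolate (alpha < 1) or
# extrapolate (alpha > 1) between two checkpoints with matching keys.
def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    return {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}

model_a, model_b = nn.Linear(8, 8), nn.Linear(8, 8)
merged = nn.Linear(8, 8)
merged.load_state_dict(merge_state_dicts(model_a.state_dict(),
                                          model_b.state_dict(),
                                          alpha=1.5))  # extrapolation
```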
arXiv Detail & Related papers (2025-03-02T16:06:16Z)
- VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation [62.64811405314847]
VidCRAFT3 is a novel framework for precise image-to-video generation. It enables control over camera motion, object motion, and lighting direction simultaneously. It produces high-quality video content, outperforming state-of-the-art methods in control granularity and visual coherence.
arXiv Detail & Related papers (2025-02-11T13:11:59Z)
- Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning [26.44634685830323]
We propose a novel framework called DEcomposed MOtion (DEMO) to enhance motion synthesis in Text-to-Video (T2V) generation.
Our method includes a content encoder for static elements and a motion encoder for temporal dynamics, alongside separate content and motion conditioning mechanisms.
We demonstrate DEMO's superior ability to produce videos with enhanced motion dynamics while maintaining high visual quality.
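A hedged sketch of the decomposed-encoding idea, with one encoder for static appearance and one for temporal dynamics (module names and shapes are illustrative assumptions, not DEMO's actual architecture):

```python
import torch
import torch.nn as nn

# Illustrative decomposition (not DEMO's actual architecture): one
# encoder for static appearance, one for temporal dynamics, so the
# two signals can be conditioned on separately.
class DecomposedEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.content = nn.Conv2d(3, dim, 3, padding=1)  # per-frame appearance
        self.motion = nn.Conv3d(3, dim, 3, padding=1)   # cross-frame dynamics

    def forward(self, video):                 # video: (B, 3, T, H, W)
        content = self.content(video[:, :, 0])          # first frame only
        motion = self.motion(video).mean(dim=2)         # pooled over time
        return content, motion

enc = DecomposedEncoder()
content, motion = enc(torch.randn(1, 3, 8, 32, 32))
print(content.shape, motion.shape)  # (1, 64, 32, 32) twice
```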
arXiv Detail & Related papers (2024-10-31T17:59:53Z)
- Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models [52.28245595257831]
We show that despite the limitations of current T2V models, cross-attention guidance can be a promising approach for editing videos.
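The primitive being guided is the cross-attention map between video latent tokens and text tokens; a minimal, self-contained sketch of extracting such a map (shapes are illustrative, and the paper's guidance procedure builds on top of this):

```python
import torch

# Bare cross-attention between video tokens (queries) and text tokens
# (keys/values); editing methods steer the resulting attention maps.
def cross_attention(q, k, v):
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5,
                         dim=-1)
    return attn @ v, attn   # attn: per-token relevance to each text token

q = torch.randn(1, 256, 64)     # 256 spatio-temporal latent tokens
k = v = torch.randn(1, 16, 64)  # 16 text tokens
out, attn_map = cross_attention(q, k, v)
print(out.shape, attn_map.shape)  # (1, 256, 64) (1, 256, 16)
```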
arXiv Detail & Related papers (2024-04-08T13:40:01Z)
- Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling [62.19142543520805]
Motion-I2V is a framework for consistent and controllable image-to-video generation.
It factorizes I2V into two stages with explicit motion modeling.
Motion-I2V's second stage naturally supports zero-shot video-to-video translation.
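A toy sketch of such a two-stage factorization, where stage 1 predicts a dense motion field and stage 2 renders frames by warping the input image along it (a warping toy with a placeholder flow predictor, not Motion-I2V's actual diffusion stages):

```python
import torch
import torch.nn.functional as F

def stage1_predict_flow(image, num_frames):
    # Placeholder "motion predictor": a constant rightward drift.
    b, c, h, w = image.shape
    flow = torch.zeros(num_frames, h, w, 2)   # normalized (x, y) offsets
    flow[..., 0] = torch.linspace(0, 0.2, num_frames).view(-1, 1, 1)
    return flow

def stage2_render(image, flow):
    t, h, w, _ = flow.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1)       # identity sampling grid
    grid = base.unsqueeze(0) + flow            # displace per frame
    return F.grid_sample(image.expand(t, -1, -1, -1), grid,
                         align_corners=False)

image = torch.randn(1, 3, 32, 32)
video = stage2_render(image, stage1_predict_flow(image, 8))
print(video.shape)  # torch.Size([8, 3, 32, 32])
```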
arXiv Detail & Related papers (2024-01-29T09:06:43Z)
- I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models [80.32562822058924]
Text-guided image-to-video (I2V) generation aims to generate a coherent video that preserves the identity of the input image.
I2V-Adapter adeptly propagates the unnoised input image to subsequent noised frames through a cross-frame attention mechanism.
Our experimental results demonstrate that I2V-Adapter is capable of producing high-quality videos.
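A hedged sketch of cross-frame attention in which every frame's queries attend to the first (unnoised) frame's keys and values (shapes and the single-frame simplification are assumptions, not I2V-Adapter's exact design):

```python
import torch

# Sketch of first-frame cross-frame attention: each frame's queries
# attend to the keys/values of frame 0, propagating its identity.
def first_frame_attention(q, k, v):
    # q, k, v: (T, N, D) per-frame token sequences
    k0 = k[:1].expand_as(k)   # broadcast frame-0 keys to all frames
    v0 = v[:1].expand_as(v)
    attn = torch.softmax(q @ k0.transpose(-2, -1) / q.shape[-1] ** 0.5,
                         dim=-1)
    return attn @ v0

q = k = v = torch.randn(16, 256, 64)  # 16 frames, 256 tokens, dim 64
out = first_frame_attention(q, k, v)
print(out.shape)  # torch.Size([16, 256, 64])
```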
arXiv Detail & Related papers (2023-12-27T19:11:50Z)
- SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models [84.71887272654865]
We present SparseCtrl to enable flexible structure control with temporally sparse signals.
It incorporates an additional condition to process these sparse signals while leaving the pre-trained T2V model untouched.
The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images.
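A minimal sketch of the frozen-backbone-plus-condition-branch pattern the summary describes (a generic zero-initialized residual branch, not SparseCtrl's exact encoder):

```python
import torch
import torch.nn as nn

# Generic frozen-backbone + trainable condition branch (the pattern the
# summary describes; not SparseCtrl's exact architecture).
class ConditionedBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.base = nn.Conv2d(dim, dim, 3, padding=1)
        self.base.requires_grad_(False)     # pre-trained weights stay untouched
        self.branch = nn.Conv2d(dim, dim, 3, padding=1)
        nn.init.zeros_(self.branch.weight)  # zero-init: starts as identity
        nn.init.zeros_(self.branch.bias)

    def forward(self, x, cond):
        # cond carries the (possibly temporally sparse) control signal;
        # frames without a signal can pass cond = 0.
        return self.base(x) + self.branch(cond)

block = ConditionedBlock()
y = block(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```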
arXiv Detail & Related papers (2023-11-28T16:33:08Z)
- ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation [33.37279673304]
We introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text.
ConditionVideo generates realistic dynamic videos from random noise or given scene videos.
Our method exhibits superior performance in terms of frame consistency, clip score, and conditional accuracy, outperforming the compared methods.
arXiv Detail & Related papers (2023-10-11T17:46:28Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects (spatial and temporal resolution, faithfulness to text, and quality), Make-A-Video sets the new state of the art in text-to-video generation.
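Extending T2I layers with spatial-temporal modules is commonly done with factorized ("pseudo-3D") convolutions; a minimal sketch of that generic pattern (illustrative, not Make-A-Video's exact modules):

```python
import torch
import torch.nn as nn

# Factorized "pseudo-3D" convolution: a spatial 2D conv followed by a
# temporal 1D conv, a common way to lift T2I layers to video.
class Pseudo3DConv(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.spatial = nn.Conv3d(dim, dim, (1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(dim, dim, (3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.temporal(self.spatial(x))

conv = Pseudo3DConv()
print(conv(torch.randn(1, 64, 8, 16, 16)).shape)  # (1, 64, 8, 16, 16)
```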
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
- Make It Move: Controllable Image-to-Video Generation with Text Descriptions [69.52360725356601]
The TI2V task aims at generating videos from a static image and a text description.
We propose a Motion Anchor-based video GEnerator (MAGE) with an innovative motion anchor structure.
Experiments verify the effectiveness of MAGE and show the appealing potential of the TI2V task.
arXiv Detail & Related papers (2021-12-06T07:00:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.