Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think
- URL: http://arxiv.org/abs/2503.00948v1
- Date: Sun, 02 Mar 2025 16:06:16 GMT
- Title: Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think
- Authors: Jie Tian, Xiaoye Qu, Zhenyi Lu, Wei Wei, Sichen Liu, Yu Cheng
- Abstract summary: Image-to-Video (I2V) generation aims to synthesize a video clip according to a given image and a condition (e.g., text). The key challenge of this task lies in simultaneously generating natural motions while preserving the original appearance of the images. We propose a novel Extrapolating and Decoupling framework, which introduces model merging techniques to the I2V domain for the first time.
- Score: 24.308538128761985
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image-to-Video (I2V) generation aims to synthesize a video clip according to a given image and condition (e.g., text). The key challenge of this task lies in simultaneously generating natural motions while preserving the original appearance of the images. However, current I2V diffusion models (I2V-DMs) often produce videos with limited motion degrees or exhibit uncontrollable motion that conflicts with the textual condition. To address these limitations, we propose a novel Extrapolating and Decoupling framework, which introduces model merging techniques to the I2V domain for the first time. Specifically, our framework consists of three separate stages: (1) Starting with a base I2V-DM, we explicitly inject the textual condition into the temporal module using a lightweight, learnable adapter and fine-tune the integrated model to improve motion controllability. (2) We introduce a training-free extrapolation strategy to amplify the dynamic range of the motion, effectively reversing the fine-tuning process to enhance the motion degree significantly. (3) With the above two-stage models excelling in motion controllability and degree, we decouple the relevant parameters associated with each type of motion ability and inject them into the base I2V-DM. Since the I2V-DM handles different levels of motion controllability and dynamics at various denoising time steps, we adjust the motion-aware parameters accordingly over time. Extensive qualitative and quantitative experiments have been conducted to demonstrate the superiority of our framework over existing methods.
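Because the framework operates directly on model weights, its three stages reduce to parameter arithmetic that can be sketched in a few lines. The snippet below is a minimal, hedged illustration rather than the authors' released code: the extrapolation coefficient `alpha`, the linear per-timestep schedule, and the heuristic that "motion-aware" means any temporal-module parameter are assumptions made purely for clarity.

```python
# Minimal sketch of extrapolating and time-dependent merging over state dicts.
# This illustrates the idea in the abstract; it is not the paper's code.
# alpha, the schedule, and the motion-aware parameter selection are assumed.
from typing import Dict
import torch

StateDict = Dict[str, torch.Tensor]


def extrapolate(base: StateDict, finetuned: StateDict, alpha: float) -> StateDict:
    """Stage 2 (sketch): training-free extrapolation along the fine-tuning direction.

    alpha = 1 recovers the fine-tuned weights, alpha > 1 amplifies the fine-tuned
    behaviour, and alpha < 0 moves against it ("reversing" the fine-tuning, as the
    abstract describes for boosting motion degree). The exact value is an assumption.
    """
    return {k: base[k] + alpha * (finetuned[k] - base[k]) for k in base}


def is_motion_aware(name: str) -> bool:
    # Assumed heuristic: treat temporal-module parameters as the motion-aware set.
    return "temporal" in name


def merge_at_timestep(base: StateDict, controllable: StateDict, dynamic: StateDict,
                      t: int, num_steps: int = 1000) -> StateDict:
    """Stage 3 (sketch): re-inject the decoupled motion-aware parameters into the
    base I2V-DM with a denoising-step-dependent mixture of the two specialists."""
    w_dyn = t / num_steps    # noisier steps lean on the high-dynamics weights (assumed schedule)
    w_ctrl = 1.0 - w_dyn     # cleaner steps lean on the controllability-tuned weights
    merged = {}
    for name, param in base.items():
        if is_motion_aware(name):
            merged[name] = w_ctrl * controllable[name] + w_dyn * dynamic[name]
        else:
            merged[name] = param  # appearance-related parameters keep their base values
    return merged
```

In this reading, `controllable` would be the stage-1 adapter-fine-tuned weights, `dynamic` the stage-2 extrapolated weights, and the merged state dict would be reloaded into the denoiser before each sampling step.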
Related papers
- EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models [73.96414072072048]
Existing motion transfer methods have explored the motion representations of reference videos to guide generation.
We propose EfficientMT, a novel and efficient end-to-end framework for video motion transfer.
Our experiments demonstrate that EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability.
arXiv Detail & Related papers (2025-03-25T05:51:14Z) - MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching [27.28898943916193]
Text-to-video (T2V) diffusion models have promising capabilities in synthesizing realistic videos from input text prompts.
In this work, we tackle the motion customization problem, where a reference video is provided as motion guidance.
We propose MotionMatcher, a motion customization framework that fine-tunes the pre-trained T2V diffusion model at the feature level.
arXiv Detail & Related papers (2025-02-18T19:12:51Z) - Motion-Aware Generative Frame Interpolation [23.380470636851022]
Flow-based frame interpolation methods ensure motion stability through estimated intermediate flow but often introduce severe artifacts in complex motion regions.
Recent generative approaches, boosted by large-scale pre-trained video generation models, show promise in handling intricate scenes.
We propose Motion-aware Generative frame interpolation (MoG), which synergizes intermediate flow guidance with generative capacities to enhance fidelity.
arXiv Detail & Related papers (2025-01-07T11:03:43Z) - MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation [55.238542326124545]
Image-to-video (I2V) generation is conditioned on a static image and has recently been enhanced by motion intensity as an additional control signal. These motion-aware models are appealing for generating diverse motion patterns, yet a reliable motion estimator for training such models on large-scale in-the-wild video sets has been lacking. This paper addresses the challenge with a new motion estimator, capable of measuring the decoupled motion intensities of objects and cameras in video.
arXiv Detail & Related papers (2024-12-08T08:12:37Z) - Generalizable Implicit Motion Modeling for Video Frame Interpolation [51.966062283735596]
Motion modeling is critical in flow-based Video Frame Interpolation (VFI).
We introduce Generalizable Implicit Motion Modeling (GIMM), a novel and effective approach to motion modeling for VFI.
Our GIMM can be easily integrated with existing flow-based VFI works by supplying accurately modeled motion.
arXiv Detail & Related papers (2024-07-11T17:13:15Z) - BAMM: Bidirectional Autoregressive Motion Model [14.668729995275807]
Bidirectional Autoregressive Motion Model (BAMM) is a novel text-to-motion generation framework.
BAMM consists of two key components: a motion tokenizer that transforms 3D human motion into discrete tokens in latent space, and a masked self-attention transformer that autoregressively predicts randomly masked tokens.
This design enables BAMM to simultaneously achieve high-quality motion generation, enhanced usability, and built-in motion editability.
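The two components named above lend themselves to a compact sketch. The code below is an illustrative stand-in rather than BAMM's implementation: the class and parameter names, feature dimensions, codebook size, nearest-neighbour quantization, and the plain masked-prediction pass (the hybrid autoregressive decoding is omitted) are all assumptions.

```python
# Illustrative stand-in for a motion tokenizer plus masked motion transformer.
# Dimensions, codebook size, and the quantization/masking details are assumed.
import torch
import torch.nn as nn


class MotionTokenizer(nn.Module):
    """Maps per-frame 3D motion features to discrete codebook indices."""

    def __init__(self, motion_dim: int = 263, latent_dim: int = 128, codebook_size: int = 512):
        super().__init__()
        self.encoder = nn.Linear(motion_dim, latent_dim)
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def forward(self, motion: torch.Tensor) -> torch.Tensor:
        z = self.encoder(motion)                                # (B, T, latent_dim)
        codes = self.codebook.weight.expand(z.size(0), -1, -1)  # (B, K, latent_dim)
        return torch.cdist(z, codes).argmin(dim=-1)             # token ids, (B, T)


class MaskedMotionTransformer(nn.Module):
    """Predicts the ids of randomly masked motion tokens from the visible ones."""

    def __init__(self, codebook_size: int = 512, dim: int = 256):
        super().__init__()
        self.mask_id = codebook_size                            # extra id reserved for [MASK]
        self.embed = nn.Embedding(codebook_size + 1, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        tokens = tokens.masked_fill(mask, self.mask_id)         # hide masked positions
        hidden = self.backbone(self.embed(tokens))
        return self.head(hidden)                                # logits over codebook ids
```

A training step in this sketch would sample a random boolean mask over the token sequence and apply a cross-entropy loss only at the masked positions.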
arXiv Detail & Related papers (2024-03-28T14:04:17Z) - Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z) - Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling [62.19142543520805]
Motion-I2V is a framework for consistent and controllable image-to-video generation.
It factorizes I2V into two stages with explicit motion modeling.
Motion-I2V's second stage naturally supports zero-shot video-to-video translation.
arXiv Detail & Related papers (2024-01-29T09:06:43Z) - Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [49.298187741014345]
Current methods intertwine spatial content and temporal dynamics, which increases the complexity of text-to-video (T2V) generation.
We propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives.
arXiv Detail & Related papers (2023-12-07T17:59:07Z) - Decouple Content and Motion for Conditional Image-to-Video Generation [6.634105805557556]
The goal of conditional image-to-video (cI2V) generation is to create a believable new video from a given condition, i.e., an image and text.
Previous cI2V generation methods conventionally operate in RGB pixel space, with limitations in modeling motion consistency and visual continuity.
We propose a novel approach by disentangling the target RGB pixels into two distinct components: spatial content and temporal motions.
arXiv Detail & Related papers (2023-11-24T06:08:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.