Motion-I2V: Consistent and Controllable Image-to-Video Generation with
Explicit Motion Modeling
- URL: http://arxiv.org/abs/2401.15977v2
- Date: Wed, 31 Jan 2024 07:41:04 GMT
- Title: Motion-I2V: Consistent and Controllable Image-to-Video Generation with
Explicit Motion Modeling
- Authors: Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi
Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Dai,
Hongsheng Li
- Abstract summary: Motion-I2V is a framework for consistent and controllable image-to-video generation.
It factorizes I2V into two stages with explicit motion modeling.
Motion-I2V's second stage naturally supports zero-shot video-to-video translation.
- Score: 62.19142543520805
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce Motion-I2V, a novel framework for consistent and controllable
image-to-video generation (I2V). In contrast to previous methods that directly
learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into
two stages with explicit motion modeling. For the first stage, we propose a
diffusion-based motion field predictor, which focuses on deducing the
trajectories of the reference image's pixels. For the second stage, we propose
motion-augmented temporal attention to enhance the limited 1-D temporal
attention in video latent diffusion models. This module can effectively
propagate the reference image's features to the synthesized frames under the guidance of
predicted trajectories from the first stage. Compared with existing methods,
Motion-I2V can generate more consistent videos even in the presence of large
motion and viewpoint variation. By training a sparse trajectory ControlNet for
the first stage, Motion-I2V enables users to precisely control motion
trajectories and motion regions with sparse trajectory and region annotations.
This offers more controllability of the I2V process than solely relying on
textual instructions. Additionally, Motion-I2V's second stage naturally
supports zero-shot video-to-video translation. Both qualitative and
quantitative comparisons demonstrate the advantages of Motion-I2V over prior
approaches in consistent and controllable image-to-video generation. Please see
our project page at https://xiaoyushi97.github.io/Motion-I2V/.
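To make the second stage more concrete, below is a minimal PyTorch-style sketch of trajectory-guided feature propagation combined with 1-D temporal attention: the reference image's features are backward-warped to each frame along the motion fields predicted in the first stage, and the frames then attend to the warped reference tokens. All function names, tensor shapes, and the bilinear-warping and token-concatenation choices are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def warp_with_flow(ref_feat, flow):
        # Backward-warp reference features (B, C, H, W) with a flow field (B, 2, H, W).
        b, _, h, w = ref_feat.shape
        ys, xs = torch.meshgrid(
            torch.arange(h, device=ref_feat.device, dtype=ref_feat.dtype),
            torch.arange(w, device=ref_feat.device, dtype=ref_feat.dtype),
            indexing="ij",
        )
        grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow        # (B, 2, H, W)
        gx = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0                    # normalize to [-1, 1]
        gy = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
        return F.grid_sample(ref_feat, torch.stack((gx, gy), dim=-1), align_corners=True)

    def motion_augmented_temporal_attention(frame_feats, ref_feat, flows, attn):
        # frame_feats: (B, T, C, H, W) features of the frames being synthesized.
        # ref_feat:    (B, C, H, W)    features of the reference image.
        # flows:       (B, T, 2, H, W) reference-to-frame motion fields from stage one.
        # attn:        a 1-D attention module over the token dimension, e.g.
        #              torch.nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True).
        b, t, c, h, w = frame_feats.shape
        # Propagate reference features to every frame along the predicted trajectories.
        warped = torch.stack([warp_with_flow(ref_feat, flows[:, i]) for i in range(t)], dim=1)
        # Temporal attention per spatial location; keys/values also include the warped reference.
        q = frame_feats.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        kv = torch.cat((frame_feats, warped), dim=1)
        kv = kv.permute(0, 3, 4, 1, 2).reshape(b * h * w, 2 * t, c)
        out, _ = attn(q, kv, kv)
        return out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

Concatenating the warped reference tokens into the keys and values is one simple way to let every synthesized frame attend directly to reference content that has already been aligned by the predicted motion; the paper's actual attention design may differ in its details.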
Related papers
- MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching [27.28898943916193]
Text-to-video (T2V) diffusion models have promising capabilities in synthesizing realistic videos from input text prompts.
In this work, we tackle the motion customization problem, where a reference video is provided as motion guidance.
We propose MotionMatcher, a motion customization framework that fine-tunes the pre-trained T2V diffusion model at the feature level.
arXiv Detail & Related papers (2025-02-18T19:12:51Z)
- MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation [55.238542326124545]
Image-to-video (I2V) generation is conditioned on the static image, which has been enhanced recently by the motion intensity as an additional control signal.
These motion-aware models are appealing for generating diverse motion patterns, yet a reliable motion estimator for training such models on large-scale videos in the wild has been lacking.
This paper addresses the challenge with a new motion estimator, capable of measuring the decoupled motion intensities of objects and cameras in video.
arXiv Detail & Related papers (2024-12-08T08:12:37Z)
- MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models [3.2311303453753033]
We introduce MotionFlow, a novel framework designed for motion transfer in video diffusion models.
Our method utilizes cross-attention maps to accurately capture and manipulate spatial and temporal dynamics.
Our experiments demonstrate that MotionFlow significantly outperforms existing models in both fidelity and versatility even during drastic scene alterations.
arXiv Detail & Related papers (2024-12-06T18:59:12Z)
- Generalizable Implicit Motion Modeling for Video Frame Interpolation [51.966062283735596]
Motion modeling is critical in flow-based Video Frame Interpolation (VFI).
We introduce Generalizable Implicit Motion Modeling (GIMM), a novel and effective approach to motion modeling for VFI.
Our GIMM can be easily integrated with existing flow-based VFI works by supplying accurately modeled motion.
arXiv Detail & Related papers (2024-07-11T17:13:15Z)
- MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model [78.11258752076046]
MOFA-Video is an advanced controllable image animation method that generates video from the given image using various additional controllable signals.
We design several domain-aware motion field adapters to control the generated motions in the video generation pipeline.
After training, the MOFA-Adapters in different domains can also work together for more controllable video generation.
arXiv Detail & Related papers (2024-05-30T16:22:22Z)
- Decouple Content and Motion for Conditional Image-to-Video Generation [6.634105805557556]
The goal of conditional image-to-video (cI2V) generation is to create a believable new video starting from the condition, i.e., one image and text.
Previous cI2V generation methods conventionally perform in RGB pixel space, with limitations in modeling motion consistency and visual continuity.
We propose a novel approach by disentangling the target RGB pixels into two distinct components: spatial content and temporal motions.
arXiv Detail & Related papers (2023-11-24T06:08:27Z)
- Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)
- MotionRNN: A Flexible Model for Video Prediction with Spacetime-Varying Motions [70.30211294212603]
This paper tackles video prediction from a new dimension: predicting spacetime-varying motions that change incessantly across both space and time.
We propose the MotionRNN framework, which can capture the complex variations within motions and adapt to spacetime-varying scenarios.
arXiv Detail & Related papers (2021-03-03T08:11:50Z)