MotionBridge: Dynamic Video Inbetweening with Flexible Controls
- URL: http://arxiv.org/abs/2412.13190v3
- Date: Tue, 07 Jan 2025 22:06:07 GMT
- Title: MotionBridge: Dynamic Video Inbetweening with Flexible Controls
- Authors: Maham Tanveer, Yang Zhou, Simon Niklaus, Ali Mahdavi Amiri, Hao Zhang, Krishna Kumar Singh, Nanxuan Zhao
- Abstract summary: We introduce MotionBridge, a unified video inbetweening framework.
It allows flexible controls, including trajectory strokes, keyframes, masks, guide pixels, and text.
We show that such multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.
- Score: 29.029643539300434
- Abstract: By generating plausible and smooth transitions between two image frames, video inbetweening is an essential tool for video editing and long video synthesis. Traditional methods lack the capability to generate complex, large motions. While recent video generation techniques are powerful in creating high-quality results, they often lack fine control over the details of intermediate frames, which can lead to results that do not align with the creator's intent. We introduce MotionBridge, a unified video inbetweening framework that allows flexible controls, including trajectory strokes, keyframes, masks, guide pixels, and text. However, learning such multi-modal controls in a unified framework is a challenging task. We thus design two generators to extract the control signals faithfully and encode features through dual-branch embedders to resolve ambiguities. We further introduce a curriculum training strategy to smoothly learn various controls. Extensive qualitative and quantitative experiments demonstrate that such multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.
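The abstract names dual-branch control embedders and a curriculum over control types but gives no architectural details here. Below is a minimal, hypothetical PyTorch sketch of both ideas; the module names, the additive fusion of the two branches, and the step thresholds in the schedule are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class DualBranchControlEmbedder(nn.Module):
    """Hypothetical sketch: one branch embeds sparse motion controls
    (trajectory strokes rasterized into per-frame maps), the other embeds
    dense pixel controls (keyframes, masks, guide pixels)."""

    def __init__(self, traj_channels: int, pixel_channels: int, latent_channels: int):
        super().__init__()
        self.motion_branch = nn.Sequential(
            nn.Conv3d(traj_channels, 64, kernel_size=3, padding=1), nn.SiLU(),
            nn.Conv3d(64, latent_channels, kernel_size=3, padding=1),
        )
        self.pixel_branch = nn.Sequential(
            nn.Conv3d(pixel_channels, 64, kernel_size=3, padding=1), nn.SiLU(),
            nn.Conv3d(64, latent_channels, kernel_size=3, padding=1),
        )

    def forward(self, traj_map: torch.Tensor, pixel_map: torch.Tensor) -> torch.Tensor:
        # Keeping "where things move" and "what pixels should show" in separate
        # branches is one way to resolve ambiguity between control types;
        # additive fusion into the denoiser's latent is an assumption here.
        return self.motion_branch(traj_map) + self.pixel_branch(pixel_map)


def active_controls(step: int):
    """Hypothetical curriculum schedule: enable controls in stages so the
    model first masters simple conditioning before the full multi-modal set."""
    schedule = [
        (0, ["keyframes"]),
        (10_000, ["keyframes", "trajectory"]),
        (20_000, ["keyframes", "trajectory", "mask", "guide_pixels", "text"]),
    ]
    active = schedule[0][1]
    for start_step, controls in schedule:
        if step >= start_step:
            active = controls
    return active
```

The staged schedule mirrors the stated goal of "smoothly" learning heterogeneous controls: easier, denser signals first, then progressively sparser or more ambiguous ones.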
Related papers
- DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation [54.30327187663316]
DiTCtrl is the first training-free multi-prompt video generation method built on MM-DiT architectures.
We analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to that of the cross/self-attention blocks in the UNet-like diffusion models.
Based on our careful design, the video generated by DiTCtrl achieves smooth transitions and consistent object motion given multiple sequential prompts.
arXiv Detail & Related papers (2024-12-24T18:51:19Z) - Motion Prompting: Controlling Video Generation with Motion Trajectories [57.049252242807874]
We train a video generation model conditioned on sparse or dense video trajectories.
We translate high-level user requests into detailed, semi-dense motion prompts.
We demonstrate our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing.
arXiv Detail & Related papers (2024-12-03T18:59:56Z) - I2VControl: Disentangled and Unified Video Motion Synthesis Control [11.83645633418189]
We present a disentangled and unified framework, namely I2VControl, that unifies multiple motion control tasks in image-to-video synthesis.
Our approach partitions the video into individual motion units and represents each unit with disentangled control signals.
Our methodology seamlessly integrates as a plug-in for pre-trained models and remains agnostic to specific model architectures.
arXiv Detail & Related papers (2024-11-26T04:21:22Z) - AnimateAnything: Consistent and Controllable Animation for Video Generation [24.576022028967195]
We present a unified controllable video generation approach AnimateAnything.
It facilitates precise and consistent video manipulation across various conditions.
Experiments demonstrate that our method outperforms the state-of-the-art approaches.
arXiv Detail & Related papers (2024-11-16T16:36:49Z) - DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control [42.506988751934685]
We present DreamVideo-2, a zero-shot video customization framework capable of generating videos with a specific subject and motion trajectory.
Specifically, we introduce reference attention, which leverages the model's inherent capabilities for subject learning.
We devise a mask-guided motion module to achieve precise motion control by fully utilizing the robust motion signal of box masks.
arXiv Detail & Related papers (2024-10-17T17:52:57Z) - Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention [62.2447324481159]
Cavia is a novel framework for camera-controllable, multi-view video generation.
Our framework extends the spatial and temporal attention modules, improving both viewpoint and temporal consistency.
Cavia is the first of its kind that allows the user to specify distinct camera motion while obtaining object motion.
arXiv Detail & Related papers (2024-10-14T17:46:32Z) - TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
The video sequences generated by TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z) - LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with only 8-16 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture features along the temporal dimension, we expand the pretrained 2D convolution layers of the T2I model into our novel temporal-spatial motion learning layers (a minimal sketch of this kind of 2D-to-temporal inflation appears after the paper list below).
arXiv Detail & Related papers (2023-10-16T19:03:19Z) - ControlVideo: Training-free Controllable Text-to-Video Generation [117.06302461557044]
ControlVideo is a framework to enable natural and efficient text-to-video generation.
It generates both short and long videos within several minutes using one NVIDIA 2080Ti.
arXiv Detail & Related papers (2023-05-22T14:48:53Z) - A Good Image Generator Is What You Need for High-Resolution Video Synthesis [73.82857768949651]
We present a framework that leverages contemporary image generators to render high-resolution videos.
We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator.
We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled.
arXiv Detail & Related papers (2021-04-30T15:38:41Z)
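Several entries above adapt pretrained image models to video; the LAMP entry in particular mentions expanding pretrained 2D convolutions into temporal-spatial motion learning layers. The PyTorch sketch below shows a common, generic form of this 2D-to-temporal "inflation" (a pretrained spatial convolution followed by an identity-initialized temporal convolution); the class name, identity initialization, and tensor layout are assumptions and not LAMP's published design.

```python
import torch
import torch.nn as nn


class TemporalSpatialConv(nn.Module):
    """Illustrative sketch: keep a pretrained 2D (spatial) convolution and
    append a new 1D temporal convolution over the frame axis, initialized as
    an identity so training starts from the image model's behavior."""

    def __init__(self, conv2d: nn.Conv2d):
        super().__init__()
        self.spatial = conv2d  # pretrained T2I layer
        channels = conv2d.out_channels
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        nn.init.dirac_(self.temporal.weight)  # identity at initialization
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        # Apply the pretrained spatial conv to every frame independently.
        y = self.spatial(x.transpose(1, 2).reshape(b * t, c, h, w))
        c2, h2, w2 = y.shape[1:]
        y = y.reshape(b, t, c2, h2, w2)
        # Mix information across frames at each spatial location.
        y = y.permute(0, 3, 4, 2, 1).reshape(b * h2 * w2, c2, t)
        y = self.temporal(y)
        return y.reshape(b, h2, w2, c2, t).permute(0, 3, 4, 1, 2)


# Usage example with hypothetical shapes: inflate one layer of an image model.
pretrained = nn.Conv2d(64, 64, kernel_size=3, padding=1)
layer = TemporalSpatialConv(pretrained)
video = torch.randn(1, 64, 8, 32, 32)  # (B, C, T, H, W)
out = layer(video)                      # same shape as the input
```

Identity initialization keeps the inflated layer's output equal to the pretrained image layer's at the start of fine-tuning, so motion-specific behavior can be learned without discarding the image prior.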
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.