Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object
Video Generation
- URL: http://arxiv.org/abs/2306.03988v2
- Date: Sun, 14 Jan 2024 00:29:39 GMT
- Title: Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object
Video Generation
- Authors: Aram Davtyan and Paolo Favaro
- Abstract summary: We propose an unsupervised method to generate videos from a single frame and a sparse motion input.
Our trained model can generate unseen realistic object-to-object interactions.
We show that YODA is on par with or better than state of the art video generation prior work in terms of both controllability and video quality.
- Score: 26.292052071093945
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel unsupervised method to autoregressively generate videos
from a single frame and a sparse motion input. Our trained model can generate
unseen realistic object-to-object interactions. Although our model has never
been given the explicit segmentation and motion of each object in the scene
during training, it is able to implicitly separate their dynamics and extents.
Key components in our method are the randomized conditioning scheme, the
encoding of the input motion control, and the randomized and sparse sampling to
enable generalization to out of distribution but realistic correlations. Our
model, which we call YODA, has therefore the ability to move objects without
physically touching them. Through extensive qualitative and quantitative
evaluations on several datasets, we show that YODA is on par with or better
than state of the art video generation prior work in terms of both
controllability and video quality.
Related papers
- VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models [71.9811050853964]
VideoJAM is a novel framework that instills an effective motion prior to video generators.
VideoJAM achieves state-of-the-art performance in motion coherence.
These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation.
arXiv Detail & Related papers (2025-02-04T17:07:10Z) - InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor.
Our key insight is that large video foundation models can act as both neurals and implicit physics simulators by learning interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z) - Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback [130.090296560882]
We investigate the use of feedback to enhance the object dynamics in text-to-video models.
We show that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions.
arXiv Detail & Related papers (2024-12-03T17:44:23Z) - Motion Dreamer: Realizing Physically Coherent Video Generation through Scene-Aware Motion Reasoning [27.690736225683825]
We propose bfMotion Dreamer, a two-stage video generation framework.
By decoupling motion reasoning from high-fidelity video synthesis, our approach allows for more accurate and physically plausible motion generation.
Our work opens new avenues in creating models that can reason about physical interactions in a more coherent and realistic manner.
arXiv Detail & Related papers (2024-11-30T17:40:49Z) - Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics [67.97235923372035]
We present Puppet-Master, an interactive video generative model that can serve as a motion prior for part-level dynamics.
At test time, given a single image and a sparse set of motion trajectories, Puppet-Master can synthesize a video depicting realistic part-level motion faithful to the given drag interactions.
arXiv Detail & Related papers (2024-08-08T17:59:38Z) - TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
generated video sequences by our TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z) - ROAM: Robust and Object-Aware Motion Generation Using Neural Pose
Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object.
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z) - Understanding Object Dynamics for Interactive Image-to-Video Synthesis [8.17925295907622]
We present an approach that learns naturally-looking global articulations caused by a local manipulation at a pixel level.
Our generative model learns to infer natural object dynamics as a response to user interaction.
In contrast to existing work on video prediction, we do not synthesize arbitrary realistic videos.
arXiv Detail & Related papers (2021-06-21T17:57:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.