Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object
Video Generation
- URL: http://arxiv.org/abs/2306.03988v2
- Date: Sun, 14 Jan 2024 00:29:39 GMT
- Title: Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object
Video Generation
- Authors: Aram Davtyan and Paolo Favaro
- Abstract summary: We propose an unsupervised method to generate videos from a single frame and a sparse motion input.
Our trained model can generate unseen realistic object-to-object interactions.
We show that YODA is on par with or better than prior state-of-the-art video generation work in terms of both controllability and video quality.
- Score: 26.292052071093945
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel unsupervised method to autoregressively generate videos
from a single frame and a sparse motion input. Our trained model can generate
unseen realistic object-to-object interactions. Although our model has never
been given the explicit segmentation and motion of each object in the scene
during training, it is able to implicitly separate their dynamics and extents.
Key components in our method are the randomized conditioning scheme, the
encoding of the input motion control, and the randomized and sparse sampling to
enable generalization to out-of-distribution but realistic correlations. Our
model, which we call YODA, therefore has the ability to move objects without
physically touching them. Through extensive qualitative and quantitative
evaluations on several datasets, we show that YODA is on par with or better
than prior state-of-the-art video generation work in terms of both
controllability and video quality.
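The abstract describes the motion-control interface only in prose, so the sketch below makes it concrete. It is an illustrative assumption rather than the authors' implementation: the function names, the dense (dx, dy, mask) rasterization of the controls, and the flow-based subsampling during training are hypothetical choices that merely follow the spirit of the randomized, sparse conditioning described above.

```python
# Hypothetical sketch (not the paper's code): one plausible way to turn a
# sparse set of motion controls into a dense conditioning tensor, plus the
# kind of randomized, sparse sampling of controls the abstract mentions.
import numpy as np

def encode_sparse_motion(controls, height, width):
    """Rasterize sparse motion controls into an (H, W, 3) conditioning map.

    `controls` is a list of (x, y, dx, dy) tuples: a pixel location and the
    desired displacement there. Channels 0-1 hold the displacement and
    channel 2 is a binary mask marking where a control was given.
    """
    cond = np.zeros((height, width, 3), dtype=np.float32)
    for x, y, dx, dy in controls:
        cond[y, x, 0] = dx
        cond[y, x, 1] = dy
        cond[y, x, 2] = 1.0  # a control is present at this pixel
    return cond

def sample_training_controls(flow, num_controls_max=5, rng=None):
    """Randomly subsample controls from a dense optical-flow field (H, W, 2).

    During training, only a few flow vectors are kept (possibly none), so the
    model must learn to propagate sparse hints to whole objects on its own.
    """
    rng = rng or np.random.default_rng()
    h, w, _ = flow.shape
    num = rng.integers(0, num_controls_max + 1)  # may be 0: unconditional case
    ys = rng.integers(0, h, size=num)
    xs = rng.integers(0, w, size=num)
    controls = [(int(x), int(y), float(flow[y, x, 0]), float(flow[y, x, 1]))
                for x, y in zip(xs, ys)]
    return encode_sparse_motion(controls, h, w)

# Example: a single drag applied to a 64x64 frame.
cond = encode_sparse_motion([(32, 20, 5.0, -3.0)], 64, 64)
print(cond.shape, cond[20, 32])  # (64, 64, 3) [ 5. -3.  1.]
```

At inference time, a user-supplied drag could be encoded the same way and fed to the generator as an extra conditioning input alongside the first frame.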
Related papers
- ObjectMover: Generative Object Movement with Video Prior [69.75281888309017]
We present ObjectMover, a generative model that can perform object movement in challenging scenes.
We show that with this approach, our model is able to adjust to complex real-world scenarios.
We propose a multi-task learning strategy that enables training on real-world video data to improve the model generalization.
arXiv Detail & Related papers (2025-03-11T04:42:59Z)
- C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation [81.4106601222722]
Trajectory-based motion control has emerged as an intuitive and efficient approach for controllable video generation.
We propose a Chain-of-Thought-based motion controller for controllable video generation, named C-Drag.
Our method includes an object perception module and a Chain-of-Thought-based motion reasoning module.
arXiv Detail & Related papers (2025-02-27T08:21:03Z)
- Articulate That Object Part (ATOP): 3D Part Articulation via Text and Motion Personalization [9.231848716070257]
ATOP (Articulate That Object Part) is a novel few-shot method based on motion personalization to articulate a static 3D object.
We show that our method is capable of generating realistic motion videos and predicting 3D motion parameters in a more accurate and generalizable way.
arXiv Detail & Related papers (2025-02-11T05:47:16Z)
- InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor.
Our key insight is that large video generation models can act as both neural renderers and implicit physics simulators, having learned interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z)
- Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback [130.090296560882]
We investigate the use of feedback to enhance the object dynamics in text-to-video models.
We show that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions.
arXiv Detail & Related papers (2024-12-03T17:44:23Z)
- Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics [67.97235923372035]
We present Puppet-Master, an interactive video generative model that can serve as a motion prior for part-level dynamics.
At test time, given a single image and a sparse set of motion trajectories, Puppet-Master can synthesize a video depicting realistic part-level motion faithful to the given drag interactions.
arXiv Detail & Related papers (2024-08-08T17:59:38Z)
- EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone.
We show significant improvements in terms of both dynamic object and background reconstruction quality compared to the state-of-the-art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z)
- Purposer: Putting Human Motion Generation in Context [30.706219830149504]
We present a novel method to generate human motion to populate 3D indoor scenes.
It can be controlled with various combinations of conditioning signals such as a path in a scene, target poses, past motions, and scenes represented as 3D point clouds.
arXiv Detail & Related papers (2024-04-19T15:16:04Z)
- CAGE: Unsupervised Visual Composition and Animation for Controllable Video Generation [42.475807996071175]
We introduce an unsupervised approach to controllable and compositional video generation.
Our model is trained from scratch on a dataset of unannotated videos.
It can compose plausible novel scenes and animate objects by placing object parts at the desired locations in space and time.
arXiv Detail & Related papers (2024-03-21T12:50:15Z)
- CAGE: Controllable Articulation GEneration [14.002289666443529]
We leverage the interplay between part shape, connectivity, and motion using a denoising diffusion-based method.
Our method takes an object category label and a part connectivity graph as input and generates an object's geometry and motion parameters.
Our experiments show that our method outperforms the state-of-the-art in articulated object generation.
arXiv Detail & Related papers (2023-12-15T07:04:27Z)
- TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
The video sequences generated by TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z)
- ROAM: Robust and Object-Aware Motion Generation Using Neural Pose Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object.
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation from image generation methods, and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
- Understanding Object Dynamics for Interactive Image-to-Video Synthesis [8.17925295907622]
We present an approach that learns naturally-looking global articulations caused by a local manipulation at a pixel level.
Our generative model learns to infer natural object dynamics as a response to user interaction.
In contrast to existing work on video prediction, we do not synthesize arbitrary realistic videos.
arXiv Detail & Related papers (2021-06-21T17:57:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.