DIMO: Diverse 3D Motion Generation for Arbitrary Objects
- URL: http://arxiv.org/abs/2511.07409v1
- Date: Mon, 10 Nov 2025 18:56:49 GMT
- Title: DIMO: Diverse 3D Motion Generation for Arbitrary Objects
- Authors: Linzhan Mou, Jiahui Lei, Chen Wang, Lingjie Liu, Kostas Daniilidis
- Abstract summary: DIMO is a generative approach capable of generating diverse 3D motions for arbitrary objects from a single image. We leverage the rich priors in well-trained video models to extract common motion patterns. At inference time, the learned latent space lets us instantly sample diverse 3D motions in a single forward pass.
- Score: 57.14954351767432
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present DIMO, a generative approach capable of generating diverse 3D motions for arbitrary objects from a single image. The core idea of our work is to leverage the rich priors in well-trained video models to extract common motion patterns and embed them into a shared low-dimensional latent space. Specifically, we first generate multiple videos of the same object with diverse motions. We then embed each motion into a latent vector and train a shared motion decoder to learn the distribution of motions, represented in a structured and compact form as neural key point trajectories. The canonical 3D Gaussians are driven by these key points and fused to model the geometry and appearance. At inference time, the learned latent space lets us instantly sample diverse 3D motions in a single forward pass, supporting applications such as 3D motion interpolation and language-guided motion generation. Our project page is available at https://linzhanm.github.io/dimo.
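As a concrete illustration of the inference pipeline the abstract describes, here is a minimal sketch. All names and shapes (MotionDecoder, drive_gaussians, the MLP decoder, the blend weights) are hypothetical stand-ins, not the paper's actual code; in particular the key-point-driven deformation is written as a simple linear-blend-skinning-style update.

```python
# Minimal sketch of DIMO-style inference (hypothetical names and shapes;
# the decoder and driving scheme stand in for the paper's actual modules).
import torch
import torch.nn as nn

class MotionDecoder(nn.Module):
    """Shared decoder: latent motion code -> neural key-point trajectories."""
    def __init__(self, latent_dim=64, num_keypoints=32, num_frames=24):
        super().__init__()
        self.num_keypoints, self.num_frames = num_keypoints, num_frames
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, num_frames * num_keypoints * 3),
        )

    def forward(self, z):
        # (B, latent_dim) -> (B, T, K, 3) key-point trajectories
        return self.net(z).view(-1, self.num_frames, self.num_keypoints, 3)

def drive_gaussians(means, kps_t, kps_0, weights):
    # Deform canonical Gaussian centers by blending key-point displacements
    # (a linear-blend-skinning-style stand-in for the paper's driving scheme).
    return means + weights @ (kps_t - kps_0)    # (N, 3)

decoder = MotionDecoder()
z = torch.randn(1, 64)                          # sample from the learned latent space
traj = decoder(z)                               # (1, T, K, 3) in a single forward pass

means = torch.randn(1000, 3)                    # canonical 3D Gaussian centers
weights = torch.softmax(torch.randn(1000, 32), dim=-1)  # per-Gaussian blend weights
frame5 = drive_gaussians(means, traj[0, 5], traj[0, 0], weights)

# 3D motion interpolation then reduces to interpolating latent codes:
z_a, z_b = torch.randn(1, 64), torch.randn(1, 64)
traj_mid = decoder(0.5 * (z_a + z_b))
```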
Related papers
- UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework [54.337290937468175]
We propose UniMo, an autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework. We show that our method simultaneously generates corresponding videos and motions while performing accurate motion capture.
arXiv Detail & Related papers (2025-12-03T16:03:18Z) - MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps [31.864441290577545]
This paper addresses the challenge of learning semantically and functionally meaningful 3D motion priors from real-world videos. We propose a pixel-aligned Motion Map representation for 3D scene motion, which can be generated from existing generative image models.
arXiv Detail & Related papers (2025-10-13T07:56:19Z) - In-2-4D: Inbetweening from Two Single-View Images to 4D Generation [63.68181731564576]
We propose a new problem, In-2-4D, for generative 4D (i.e., 3D + motion) inbetweening from two single-view images. In contrast to video/4D generation from only text or a single image, our interpolative task can leverage more precise motion control to better constrain the generation.
arXiv Detail & Related papers (2025-04-11T09:01:09Z) - Recovering Dynamic 3D Sketches from Videos [30.87733869892925]
Liv3Stroke is a novel approach for abstracting objects in motion with deformable 3D strokes. We first extract noisy 3D point cloud motion guidance from video frames using semantic features. Our approach deforms a set of curves to abstract essential motion features as a set of explicit 3D representations.
arXiv Detail & Related papers (2025-03-26T08:43:21Z) - Articulate That Object Part (ATOP): 3D Part Articulation via Text and Motion Personalization [9.231848716070257]
ATOP (Articulate That Object Part) is a novel few-shot method based on motion personalization to articulate a static 3D object. We show that our method is capable of generating realistic motion videos and predicting 3D motion parameters in a more accurate and generalizable way.
arXiv Detail & Related papers (2025-02-11T05:47:16Z) - Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach [54.559847511280545]
We present a novel video generation framework that integrates 3D geometry and dynamic awareness. To achieve this, we augment 2D videos with 3D point trajectories and align them in pixel space. The resulting 3D-aware video dataset, PointVid, is then used to fine-tune a latent diffusion model.
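A minimal sketch of the kind of pixel-space alignment this describes (the shapes, camera intrinsics, and splatting scheme are assumptions for illustration): project tracked 3D point trajectories into the image and stack them with the RGB frames as extra channels.

```python
# Hedged sketch: align 3D point trajectories with video frames in pixel space
# (camera model and channel layout are illustrative assumptions).
import torch

F, H, W, N = 8, 64, 64, 500
video = torch.rand(F, 3, H, W)                   # RGB frames
pts3d = torch.rand(F, N, 3)                      # tracked 3D point trajectories
pts3d[..., 2] += 1.0                             # keep depth positive
K = torch.tensor([[60., 0., W / 2], [0., 60., H / 2], [0., 0., 1.]])

aligned = torch.zeros(F, 3, H, W)                # per-pixel 3D coords of projected points
for f in range(F):
    uvw = pts3d[f] @ K.T                         # pinhole projection
    uv = (uvw[:, :2] / uvw[:, 2:3]).long().clamp_(0, H - 1)
    aligned[f, :, uv[:, 1], uv[:, 0]] = pts3d[f].T   # splat 3D coords into the grid

conditioned = torch.cat([video, aligned], dim=1)     # (F, 6, H, W) input for fine-tuning
```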
arXiv Detail & Related papers (2025-02-05T21:49:06Z) - 3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation [83.98251722144195]
Previous methods for controllable video generation primarily leverage 2D control signals to manipulate object motions. We introduce 3DTrajMaster, a robust controller that regulates multi-entity dynamics in 3D space. We show that 3DTrajMaster sets a new state of the art in both accuracy and generalization for controlling multi-entity 3D motions.
arXiv Detail & Related papers (2024-12-10T18:55:13Z) - MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion [57.90404618420159]
We introduce Multi-view Ancestral Sampling (MAS), a method for 3D motion generation.
MAS works by simultaneously denoising multiple 2D motion sequences representing different views of the same 3D motion.
We demonstrate MAS on 2D pose data acquired from videos depicting professional basketball maneuvers.
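To make the multi-view sampling idea concrete, below is a hedged toy sketch: several 2D views are denoised in parallel, and at each step they are lifted to one shared 3D motion by least squares and reprojected so the views stay consistent. The stub denoiser and orthographic cameras are illustrative assumptions; the actual method uses a pretrained 2D motion diffusion model.

```python
# Toy sketch of multi-view ancestral sampling (assumed orthographic cameras
# and a stub denoiser in place of a trained 2D motion diffusion model).
import torch

T, J, V = 16, 17, 3                        # frames, joints, views
cams = torch.randn(V, 2, 3)                # orthographic projection matrices

def denoise_step(x2d, t):
    # Placeholder for one ancestral sampling step of a 2D motion diffusion model.
    return x2d - 0.01 * torch.randn_like(x2d)

def lift_and_reproject(x2d):
    # Solve for one shared 3D motion by least squares across all views,
    # then reproject it so every 2D view agrees with that 3D motion.
    A = cams.reshape(2 * V, 3)                          # stacked projections
    B = x2d.permute(0, 3, 1, 2).reshape(2 * V, T * J)   # stacked 2D observations
    X = torch.linalg.lstsq(A, B).solution               # (3, T*J) shared 3D motion
    proj = (A @ X).reshape(V, 2, T, J).permute(0, 2, 3, 1)
    return proj, X.reshape(3, T, J)

x2d = torch.randn(V, T, J, 2)              # start from noise in every view
for step in reversed(range(50)):
    x2d = denoise_step(x2d, step)
    x2d, x3d = lift_and_reproject(x2d)     # enforce cross-view 3D consistency
```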
arXiv Detail & Related papers (2023-10-23T09:05:18Z) - MotionBERT: A Unified Perspective on Learning Human Motion Representations [46.67364057245364]
We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale and heterogeneous data resources.
We propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy partial 2D observations.
We implement the motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network.
arXiv Detail & Related papers (2022-10-12T19:46:25Z)
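As a rough illustration of that pretraining stage (a plain MLP stands in for DSTformer; the data, noise, and masking scheme are made up for the example): corrupt and partially mask 2D keypoints, then train the encoder to regress the underlying 3D motion.

```python
# Hedged sketch of the 2D-to-3D recovery pretraining (an MLP stand-in for
# DSTformer; the data, noise, and masking scheme are illustrative).
import torch
import torch.nn as nn

T, J = 16, 17                                    # frames, joints
encoder = nn.Sequential(
    nn.Linear(T * J * 2, 512), nn.ReLU(), nn.Linear(512, T * J * 3)
)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

for _ in range(100):
    x3d = torch.randn(8, T, J, 3)                        # stand-in for 3D motion data
    x2d = x3d[..., :2] + 0.05 * torch.randn(8, T, J, 2)  # noisy 2D observations
    mask = (torch.rand(8, T, J, 1) > 0.15).float()       # drop joints: partial observations
    pred = encoder((x2d * mask).flatten(1)).view(8, T, J, 3)
    loss = ((pred - x3d) ** 2).mean()                    # recover 3D from noisy partial 2D
    opt.zero_grad()
    loss.backward()
    opt.step()
```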