Articulate That Object Part (ATOP): 3D Part Articulation from Text and Motion Personalization
- URL: http://arxiv.org/abs/2502.07278v1
- Date: Tue, 11 Feb 2025 05:47:16 GMT
- Title: Articulate That Object Part (ATOP): 3D Part Articulation from Text and Motion Personalization
- Authors: Aditya Vora, Sauradip Nag, Hao Zhang,
- Abstract summary: ATOP (Articulate That Object Part) is a novel method based on motion personalization to articulate a 3D object.
Text input allows us to tap into the power of modern-day video diffusion to generate plausible motion samples.
In turn, the input 3D object provides image prompting to personalize the generated video to that very object we wish to articulate.
- Score: 9.231848716070257
- License:
- Abstract: We present ATOP (Articulate That Object Part), a novel method based on motion personalization to articulate a 3D object with respect to a part and its motion as prescribed in a text prompt. Specifically, the text input allows us to tap into the power of modern-day video diffusion to generate plausible motion samples for the right object category and part. In turn, the input 3D object provides image prompting to personalize the generated video to that very object we wish to articulate. Our method starts with a few-shot finetuning for category-specific motion generation, a key first step to compensate for the lack of articulation awareness by current video diffusion models. For this, we finetune a pre-trained multi-view image generation model for controllable multi-view video generation, using a small collection of video samples obtained for the target object category. This is followed by motion video personalization that is realized by multi-view rendered images of the target 3D object. At last, we transfer the personalized video motion to the target 3D object via differentiable rendering to optimize part motion parameters by a score distillation sampling loss. We show that our method is capable of generating realistic motion videos and predict 3D motion parameters in a more accurate and generalizable way, compared to prior works.
Related papers
- VMC: Video Motion Customization using Temporal Attention Adaption for
Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
arXiv Detail & Related papers (2023-12-01T06:50:11Z) - Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer [27.278989809466392]
We present a new method for text-driven motion transfer - synthesizing a video that complies with an input text prompt describing the target objects and scene.
We leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors.
arXiv Detail & Related papers (2023-11-28T18:03:27Z) - ROAM: Robust and Object-Aware Motion Generation Using Neural Pose
Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object.
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z) - Delving into Motion-Aware Matching for Monocular 3D Object Tracking [81.68608983602581]
We find that the motion cue of objects along different time frames is critical in 3D multi-object tracking.
We propose MoMA-M3T, a framework that mainly consists of three motion-aware components.
We conduct extensive experiments on the nuScenes and KITTI datasets to demonstrate our MoMA-M3T achieves competitive performance against state-of-the-art methods.
arXiv Detail & Related papers (2023-08-22T17:53:58Z) - Temporal View Synthesis of Dynamic Scenes through 3D Object Motion
Estimation with Multi-Plane Images [8.185918509343816]
We study the problem of temporal view synthesis (TVS), where the goal is to predict the next frames of a video.
In this work, we consider the TVS of dynamic scenes in which both the user and objects are moving.
We predict the motion of objects by isolating and estimating the 3D object motion in the past frames and then extrapolating it.
arXiv Detail & Related papers (2022-08-19T17:40:13Z) - Motion-from-Blur: 3D Shape and Motion Estimation of Motion-blurred
Objects in Videos [115.71874459429381]
We propose a method for jointly estimating the 3D motion, 3D shape, and appearance of highly motion-blurred objects from a video.
Experiments on benchmark datasets demonstrate that our method outperforms previous methods for fast moving object deblurring and 3D reconstruction.
arXiv Detail & Related papers (2021-11-29T11:25:14Z) - NeuralDiff: Segmenting 3D objects that move in egocentric videos [92.95176458079047]
We study the problem of decomposing the observed 3D scene into a static background and a dynamic foreground.
This task is reminiscent of the classic background subtraction problem, but is significantly harder because all parts of the scene, static and dynamic, generate a large apparent motion.
In particular, we consider egocentric videos and further separate the dynamic component into objects and the actor that observes and moves them.
arXiv Detail & Related papers (2021-10-19T12:51:35Z) - Flow Guided Transformable Bottleneck Networks for Motion Retargeting [29.16125343915916]
Existing efforts leverage a long training video from each target person to train a subject-specific motion transfer model.
Few-shot motion transfer techniques, which only require one or a few images from a target, have recently drawn considerable attention.
Inspired by the Transformable Bottleneck Network, we propose an approach based on an implicit volumetric representation of the image content.
arXiv Detail & Related papers (2021-06-14T21:58:30Z) - First Order Motion Model for Image Animation [90.712718329677]
Image animation consists of generating a video sequence so that an object in a source image is animated according to the motion of a driving video.
Our framework addresses this problem without using any annotation or prior information about the specific object to animate.
arXiv Detail & Related papers (2020-02-29T07:08:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.