ROAM: Robust and Object-Aware Motion Generation Using Neural Pose
Descriptors
- URL: http://arxiv.org/abs/2308.12969v2
- Date: Thu, 15 Feb 2024 08:50:12 GMT
- Title: ROAM: Robust and Object-Aware Motion Generation Using Neural Pose
Descriptors
- Authors: Wanyue Zhang and Rishabh Dabral and Thomas Leimkühler and Vladislav
Golyanik and Marc Habermann and Christian Theobalt
- Abstract summary: This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object.
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
- Score: 73.26004792375556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing automatic approaches for 3D virtual character motion synthesis
supporting scene interactions do not generalise well to new objects outside
training distributions, even when trained on extensive motion capture datasets
with diverse objects and annotated interactions. This paper addresses this
limitation and shows that robustness and generalisation to novel scene objects
in 3D object-aware character synthesis can be achieved by training a motion
model with as few as one reference object. We leverage an implicit feature
representation trained on object-only datasets, which encodes an
SE(3)-equivariant descriptor field around the object. Given an unseen object
and a reference pose-object pair, we optimise for the object-aware pose that is
closest in the feature space to the reference pose. Finally, we use l-NSM,
i.e., our motion generation model that is trained to seamlessly transition from
locomotion to object interaction with the proposed bidirectional pose blending
scheme. Through comprehensive numerical comparisons to state-of-the-art methods
and in a user study, we demonstrate substantial improvements in 3D virtual
character motion and interaction quality and robustness to scenarios with
unseen objects. Our project page is available at
https://vcai.mpi-inf.mpg.de/projects/ROAM/.
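The pose search described above, finding the object-aware pose whose descriptor-field features around a new object best match those recorded for a reference pose-object pair, can be sketched as a small optimisation loop. The snippet below is a simplified illustration under stated assumptions, not the authors' code: `descriptor_field(obj_pts, query_pts)` stands in for a pretrained, differentiable SE(3)-equivariant descriptor network (hypothetical interface), and the character pose is reduced to a single rigid transform of a set of query points rather than a full-body pose.

```python
# Minimal sketch (not the authors' implementation) of optimising a pose in
# descriptor-feature space. `descriptor_field` is a hypothetical stand-in for
# a pretrained SE(3)-equivariant descriptor network; the pose is simplified to
# one rigid transform of the character's query points.
import torch


def axis_angle_to_matrix(aa: torch.Tensor) -> torch.Tensor:
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = torch.sqrt((aa * aa).sum() + 1e-12)  # eps keeps the gradient finite at zero
    k = aa / theta
    zero = torch.zeros((), dtype=aa.dtype)
    K = torch.stack([
        torch.stack([zero, -k[2], k[1]]),
        torch.stack([k[2], zero, -k[0]]),
        torch.stack([-k[1], k[0], zero]),
    ])
    return torch.eye(3, dtype=aa.dtype) + torch.sin(theta) * K \
        + (1.0 - torch.cos(theta)) * (K @ K)


def fit_pose_to_reference(descriptor_field, new_obj_pts, query_pts, ref_feat,
                          steps=300, lr=1e-2):
    """Find a rigid transform of `query_pts` whose descriptors around the new
    object are closest (L2) to `ref_feat`, the descriptors recorded for the
    reference pose-object pair."""
    aa = (1e-3 * torch.randn(3)).requires_grad_()  # rotation (axis-angle), near-identity init
    t = torch.zeros(3, requires_grad=True)         # translation
    opt = torch.optim.Adam([aa, t], lr=lr)
    for _ in range(steps):
        R = axis_angle_to_matrix(aa)
        posed = query_pts @ R.T + t                # rigidly posed query points
        feat = descriptor_field(new_obj_pts, posed)
        loss = torch.nn.functional.mse_loss(feat, ref_feat)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return aa.detach(), t.detach()
```

In the paper itself the optimisation variable is the full character pose and the features come from the SE(3)-equivariant descriptor field trained on object-only data; this sketch only conveys the feature-matching objective.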
Related papers
- LEIA: Latent View-invariant Embeddings for Implicit 3D Articulation [32.27869897947267]
We introduce LEIA, a novel approach for representing dynamic 3D objects.
Our method involves observing the object at distinct time steps or "states" and conditioning a hypernetwork on the current state.
By interpolating between these states, we can generate novel articulation configurations in 3D space that were previously unseen.
arXiv Detail & Related papers (2024-09-10T17:59:53Z) - HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and
Objects from Video [70.11702620562889]
HOLD is the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video.
We develop a compositional articulated implicit model that can disentangle the 3D hand and object from 2D images.
Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
arXiv Detail & Related papers (2023-11-30T10:50:35Z) - 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding [58.924180772480504]
3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description.
We propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3DRP-Net).
arXiv Detail & Related papers (2023-07-25T09:33:25Z) - Unsupervised Kinematic Motion Detection for Part-segmented 3D Shape
Collections [14.899075941080541]
We present an unsupervised approach for discovering articulated motions in a part-segmented 3D shape collection.
Our approach is based on a concept we call category closure: any valid articulation of an object's parts should keep the object in the same semantic category.
We evaluate our approach by using it to re-discover part motions from the PartNet-Mobility dataset.
arXiv Detail & Related papers (2022-06-17T00:50:36Z) - Neural Descriptor Fields: SE(3)-Equivariant Object Representations for
Manipulation [75.83319382105894]
We present Neural Descriptor Fields (NDFs), an object representation that encodes both points and relative poses between an object and a target.
NDFs are trained in a self-supervised fashion via a 3D auto-encoding task that does not rely on expert-labeled keypoints.
Our performance generalizes across both object instances and 6-DoF object poses, and significantly outperforms a recent baseline that relies on 2D descriptors.
arXiv Detail & Related papers (2021-12-09T18:57:15Z) - Estimating 3D Motion and Forces of Human-Object Interactions from
Internet Videos [49.52070710518688]
We introduce a method to reconstruct the 3D motion of a person interacting with an object from a single RGB video.
Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces on the human body.
arXiv Detail & Related papers (2021-11-02T13:40:18Z) - Object-Region Video Transformers [100.23380634952083]
We present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with object representations.
Our ORViT block consists of two object-level streams: appearance and dynamics.
We show strong improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture.
arXiv Detail & Related papers (2021-10-13T17:51:46Z) - 3D-OES: Viewpoint-Invariant Object-Factorized Environment Simulators [24.181604511269096]
We propose an action-conditioned dynamics model that predicts scene changes caused by object and agent interactions in a viewpoint-invariant 3D neural scene representation space.
In this space, objects do not interfere with one another and their appearance persists over time and across viewpoints.
We show that our model's predictions generalize well across varying numbers and appearances of interacting objects, as well as across camera viewpoints.
arXiv Detail & Related papers (2020-11-12T16:15:52Z) - A Deep Learning Approach to Object Affordance Segmentation [31.221897360610114]
We design an autoencoder that infers pixel-wise affordance labels in both videos and static images.
Our model removes the need for object labels and bounding boxes by using a soft-attention mechanism.
We show that our model achieves competitive results compared to strongly supervised methods on SOR3D-AFF.
arXiv Detail & Related papers (2020-04-18T15:34:41Z)