D$^3$Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable
Robotic Manipulation
- URL: http://arxiv.org/abs/2309.16118v2
- Date: Sun, 8 Oct 2023 21:17:51 GMT
- Title: D$^3$Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable
Robotic Manipulation
- Authors: Yixuan Wang, Zhuoran Li, Mingtong Zhang, Katherine Driggs-Campbell,
Jiajun Wu, Li Fei-Fei, Yunzhu Li
- Abstract summary: We introduce D$^3$Fields - dynamic 3D descriptor fields.
These fields capture the dynamics of the underlying 3D environment and encode both semantic features and instance masks.
We show that D$3$Fields are both generalizable and effective for zero-shot robotic manipulation tasks.
- Score: 34.31127678066616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene representation has been a crucial design choice in robotic manipulation
systems. An ideal representation should be 3D, dynamic, and semantic to meet
the demands of diverse manipulation tasks. However, previous works often lack
all three properties simultaneously. In this work, we introduce D$^3$Fields -
dynamic 3D descriptor fields. These fields capture the dynamics of the
underlying 3D environment and encode both semantic features and instance masks.
Specifically, we project arbitrary 3D points in the workspace onto multi-view
2D visual observations and interpolate features derived from foundation
models. The resulting fused descriptor fields allow for flexible goal
specifications using 2D images with varied contexts, styles, and instances. To
evaluate the effectiveness of these descriptor fields, we apply our
representation to a wide range of robotic manipulation tasks in a zero-shot
manner. Through extensive evaluation in both real-world scenarios and
simulations, we demonstrate that D$^3$Fields are both generalizable and
effective for zero-shot robotic manipulation tasks. In quantitative comparisons
with state-of-the-art dense descriptors, such as Dense Object Nets and DINO,
D$^3$Fields exhibit significantly better generalization abilities and
manipulation accuracy.
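The core mechanism described in the abstract is a projection-and-interpolation step: arbitrary 3D workspace points are projected into each calibrated camera view, per-view 2D features (e.g., from a foundation model such as DINO) are bilinearly interpolated at the projected pixels, and the results are fused across views. Below is a minimal sketch of that step, assuming calibrated intrinsics/extrinsics and precomputed per-view feature maps are given; the function names, the simple in-front-of-camera visibility check, and the plain averaging across views are illustrative assumptions, not the paper's exact implementation (which also handles dynamics and instance masks).

# Minimal sketch: fuse per-view 2D features into descriptors for 3D points.
import numpy as np

def project_points(points_w, K, T_cw):
    # Project Nx3 world points with intrinsics K (3x3) and world-to-camera
    # extrinsics T_cw (4x4); return Nx2 pixel coordinates and camera depth.
    pts_h = np.concatenate([points_w, np.ones((len(points_w), 1))], axis=1)
    pts_c = (T_cw @ pts_h.T).T[:, :3]
    depth = pts_c[:, 2]
    uvw = (K @ pts_c.T).T
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)
    return uv, depth

def bilinear_sample(feat_map, uv):
    # Bilinearly sample an HxWxC feature map at Nx2 pixel locations.
    H, W, _ = feat_map.shape
    u = np.clip(uv[:, 0], 0, W - 1 - 1e-6)
    v = np.clip(uv[:, 1], 0, H - 1 - 1e-6)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    f00, f01 = feat_map[v0, u0], feat_map[v0, u0 + 1]
    f10, f11 = feat_map[v0 + 1, u0], feat_map[v0 + 1, u0 + 1]
    return (f00 * (1 - du) * (1 - dv) + f01 * du * (1 - dv)
            + f10 * (1 - du) * dv + f11 * du * dv)

def fuse_descriptors(points_w, views):
    # views: list of dicts with keys "K", "T_cw", "features" (HxWxC array).
    # Returns NxC descriptors averaged over views where a point projects
    # in front of the camera and inside the image.
    acc, weight = None, np.zeros((len(points_w), 1))
    for view in views:
        uv, depth = project_points(points_w, view["K"], view["T_cw"])
        H, W, C = view["features"].shape
        valid = (depth > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W) \
                & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        feats = bilinear_sample(view["features"], uv)
        if acc is None:
            acc = np.zeros((len(points_w), C))
        acc[valid] += feats[valid]
        weight[valid] += 1.0
    return acc / np.clip(weight, 1.0, None)

A usage call would look like fuse_descriptors(points, [{"K": K1, "T_cw": T1, "features": F1}, ...]) with one dict per camera; the full method additionally tracks the resulting fields over time and attaches instance masks, which this sketch omits.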
Related papers
- DGD: Dynamic 3D Gaussians Distillation [14.7298711927857]
We tackle the task of learning dynamic 3D semantic radiance fields given a single monocular video as input.
Our learned semantic radiance field captures per-point semantics as well as color and geometric properties for a dynamic 3D scene.
We present DGD, a unified 3D representation for both the appearance and semantics of a dynamic 3D scene.
arXiv Detail & Related papers (2024-05-29T17:52:22Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation [96.58789785954409]
We propose a practical and efficient 3D representation that incorporates an equivariant radiance field with the guidance of a bird's-eye view map.
We produce large-scale, even infinite-scale, 3D scenes by synthesizing local scenes and then stitching them together with smooth consistency.
arXiv Detail & Related papers (2023-12-04T18:56:10Z)
- Generating Visual Spatial Description via Holistic 3D Scene Understanding [88.99773815159345]
Visual spatial description (VSD) aims to generate texts that describe the spatial relations of the given objects within images.
With an external 3D scene extractor, we obtain the 3D objects and scene features for input images.
We construct a target object-centered 3D spatial scene graph (Go3D-S2G), such that we model the spatial semantics of target objects within the holistic 3D scenes.
arXiv Detail & Related papers (2023-05-19T15:53:56Z)
- Exploiting the Complementarity of 2D and 3D Networks to Address Domain-Shift in 3D Semantic Segmentation [14.30113021974841]
3D semantic segmentation is a critical task in many real-world applications, such as autonomous driving, robotics, and mixed reality.
A possible solution is to combine the 3D information with information from sensors of a different modality, such as RGB cameras.
Recent multi-modal 3D semantic segmentation networks exploit these modalities by relying on two branches that process the 2D and 3D information independently.
arXiv Detail & Related papers (2023-04-06T10:59:43Z)
- CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn transferable 3D point cloud representations in realistic scenarios.
Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios, and build well-aligned and instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z)
- MvDeCor: Multi-view Dense Correspondence Learning for Fine-grained 3D Segmentation [91.6658845016214]
We propose to utilize self-supervised techniques in the 2D domain for fine-grained 3D shape segmentation tasks.
We render a 3D shape from multiple views, and set up a dense correspondence learning task within the contrastive learning framework.
As a result, the learned 2D representations are view-invariant and geometrically consistent.
arXiv Detail & Related papers (2022-08-18T00:48:15Z)
- Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation [75.83319382105894]
We present Neural Descriptor Fields (NDFs), an object representation that encodes both points and relative poses between an object and a target.
NDFs are trained in a self-supervised fashion via a 3D auto-encoding task that does not rely on expert-labeled keypoints.
NDFs generalize across both object instances and 6-DoF object poses, and significantly outperform a recent baseline that relies on 2D descriptors.
arXiv Detail & Related papers (2021-12-09T18:57:15Z)
- CoCoNets: Continuous Contrastive 3D Scene Representations [21.906643302668716]
This paper explores self-supervised learning of amodal 3D feature representations from RGB and RGB-D posed images and videos.
We show the resulting 3D visual feature representations effectively scale across objects and scenes, imagine information occluded or missing from the input viewpoints, track objects over time, align semantically related objects in 3D, and improve 3D object detection.
arXiv Detail & Related papers (2021-04-08T15:50:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.