Unsupervised Learning of Visual 3D Keypoints for Control
- URL: http://arxiv.org/abs/2106.07643v1
- Date: Mon, 14 Jun 2021 17:59:59 GMT
- Title: Unsupervised Learning of Visual 3D Keypoints for Control
- Authors: Boyuan Chen, Pieter Abbeel, Deepak Pathak
- Abstract summary: Learning sensorimotor control policies from high-dimensional images crucially relies on the quality of the underlying visual representations.
We propose a framework to learn such a 3D geometric structure directly from images in an end-to-end unsupervised manner.
These discovered 3D keypoints tend to meaningfully capture robot joints as well as object movements in a consistent manner across both time and 3D space.
- Score: 104.92063943162896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning sensorimotor control policies from high-dimensional images crucially
relies on the quality of the underlying visual representations. Prior works
show that structured latent spaces such as visual keypoints often outperform
unstructured representations for robotic control. However, most of these
representations, whether structured or unstructured, are learned in a 2D space
even though the control tasks are usually performed in a 3D environment. In
this work, we propose a framework to learn such a 3D geometric structure
directly from images in an end-to-end unsupervised manner. The input images are
embedded into latent 3D keypoints via a differentiable encoder which is trained
to optimize both a multi-view consistency loss and downstream task objective.
These discovered 3D keypoints tend to meaningfully capture robot joints as well
as object movements in a consistent manner across both time and 3D space. The
proposed approach outperforms prior state-of-the-art methods across a variety of
reinforcement learning benchmarks. Code and videos at
https://buoyancy99.github.io/unsup-3d-keypoints/
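The abstract above describes a differentiable encoder that embeds images into latent 3D keypoints and is trained jointly on a multi-view consistency loss and the downstream task objective. As a rough illustration only, the PyTorch sketch below shows one plausible shape of such a setup, assuming per-view keypoints are predicted in camera coordinates and mapped into a shared world frame with known extrinsics; the tiny encoder and the names `KeypointEncoder`, `to_world`, and `multiview_consistency_loss` are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (PyTorch) of latent 3D keypoints with a multi-view
# consistency loss. The encoder, the camera handling, and all names here
# are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class KeypointEncoder(nn.Module):
    """Maps an RGB image to K latent 3D keypoints in the camera frame."""
    def __init__(self, num_keypoints: int = 8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_keypoints * 3)   # (x, y, depth) per keypoint
        self.num_keypoints = num_keypoints

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(img)).view(-1, self.num_keypoints, 3)

def to_world(kp_cam: torch.Tensor, cam_to_world: torch.Tensor) -> torch.Tensor:
    """Map camera-frame keypoints to the world frame with a 4x4 extrinsic."""
    ones = torch.ones(*kp_cam.shape[:-1], 1, dtype=kp_cam.dtype)
    kp_h = torch.cat([kp_cam, ones], dim=-1)            # homogeneous coordinates
    return (kp_h @ cam_to_world.transpose(-1, -2))[..., :3]

def multiview_consistency_loss(kp_world_per_view) -> torch.Tensor:
    """Penalize disagreement of world-frame keypoints across camera views."""
    stacked = torch.stack(kp_world_per_view)            # (views, B, K, 3)
    return ((stacked - stacked.mean(dim=0, keepdim=True)) ** 2).mean()

# Two synchronized camera views of the same scene (random placeholders).
img_a, img_b = torch.rand(2, 4, 3, 64, 64)
T_a, T_b = torch.eye(4), torch.eye(4)                   # placeholder extrinsics
enc = KeypointEncoder()
kp_a, kp_b = to_world(enc(img_a), T_a), to_world(enc(img_b), T_b)
consistency = multiview_consistency_loss([kp_a, kp_b])
# In end-to-end training this term would be added to the downstream
# (e.g. reinforcement learning) objective.
```

In end-to-end training, a consistency term of this kind keeps the keypoints geometrically grounded across views while the task objective keeps them useful for control.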
Related papers
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- 3D Implicit Transporter for Temporally Consistent Keypoint Discovery [45.152790256675964]
Keypoint-based representation has proven advantageous in various visual and robotic tasks.
The Transporter method, which reconstructs the target frame from the source frame to incorporate both spatial and temporal information, was originally introduced for 2D data.
We propose the first 3D version of the Transporter, which leverages hybrid 3D representation, cross attention, and implicit reconstruction.
arXiv Detail & Related papers (2023-09-10T17:59:48Z)
- Multiview Compressive Coding for 3D Reconstruction [77.95706553743626]
We introduce a simple framework that operates on 3D points of single objects or whole scenes.
Our model, Multiview Compressive Coding, learns to compress the input appearance and geometry to predict the 3D structure.
arXiv Detail & Related papers (2023-01-19T18:59:52Z)
- BKinD-3D: Self-Supervised 3D Keypoint Discovery from Multi-View Videos [38.16427363571254]
We propose a new method to perform self-supervised keypoint discovery in 3D from multi-view videos of behaving agents.
Our method, BKinD-3D, uses an encoder-decoder architecture with a 3D volumetric heatmap, trained to reconstruct differences across multiple views.
arXiv Detail & Related papers (2022-12-14T18:34:29Z)
- Visual Reinforcement Learning with Self-Supervised 3D Representations [15.991546692872841]
We present a unified framework for self-supervised learning of 3D representations for motor control.
Our method enjoys improved sample efficiency in simulated manipulation tasks compared to 2D representation learning methods.
arXiv Detail & Related papers (2022-10-13T17:59:55Z)
- PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition [55.38462937452363]
We propose a unified multi-view cross-modal distillation architecture, including a pretrained deep image encoder as the teacher and a deep point encoder as the student.
By pairwise aligning multi-view visual and geometric descriptors, we can obtain more powerful deep point encoders without exhaustive and complicated network modifications (a rough sketch of this alignment idea appears after this list).
arXiv Detail & Related papers (2022-07-07T07:23:20Z)
- End-to-End Learning of Multi-category 3D Pose and Shape Estimation [128.881857704338]
We propose an end-to-end method that simultaneously detects 2D keypoints from an image and lifts them to 3D.
The proposed method learns both 2D detection and 3D lifting only from 2D keypoints annotations.
In addition to being end-to-end in image to 3D learning, our method also handles objects from multiple categories using a single neural network.
arXiv Detail & Related papers (2021-12-19T17:10:40Z)
- Understanding Pixel-level 2D Image Semantics with 3D Keypoint Knowledge Engine [56.09471066808409]
We propose a new method for predicting image-corresponding semantics in the 3D domain and then projecting them back onto 2D images to achieve pixel-level understanding.
We build a large scale keypoint knowledge engine called KeypointNet, which contains 103,450 keypoints and 8,234 3D models from 16 object categories.
arXiv Detail & Related papers (2021-11-21T13:25:20Z)
- Voxel-based 3D Detection and Reconstruction of Multiple Objects from a Single Image [22.037472446683765]
We learn a regular grid of 3D voxel features from the input image, aligned with the 3D scene space via a 3D feature lifting operator.
Based on the 3D voxel features, our novel CenterNet-3D detection head formulates the 3D detection as keypoint detection in the 3D space.
We devise an efficient coarse-to-fine reconstruction module, including coarse-level voxelization and a novel local PCA-SDF shape representation.
arXiv Detail & Related papers (2021-11-04T18:30:37Z)
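Several of the entries above revolve around aligning representations across views and modalities. For the PointMCD-style cross-modal distillation in particular, the sketch below shows one generic way a point-cloud student's descriptor could be aligned with per-view descriptors from a frozen image teacher; the tiny PointNet-style encoder and the cosine alignment loss are assumptions for illustration, not the paper's actual architecture or objective.

```python
# Illustrative sketch of multi-view cross-modal distillation: a point-cloud
# student is trained so its global descriptor agrees with per-view descriptors
# from a frozen image teacher. Architectures and loss are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointEncoder(nn.Module):
    """Tiny PointNet-style student: shared per-point MLP, then max pooling."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, pts: torch.Tensor) -> torch.Tensor:   # pts: (B, N, 3)
        return self.mlp(pts).max(dim=1).values              # global descriptor (B, dim)

def distill_loss(student_feat: torch.Tensor, teacher_feats) -> torch.Tensor:
    """Average cosine-alignment loss against every view's teacher descriptor."""
    losses = [1.0 - F.cosine_similarity(student_feat, t, dim=-1).mean()
              for t in teacher_feats]
    return torch.stack(losses).mean()

student = PointEncoder()
points = torch.rand(4, 1024, 3)                      # a batch of 4 point clouds
# Placeholder teacher descriptors for 6 rendered views (normally produced by a
# frozen, pretrained image encoder applied to renderings of the same shapes).
teacher_feats = [torch.rand(4, 128) for _ in range(6)]
loss = distill_loss(student(points), teacher_feats)
```

In practice the teacher descriptors would come from a pretrained image encoder applied to rendered views of each shape; here they are random placeholders so the snippet runs standalone.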
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.