On the Efficacy of 3D Point Cloud Reinforcement Learning
- URL: http://arxiv.org/abs/2306.06799v1
- Date: Sun, 11 Jun 2023 22:52:08 GMT
- Title: On the Efficacy of 3D Point Cloud Reinforcement Learning
- Authors: Zhan Ling, Yunchao Yao, Xuanlin Li, Hao Su
- Abstract summary: We focus on 3D point clouds, one of the most common forms of 3D representations.
We systematically investigate design choices for 3D point cloud RL, leading to the development of a robust algorithm for various robotic manipulation and control tasks.
We find that 3D point cloud RL can significantly outperform the 2D counterpart when agent-object / object-object relationship encoding is a key factor.
- Score: 20.4424883945357
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies on visual reinforcement learning (visual RL) have explored the
use of 3D visual representations. However, none of these work has
systematically compared the efficacy of 3D representations with 2D
representations across different tasks, nor have they analyzed 3D
representations from the perspective of agent-object / object-object
relationship reasoning. In this work, we seek answers to the question of when
and how do 3D neural networks that learn features in the 3D-native space
provide a beneficial inductive bias for visual RL. We specifically focus on 3D
point clouds, one of the most common forms of 3D representations. We
systematically investigate design choices for 3D point cloud RL, leading to the
development of a robust algorithm for various robotic manipulation and control
tasks. Furthermore, through comparisons between 2D image vs 3D point cloud RL
methods on both minimalist synthetic tasks and complex robotic manipulation
tasks, we find that 3D point cloud RL can significantly outperform the 2D
counterpart when agent-object / object-object relationship encoding is a key
factor.
Related papers
- BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence [11.91274849875519]
We introduce a novel image-centric 3D perception model, BIP3D, to overcome the limitations of point-centric methods.
We leverage pre-trained 2D vision foundation models to enhance semantic understanding, and introduce a spatial enhancer module to improve spatial understanding.
In our experiments, BIP3D outperforms current state-of-the-art results on the EmbodiedScan benchmark, achieving improvements of 5.69% in the 3D detection task and 15.25% in the 3D visual grounding task.
arXiv Detail & Related papers (2024-11-22T11:35:42Z) - SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z) - Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation [18.964403296437027]
Act3D represents the robot's workspace using a 3D feature field with adaptive resolutions dependent on the task at hand.
It samples 3D point grids in a coarse to fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling.
arXiv Detail & Related papers (2023-06-30T17:34:06Z) - Multi-View Representation is What You Need for Point-Cloud Pre-Training [22.55455166875263]
This paper proposes a novel approach to point-cloud pre-training that learns 3D representations by leveraging pre-trained 2D networks.
We train the 3D feature extraction network with the help of the novel 2D knowledge transfer loss.
Experimental results demonstrate that our pre-trained model can be successfully transferred to various downstream tasks.
arXiv Detail & Related papers (2023-06-05T03:14:54Z) - CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World
Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$2$) to learn the transferable 3D point cloud representation in realistic scenarios.
Specifically, we exploit naturally-existed correspondences in 2D and 3D scenarios, and build well-aligned and instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z) - Deep Generative Models on 3D Representations: A Survey [81.73385191402419]
Generative models aim to learn the distribution of observed data by generating new instances.
Recently, researchers started to shift focus from 2D to 3D space.
representing 3D data poses significantly greater challenges.
arXiv Detail & Related papers (2022-10-27T17:59:50Z) - Voxel-based 3D Detection and Reconstruction of Multiple Objects from a
Single Image [22.037472446683765]
We learn a regular grid of 3D voxel features from the input image which is aligned with 3D scene space via a 3D feature lifting operator.
Based on the 3D voxel features, our novel CenterNet-3D detection head formulates the 3D detection as keypoint detection in the 3D space.
We devise an efficient coarse-to-fine reconstruction module, including coarse-level voxelization and a novel local PCA-SDF shape representation.
arXiv Detail & Related papers (2021-11-04T18:30:37Z) - Spatio-temporal Self-Supervised Representation Learning for 3D Point
Clouds [96.9027094562957]
We introduce a-temporal representation learning framework, capable of learning from unlabeled tasks.
Inspired by how infants learn from visual data in the wild, we explore rich cues derived from the 3D data.
STRL takes two temporally-related frames from a 3D point cloud sequence as the input, transforms it with the spatial data augmentation, and learns the invariant representation self-supervisedly.
arXiv Detail & Related papers (2021-09-01T04:17:11Z) - Unsupervised Learning of Visual 3D Keypoints for Control [104.92063943162896]
Learning sensorimotor control policies from high-dimensional images crucially relies on the quality of the underlying visual representations.
We propose a framework to learn such a 3D geometric structure directly from images in an end-to-end unsupervised manner.
These discovered 3D keypoints tend to meaningfully capture robot joints as well as object movements in a consistent manner across both time and 3D space.
arXiv Detail & Related papers (2021-06-14T17:59:59Z) - 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that enables us to leverage 3D features extracted from large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during the training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z) - Self-supervised Feature Learning by Cross-modality and Cross-view
Correspondences [32.01548991331616]
This paper presents a novel self-supervised learning approach to learn both 2D image features and 3D point cloud features.
It exploits cross-modality and cross-view correspondences without using any annotated human labels.
The effectiveness of the learned 2D and 3D features is evaluated by transferring them on five different tasks.
arXiv Detail & Related papers (2020-04-13T02:57:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.