PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation
- URL: http://arxiv.org/abs/2309.15596v1
- Date: Wed, 27 Sep 2023 11:50:43 GMT
- Title: PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation
- Authors: Shizhe Chen, Ricardo Garcia, Cordelia Schmid, Ivan Laptev
- Abstract summary: PolarNet is a 3D point cloud based policy for language-guided manipulation.
It learns 3D point cloud representations and integrates them with language instructions for action prediction.
It outperforms state-of-the-art 2D and 3D approaches in both single-task and multi-task learning.
- Score: 93.46306666726969
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability for robots to comprehend and execute manipulation tasks based on
natural language instructions is a long-term goal in robotics. The dominant
approaches for language-guided manipulation use 2D image representations, which
face difficulties in combining multi-view cameras and inferring precise 3D
positions and relationships. To address these limitations, we propose a 3D
point cloud based policy called PolarNet for language-guided manipulation. It
leverages carefully designed point cloud inputs, efficient point cloud
encoders, and multimodal transformers to learn 3D point cloud representations
and integrate them with language instructions for action prediction. PolarNet
is shown to be effective and data efficient in a variety of experiments
conducted on the RLBench benchmark. It outperforms state-of-the-art 2D and 3D
approaches in both single-task and multi-task learning. It also achieves
promising results on a real robot.
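The abstract names the main components without detailing how they connect. The sketch below is a minimal, assumption-laden illustration of a per-point encoder, a language projection, and a multimodal transformer feeding an action head; the layer choices, dimensions, and 8-dimensional action parameterization (position, quaternion, gripper state) are hypothetical and not the authors' implementation.
```python
# Illustrative sketch only: module names, dimensions, and the choice of a simple
# per-point MLP encoder and precomputed language tokens are assumptions, not the
# exact PolarNet architecture.
import torch
import torch.nn as nn

class PointCloudLanguagePolicy(nn.Module):
    def __init__(self, point_feat_dim=128, lang_dim=512, d_model=256, n_actions=8):
        super().__init__()
        # Per-point encoder: lifts (xyz + rgb) points to local features.
        self.point_encoder = nn.Sequential(
            nn.Linear(6, point_feat_dim), nn.ReLU(),
            nn.Linear(point_feat_dim, d_model),
        )
        # Project precomputed language token embeddings into the same space.
        self.lang_proj = nn.Linear(lang_dim, d_model)
        # Multimodal transformer fuses point tokens with instruction tokens.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        # Predict an action vector, e.g. 3D position + quaternion + gripper open/close.
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, points, lang_tokens):
        # points: (B, N, 6) xyz+rgb; lang_tokens: (B, L, lang_dim)
        pt = self.point_encoder(points)                  # (B, N, d_model)
        lt = self.lang_proj(lang_tokens)                 # (B, L, d_model)
        tokens = torch.cat([pt, lt], dim=1)              # joint point + language sequence
        fused = self.fusion(tokens)                      # (B, N+L, d_model)
        pooled = fused[:, :points.size(1)].mean(dim=1)   # pool over point tokens
        return self.action_head(pooled)                  # (B, n_actions)

# Usage with random inputs: batch of 2 scenes, 1024 points, 16 language tokens.
policy = PointCloudLanguagePolicy()
action = policy(torch.randn(2, 1024, 6), torch.randn(2, 16, 512))
print(action.shape)  # torch.Size([2, 8])
```
In practice the point features would come from a dedicated point cloud backbone and the language tokens from a pretrained text encoder; the random tensors here only show the expected shapes.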
Related papers
- VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using 3D affordances learned from in-the-wild monocular RGB-only human videos.
VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z) - 3D-MVP: 3D Multiview Pretraining for Robotic Manipulation [53.45111493465405]
We propose 3D-MVP, a novel approach for 3D multi-view pretraining using masked autoencoders.
We leverage the Robotic View Transformer (RVT), which uses a multi-view transformer to understand the 3D scene and predict pose actions.
We show promising results on a real robot platform with minimal finetuning.
arXiv Detail & Related papers (2024-06-26T08:17:59Z) - Transcrib3D: 3D Referring Expression Resolution through Large Language Models [28.121606686759225]
We introduce Transcrib3D, an approach that brings together 3D detection methods and the emergent reasoning capabilities of large language models.
Transcrib3D achieves state-of-the-art results on 3D reference resolution benchmarks.
We show that our method enables a real robot to perform pick-and-place tasks given queries that contain challenging referring expressions.
arXiv Detail & Related papers (2024-04-30T02:48:20Z) - SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z) - PointLLM: Empowering Large Language Models to Understand Point Clouds [63.39876878899682]
PointLLM understands colored object point clouds with human instructions.
It generates contextually appropriate responses, illustrating its grasp of point clouds and common sense.
arXiv Detail & Related papers (2023-08-31T17:59:46Z) - Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation [18.964403296437027]
Act3D represents the robot's workspace using a 3D feature field with adaptive resolutions dependent on the task at hand.
It samples 3D point grids in a coarse to fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling.
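The entry above compresses the sampling procedure into one sentence. Below is a loose sketch of coarse-to-fine candidate selection under stated assumptions: the scoring function is a hypothetical stand-in, and the relative-position attention Act3D uses to featurize candidates against scene features is not reproduced.
```python
# Loose illustration of coarse-to-fine 3D point sampling, not the Act3D model.
# Each round scores a grid of candidate points, recenters on the best one, and
# shrinks the grid to focus the next round of sampling.
import numpy as np

def make_grid(center, half_extent, n=5):
    """Regular n^3 grid of candidate 3D points around `center`."""
    axes = [np.linspace(c - half_extent, c + half_extent, n) for c in center]
    return np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)

def coarse_to_fine_select(score_fn, workspace_center, workspace_half_extent,
                          rounds=3, shrink=0.5):
    """Sample, score, recenter, and shrink for a fixed number of rounds."""
    center = np.asarray(workspace_center, dtype=float)
    half = float(workspace_half_extent)
    for _ in range(rounds):
        candidates = make_grid(center, half)
        scores = score_fn(candidates)      # higher = more likely action location
        center = candidates[np.argmax(scores)]
        half *= shrink                     # focus the next round of sampling
    return center

# Toy usage: hypothetical target at (0.2, -0.1, 0.5); score = negative distance.
target = np.array([0.2, -0.1, 0.5])
score_fn = lambda pts: -np.linalg.norm(pts - target, axis=1)
print(coarse_to_fine_select(score_fn, [0.0, 0.0, 0.5], 0.5))
```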
arXiv Detail & Related papers (2023-06-30T17:34:06Z) - CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World
Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn transferable 3D point cloud representations in realistic scenarios.
Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios and build well-aligned, instance-based text-image-point proxies from those complex scenarios.
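The summary mentions text-image-point proxies but not the training objective. A minimal sketch of a symmetric InfoNCE-style contrastive loss aligning the three modalities is given below, assuming precomputed per-instance embeddings; the proxy construction from real-world scenes is not shown, and this loss form is an illustration rather than the paper's exact formulation.
```python
# Minimal sketch of a contrastive objective over aligned text/image/point embeddings.
# Only the symmetric InfoNCE-style alignment between modalities is illustrated.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss: matched rows of `a` and `b` are positives."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # diagonal = positive pairs
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def triple_contrastive_loss(text_emb, image_emb, point_emb):
    """Align point cloud features with both text and image features."""
    return info_nce(point_emb, text_emb) + info_nce(point_emb, image_emb)

# Toy usage: batch of 4 aligned (text, image, point) embedding triples.
t, i, p = (torch.randn(4, 256) for _ in range(3))
print(triple_contrastive_loss(t, i, p).item())
```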
arXiv Detail & Related papers (2023-03-22T09:32:45Z) - Learning 6-DoF Fine-grained Grasp Detection Based on Part Affordance Grounding [42.04502185508723]
We propose a new large Language-guided SHape grAsPing datasEt to promote 3D part-level affordance and grasping ability learning.
From the perspective of robotic cognition, we design a two-stage fine-grained robotic grasping framework (named LangPartGPD).
Our method combines the advantages of human-robot collaboration and large language models (LLMs).
Results show our method achieves competitive performance in 3D geometry fine-grained grounding, object affordance inference, and 3D part-aware grasping tasks.
arXiv Detail & Related papers (2023-01-27T07:00:54Z) - Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for
3D Visual Grounding [23.672405624011873]
We propose a module that consolidates the 3D visual stream with 2D clues synthesized from point clouds.
We empirically show that these synthesized clues boost the quality of the learned visual representations.
Our proposed module, dubbed Look Around and Refer (LAR), significantly outperforms state-of-the-art 3D visual grounding techniques on three benchmarks.
arXiv Detail & Related papers (2022-11-25T17:12:08Z) - Unsupervised Learning of Fine Structure Generation for 3D Point Clouds
by 2D Projection Matching [66.98712589559028]
We propose an unsupervised approach for 3D point cloud generation with fine structures.
Our method can recover fine 3D structures from 2D silhouette images at different resolutions.
arXiv Detail & Related papers (2021-08-08T22:15:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.