Extracting Zero-shot Common Sense from Large Language Models for Robot 3D Scene Understanding
- URL: http://arxiv.org/abs/2206.04585v1
- Date: Thu, 9 Jun 2022 16:05:35 GMT
- Title: Extracting Zero-shot Common Sense from Large Language Models for Robot 3D Scene Understanding
- Authors: William Chen, Siyi Hu, Rajat Talak, Luca Carlone
- Abstract summary: We introduce a novel method for leveraging common sense embedded within large language models for labelling rooms.
The proposed algorithm operates on 3D scene graphs produced by modern spatial perception systems.
- Score: 25.270772036342688
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semantic 3D scene understanding is a problem of critical importance in
robotics. While significant advances have been made in simultaneous
localization and mapping algorithms, robots are still far from having the
common sense knowledge that an average human has about household objects and
their locations. We introduce a novel method for leveraging common sense embedded
within large language models for labelling rooms given the objects contained
within. This algorithm has the added benefits of (i) requiring no task-specific
pre-training (operating entirely in the zero-shot regime) and (ii) generalizing
to arbitrary room and object labels, including previously-unseen ones -- both
of which are highly desirable traits in robotic scene understanding algorithms.
The proposed algorithm operates on 3D scene graphs produced by modern spatial
perception systems, and we hope it will pave the way to more generalizable and
scalable high-level 3D scene understanding for robotics.
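To make the idea concrete, below is a minimal sketch of how such zero-shot room labelling could look in practice, assuming an off-the-shelf causal language model is used to score candidate room labels against a prompt built from the objects a scene graph assigns to a room. The model choice (gpt2), prompt template, helper names, and label set are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (illustrative, not the authors' implementation): score candidate
# room labels with an off-the-shelf causal language model and pick the label whose
# prompt is most plausible given the objects found in the room.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # model choice is an assumption
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def sentence_logprob(text: str) -> float:
    """Total log-probability of a sentence under the language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token;
    # multiply by the number of predicted positions to undo the averaging.
    return -out.loss.item() * (ids.shape[1] - 1)


def label_room(objects: list[str], candidate_labels: list[str]) -> str:
    """Zero-shot room labelling: return the most plausible candidate label.

    Note: no length normalization is applied, so longer labels are slightly
    penalized; acceptable for a sketch, worth handling in practice.
    """
    object_list = ", ".join(objects)
    scores = {
        label: sentence_logprob(f"A room containing {object_list} is called a {label}.")
        for label in candidate_labels
    }
    return max(scores, key=scores.get)


# Example: objects taken from a hypothetical room node of a 3D scene graph.
print(label_room(
    ["a bed", "a nightstand", "a lamp"],
    ["bedroom", "kitchen", "bathroom", "living room"],
))
```

In the paper's setting, the object list would come from the object nodes attached to a room node in a 3D scene graph produced by a spatial perception system; because the candidate labels are just strings, previously unseen room or object labels can be swapped in without any retraining.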
Related papers
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning [55.517000360348725]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
Experiments with both indoor and outdoor scenes demonstrated the effectiveness of our approach in both data-efficient learning and open-world few-shot learning.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning [125.90002884194838]
ConceptGraphs is an open-vocabulary graph-structured representation for 3D scenes.
It is built by leveraging 2D foundation models and fusing their output to 3D by multi-view association.
We demonstrate the utility of this representation through a number of downstream planning tasks.
arXiv Detail & Related papers (2023-09-28T17:53:38Z)
- SG-Bot: Object Rearrangement via Coarse-to-Fine Robotic Imagination on Scene Graphs [81.15889805560333]
We present SG-Bot, a novel rearrangement framework.
SG-Bot is lightweight, real-time, and user-controllable.
Experimental results demonstrate that SG-Bot outperforms competitors by a large margin.
arXiv Detail & Related papers (2023-09-21T15:54:33Z)
- ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding [67.21613160846299]
Embodied Reference Understanding (ERU) is first proposed to address this concern.
A new dataset called ScanERU is constructed to evaluate the effectiveness of this idea.
arXiv Detail & Related papers (2023-03-23T11:36:14Z)
- Learning 6-DoF Fine-grained Grasp Detection Based on Part Affordance Grounding [42.04502185508723]
We propose a new large Language-guided SHape grAsPing datasEt to promote 3D part-level affordance and grasping ability learning.
From the perspective of robotic cognition, we design a two-stage fine-grained robotic grasping framework (named LangPartGPD).
Our method combines the advantages of human-robot collaboration and large language models (LLMs).
Results show our method achieves competitive performance in 3D geometry fine-grained grounding, object affordance inference, and 3D part-aware grasping tasks.
arXiv Detail & Related papers (2023-01-27T07:00:54Z)
- Generalized Object Search [0.9137554315375919]
This thesis develops methods and systems for (multi-)object search in 3D environments under uncertainty.
I implement a robot-independent, environment-agnostic system for generalized object search in 3D.
I deploy it on the Boston Dynamics Spot robot, the Kinova MOVO robot, and the Universal Robots UR5e robotic arm.
arXiv Detail & Related papers (2023-01-24T16:41:36Z)
- Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding [25.860680905256174]
We investigate the use of pre-trained language models to impart common sense for scene understanding.
We find that the best approaches in both categories yield ~70% room classification accuracy.
arXiv Detail & Related papers (2022-09-12T21:36:58Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
- Learning to Reconstruct and Segment 3D Objects [4.709764624933227]
We aim to understand scenes and the objects within them by learning general and robust representations using deep neural networks.
This thesis makes three core contributions, ranging from object-level 3D shape estimation from single or multiple views to scene-level semantic understanding.
arXiv Detail & Related papers (2020-10-19T15:09:04Z)