Open-vocabulary Queryable Scene Representations for Real World Planning
- URL: http://arxiv.org/abs/2209.09874v1
- Date: Tue, 20 Sep 2022 17:29:56 GMT
- Title: Open-vocabulary Queryable Scene Representations for Real World Planning
- Authors: Boyuan Chen and Fei Xia and Brian Ichter and Kanishka Rao and
Keerthana Gopalakrishnan and Michael S. Ryoo and Austin Stone and Daniel
Kappler
- Abstract summary: Large language models (LLMs) have unlocked new capabilities of task planning from human instructions.
However, prior attempts to apply LLMs to real-world robotic tasks are limited by the lack of grounding in the surrounding scene.
We develop NLMap, an open-vocabulary and queryable scene representation to address this problem.
- Score: 56.175724306976505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have unlocked new capabilities of task planning
from human instructions. However, prior attempts to apply LLMs to real-world
robotic tasks are limited by the lack of grounding in the surrounding scene. In
this paper, we develop NLMap, an open-vocabulary and queryable scene
representation to address this problem. NLMap serves as a framework to gather
and integrate contextual information into LLM planners, allowing them to see
and query available objects in the scene before generating a
context-conditioned plan. NLMap first establishes a natural language queryable
scene representation with visual language models (VLMs). An LLM-based object
proposal module parses instructions and proposes involved objects to query the
scene representation for object availability and location. An LLM planner then
plans with such information about the scene. NLMap allows robots to operate
without a fixed list of objects or executable options, enabling real robot
operation unachievable by previous methods. Project website:
https://nlmap-saycan.github.io
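The abstract describes a three-stage pipeline: a VLM builds a language-queryable scene representation, an LLM proposes the objects an instruction involves, and an LLM planner plans over what is actually found. The sketch below is a hedged reconstruction of that query flow; all names and interfaces are invented for illustration, the embedding function is a placeholder for a CLIP-style VLM encoder, and none of it is taken from the released system.

```python
# Hedged sketch of an NLMap-style query flow; names and interfaces are invented
# for illustration, not taken from the authors' implementation.
from dataclasses import dataclass

import numpy as np


@dataclass
class Region:
    embedding: np.ndarray                 # VLM image embedding of a scene region
    position: tuple[float, float, float]  # region location in the map frame


def embed_text(phrase: str) -> np.ndarray:
    """Placeholder for the VLM text encoder (e.g. a CLIP-style model)."""
    rng = np.random.default_rng(abs(hash(phrase)) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)


def propose_objects(instruction: str) -> list[str]:
    """Placeholder for the LLM object-proposal module, e.g.
    'warm up my lunch' -> ['food', 'microwave']."""
    return ["food", "microwave"]


def query_scene(scene: list[Region], phrase: str, threshold: float = 0.3) -> Region | None:
    """Return the best-matching region for a phrase, or None if unavailable."""
    query = embed_text(phrase)
    best = max(scene, key=lambda r: float(query @ r.embedding), default=None)
    if best is None or float(query @ best.embedding) < threshold:
        return None
    return best


def plan(instruction: str, scene: list[Region]) -> list[str]:
    """Context-conditioned planning stub: only available objects enter the plan."""
    proposals = propose_objects(instruction)
    found = {name: query_scene(scene, name) for name in proposals}
    # A real LLM planner would be prompted with the instruction plus the
    # availability and locations gathered above; we return a trivial plan here.
    return [f"go to {name} at {region.position}"
            for name, region in found.items() if region is not None]
```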
Related papers
- Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning [15.03025428687218]
Object State-Sensitive Agent (OSSA) is a task-planning agent empowered by pre-trained neural networks.
We propose two methods for OSSA: (i) a modular model consisting of a pre-trained vision processing module and a large language model (LLM), and (ii) a monolithic model consisting only of a VLM.
Our results show that both methods can be used for object state-sensitive tasks, but the monolithic approach outperforms the modular approach.
arXiv Detail & Related papers (2024-06-14T12:52:42Z)
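The entry above contrasts a modular and a monolithic architecture. The toy sketch below only illustrates that contrast; `vision_module`, `llm`, and `vlm` are assumed callables, not components from the paper.

```python
# Toy illustration of the modular-vs-monolithic contrast described above.
# `vision_module`, `llm`, and `vlm` are assumed callables, not OSSA components.

def modular_plan(image, vision_module, llm) -> str:
    """Modular variant: a perception model reports object states as text,
    and a separate LLM plans from that text."""
    object_states = vision_module(image)   # e.g. {"cup": "dirty", "cloth": "wet"}
    prompt = f"Object states: {object_states}. Produce a tidying plan."
    return llm(prompt)


def monolithic_plan(image, vlm) -> str:
    """Monolithic variant: a single VLM sees the image and plans directly."""
    return vlm(image, "List object states, then produce a tidying plan.")
```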
- Exploring Unseen Environments with Robots using Large Language and Vision Models through a Procedurally Generated 3D Scene Representation [0.979851640406258]
This work focuses on solving the object goal navigation problem by mimicking human cognition.
We introduce a comprehensive framework capable of exploring an unfamiliar environment in search of an object.
A challenging task in using LLMs to generate high-level sub-goals is to efficiently represent the environment around the robot.
arXiv Detail & Related papers (2024-03-30T10:54:59Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent [23.134180979449823]
3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment.
We propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline.
Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries.
arXiv Detail & Related papers (2023-09-21T17:59:45Z)
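As described above, the LLM acts as an agent that decomposes a complex query and reasons over candidates returned by an open-vocabulary 3D grounding tool. The sketch below is a hedged reconstruction of that loop; all three callables are assumed interfaces rather than the released pipeline.

```python
# Hedged reconstruction of an LLM-as-agent grounding loop in the spirit of the
# entry above; the callables are assumed interfaces, not the paper's code.

def ground(query: str, llm_list_phrases, grounding_tool, llm_choose) -> dict:
    # 1. The LLM decomposes the query, e.g. "the chair between the desk and the
    #    bookshelf" -> ["chair", "desk", "bookshelf"].
    phrases = llm_list_phrases(query)
    # 2. The open-vocabulary 3D grounder returns candidate boxes per phrase.
    candidates = {p: grounding_tool(p) for p in phrases}
    # 3. The LLM reasons over spatial relations among candidates and selects one.
    target = llm_choose(query, candidates)
    return {"target": target, "candidates": candidates}
```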
- March in Chat: Interactive Prompting for Remote Embodied Referring Expression [33.64407469423714]
This paper proposes a March-in-Chat (MiC) model that can talk to the LLM on the fly and plan dynamically based on a newly proposed Room-and-Object Aware Scene Perceiver (ROASP).
Our MiC model outperforms the previous state-of-the-art by large margins on the SPL and RGSPL metrics on the REVERIE benchmark.
arXiv Detail & Related papers (2023-08-20T03:00:20Z)
- Embodied Task Planning with Large Language Models [86.63533340293361]
We propose a TAsk Planning Agent (TaPA) for grounded planning in embodied tasks with physical scene constraints.
During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected in different achievable locations.
Experimental results show that the generated plan from our TaPA framework can achieve a higher success rate than LLaVA and GPT-3.5 by a sizable margin.
arXiv Detail & Related papers (2023-07-04T17:58:25Z)
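The summary above describes collecting open-vocabulary detections from multi-view RGB images and planning over the merged object list. Below is a minimal sketch of that grounding step under assumed interfaces (`detector`, `llm`); it is not the released TaPA code.

```python
# Minimal sketch of a TaPA-style grounding step: merge open-vocabulary
# detections across viewpoints, then plan over the resulting object list.
# `detector` and `llm` are assumed callables, not components from the paper.

def build_object_list(images, detector, score_threshold: float = 0.5) -> list[str]:
    """Union of confident open-vocabulary detections across viewpoints."""
    names: set[str] = set()
    for image in images:
        for name, score in detector(image):   # detector yields (label, confidence)
            if score >= score_threshold:
                names.add(name)
    return sorted(names)


def plan_with_scene(instruction: str, objects: list[str], llm) -> str:
    prompt = (
        f"Objects in the scene: {', '.join(objects)}.\n"
        f"Instruction: {instruction}\n"
        "Write a step-by-step plan using only the listed objects."
    )
    return llm(prompt)
```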
- A Picture is Worth a Thousand Words: Language Models Plan from Pixels [53.85753597586226]
Planning is an important capability of artificial agents that perform long-horizon tasks in real-world environments.
In this work, we explore the use of pre-trained language models (PLMs) to reason about plan sequences from text instructions in embodied visual environments.
arXiv Detail & Related papers (2023-03-16T02:02:18Z)
- ProgPrompt: Generating Situated Robot Task Plans using Large Language Models [68.57918965060787]
Large language models (LLMs) can be used to score potential next actions during task planning.
We present a programmatic LLM prompt structure that enables plan generation to function across situated environments.
arXiv Detail & Related papers (2022-09-22T20:29:49Z)
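The programmatic prompt structure mentioned above can be illustrated as follows; this is a hedged reconstruction from the abstract (available actions as imports, scene objects as a list, and an unfinished function the LLM is asked to complete), not text copied from the paper or its released prompts.

```python
# Hedged illustration of a ProgPrompt-style programmatic prompt. The action
# names, objects, and task are examples, not taken from the paper.

PROMPT = '''\
from actions import grab, putin, open, close, walk

objects = ["salmon", "microwave", "fridge", "plate"]

def microwave_salmon():
    # 1: take salmon out of the fridge
    open("fridge")
    grab("salmon")
    close("fridge")
    # 2: heat it
    open("microwave")
    putin("salmon", "microwave")
    close("microwave")

def put_salmon_on_plate():
'''  # the LLM completes the body of the last function as an executable plan
```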
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [111.33545170562337]
We investigate the possibility of grounding high-level tasks, expressed in natural language, to a chosen set of actionable steps.
We find that if pre-trained LMs are large enough and prompted appropriately, they can effectively decompose high-level tasks into low-level plans.
We propose a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions.
arXiv Detail & Related papers (2022-01-18T18:59:45Z)
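The "semantic translation" step mentioned in the entry above maps each free-form step generated by the language model to the nearest admissible action. A minimal sketch under an assumed sentence-encoder interface (`embed`) is given below; it is not the authors' implementation.

```python
# Minimal sketch of semantic translation via embedding similarity: a generated
# step is mapped to the closest admissible action. `embed` is an assumed
# sentence encoder returning a 1-D numpy vector.
import numpy as np


def translate(step: str, admissible: list[str], embed) -> str:
    """Map a free-form generated step to the most similar admissible action."""
    q = embed(step)
    q = q / np.linalg.norm(q)
    scores = []
    for action in admissible:
        a = embed(action)
        scores.append(float(q @ (a / np.linalg.norm(a))))
    return admissible[int(np.argmax(scores))]
```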
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.