Open-vocabulary Queryable Scene Representations for Real World Planning
- URL: http://arxiv.org/abs/2209.09874v1
- Date: Tue, 20 Sep 2022 17:29:56 GMT
- Title: Open-vocabulary Queryable Scene Representations for Real World Planning
- Authors: Boyuan Chen and Fei Xia and Brian Ichter and Kanishka Rao and
Keerthana Gopalakrishnan and Michael S. Ryoo and Austin Stone and Daniel
Kappler
- Abstract summary: Large language models (LLMs) have unlocked new capabilities of task planning from human instructions.
However, prior attempts to apply LLMs to real-world robotic tasks are limited by the lack of grounding in the surrounding scene.
We develop NLMap, an open-vocabulary and queryable scene representation to address this problem.
- Score: 56.175724306976505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have unlocked new capabilities of task planning
from human instructions. However, prior attempts to apply LLMs to real-world
robotic tasks are limited by the lack of grounding in the surrounding scene. In
this paper, we develop NLMap, an open-vocabulary and queryable scene
representation to address this problem. NLMap serves as a framework to gather
and integrate contextual information into LLM planners, allowing them to see
and query available objects in the scene before generating a
context-conditioned plan. NLMap first establishes a natural language queryable
scene representation with Visual Language models (VLMs). An LLM based object
proposal module parses instructions and proposes involved objects to query the
scene representation for object availability and location. An LLM planner then
plans with such information about the scene. NLMap allows robots to operate
without a fixed list of objects or executable options, enabling real robot
operation unachievable by previous methods. Project website:
https://nlmap-saycan.github.io
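The pipeline described in the abstract can be pictured with a small sketch. The snippet below is a hedged illustration, not the authors' code: embed_text is a placeholder standing in for a real VLM encoder (text and image-region towers alike), propose_objects is a hard-coded stand-in for the LLM-based object proposal module, and the map itself is just a list of (position, embedding) pairs queried by cosine similarity.

```python
# Illustrative sketch of an NLMap-style queryable scene representation.
# NOT the paper's implementation: embed_text stands in for a real VLM encoder,
# and propose_objects stands in for the LLM-based object proposal module.
from dataclasses import dataclass
import numpy as np


def embed_text(text: str) -> np.ndarray:
    """Placeholder encoder: a deterministic pseudo-random unit vector per string."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)


@dataclass
class SceneEntry:
    position: tuple          # (x, y, z) where the image region was observed
    embedding: np.ndarray    # VLM embedding of that region


class NLMapSketch:
    """Natural-language queryable map: a bag of (position, embedding) entries."""

    def __init__(self):
        self.entries: list[SceneEntry] = []

    def add_observation(self, position, region_embedding):
        self.entries.append(SceneEntry(position, region_embedding))

    def query(self, object_name: str, threshold: float = 0.2):
        """Return (score, position) pairs whose embedding matches the text query."""
        q = embed_text(object_name)
        hits = [(float(np.dot(q, e.embedding)), e.position) for e in self.entries]
        return sorted([h for h in hits if h[0] > threshold], reverse=True)


def propose_objects(instruction: str) -> list[str]:
    """Stand-in for the LLM object-proposal step; a real system prompts an LLM."""
    return ["sponge", "sink"] if "wipe" in instruction else []


# Usage: check availability and location first, then hand the result to a planner.
scene = NLMapSketch()
scene.add_observation((1.0, 0.4, 0.8), embed_text("sponge"))  # fake observation
for obj in propose_objects("wipe the table"):
    print(obj, "->", scene.query(obj)[:1] or "not found in scene")
```

In a real system, the queried availability and locations would be passed to an LLM planner as context before it generates the plan.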
Related papers
- Tag Map: A Text-Based Map for Spatial Reasoning and Navigation with Large Language Models [15.454856838083511]
Large Language Models (LLMs) have emerged as a tool for robots to generate task plans using common sense reasoning.
Recent works have shifted from explicit maps with fixed semantic classes to implicit open vocabulary maps.
We propose an explicit text-based map that can represent thousands of semantic classes while easily integrating with LLMs (a minimal sketch of such a map follows the list below).
arXiv Detail & Related papers (2024-09-23T18:26:19Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
Vision Language Models (VLMs) can process state information as visual-textual prompts and respond with policy decisions in text.
We propose LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as conversations.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent [23.134180979449823]
3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment.
We propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline.
Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries.
arXiv Detail & Related papers (2023-09-21T17:59:45Z)
- March in Chat: Interactive Prompting for Remote Embodied Referring Expression [33.64407469423714]
This paper proposes a March-in-Chat (MiC) model that can talk to the LLM on the fly and plan dynamically based on a newly proposed Room-and-Object Aware Scene Perceiver (ROASP).
Our MiC model outperforms the previous state-of-the-art by large margins on the SPL and RGSPL metrics on the REVERIE benchmark.
arXiv Detail & Related papers (2023-08-20T03:00:20Z)
- Embodied Task Planning with Large Language Models [86.63533340293361]
We propose a TAsk Planning Agent (TaPA) for grounded planning in embodied tasks with physical scene constraints.
During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected in different achievable locations.
Experimental results show that the plans generated by our TaPA framework achieve a higher success rate than LLaVA and GPT-3.5 by a sizable margin.
arXiv Detail & Related papers (2023-07-04T17:58:25Z)
- A Picture is Worth a Thousand Words: Language Models Plan from Pixels [53.85753597586226]
Planning is an important capability of artificial agents that perform long-horizon tasks in real-world environments.
In this work, we explore the use of pre-trained language models (PLMs) to reason about plan sequences from text instructions in embodied visual environments.
arXiv Detail & Related papers (2023-03-16T02:02:18Z)
- ProgPrompt: Generating Situated Robot Task Plans using Large Language Models [68.57918965060787]
Large language models (LLMs) can be used to score potential next actions during task planning.
We present a programmatic LLM prompt structure that enables plan generation that works across situated environments.
arXiv Detail & Related papers (2022-09-22T20:29:49Z)
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [111.33545170562337]
We investigate the possibility of grounding high-level tasks, expressed in natural language, to a chosen set of actionable steps.
We find that if pre-trained LMs are large enough and prompted appropriately, they can effectively decompose high-level tasks into low-level plans.
We propose a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions (a toy sketch of this translation step also follows the list below).
arXiv Detail & Related papers (2022-01-18T18:59:45Z)
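Two of the related papers describe ideas concrete enough to sketch. First, for the Tag Map entry above, a minimal, hypothetical sketch of an explicit text-based map, assuming nothing more than a tag-to-region dictionary serialized into prompt text; the class name TextTagMap and the example tags are my own, not the paper's.

```python
# Hypothetical sketch of an explicit text-based map in the spirit of the
# Tag Map entry above: semantic tags mapped to coarse regions, serialized as
# plain text an LLM planner can read. Class and example names are made up.
from collections import defaultdict


class TextTagMap:
    def __init__(self):
        self.tags = defaultdict(set)   # tag -> set of region names

    def add(self, tag: str, region: str):
        self.tags[tag].add(region)

    def to_prompt(self) -> str:
        """Render the map as lines suitable for inclusion in an LLM prompt."""
        lines = [f"- {tag}: {', '.join(sorted(regions))}"
                 for tag, regions in sorted(self.tags.items())]
        return "Objects seen so far and where they were seen:\n" + "\n".join(lines)


tag_map = TextTagMap()
tag_map.add("coffee machine", "kitchen counter")
tag_map.add("mug", "kitchen counter")
tag_map.add("mug", "office desk")
print(tag_map.to_prompt())
# An LLM planner could be given this text plus an instruction such as
# "bring me a coffee" and asked which regions to visit, and in what order.
```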
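Second, for the Language Models as Zero-Shot Planners entry, the semantic translation of free-form plan steps to admissible actions is commonly realized as nearest-neighbour matching in an embedding space. The sketch below uses a toy bag-of-words encoder purely for illustration; the paper itself relies on a pretrained sentence-embedding model, and the action names here are invented for the example.

```python
# Toy sketch of translating a free-form LLM plan step into the closest
# admissible action via embedding similarity. The encoder below is a crude
# bag-of-words placeholder for a pretrained sentence-embedding model.
import numpy as np

VOCAB = ["open", "close", "fridge", "pick", "up", "apple", "walk", "kitchen"]
ADMISSIBLE_ACTIONS = ["open fridge", "pick up apple", "walk to kitchen", "close fridge"]


def encode(text: str) -> np.ndarray:
    """Bag-of-words unit vector over a tiny fixed vocabulary (placeholder encoder)."""
    tokens = text.lower().split()
    v = np.array([float(w in tokens) for w in VOCAB])
    n = np.linalg.norm(v)
    return v / n if n else v


def to_admissible(step: str) -> tuple[str, float]:
    """Snap a generated step onto the most similar admissible action."""
    q = encode(step)
    score, action = max((float(np.dot(q, encode(a))), a) for a in ADMISSIBLE_ACTIONS)
    return action, score


# Example: a free-form step produced by an LLM is mapped onto the fixed
# action vocabulary the robot can actually execute.
print(to_admissible("please pick the apple up from the counter"))
```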
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.