Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
- URL: http://arxiv.org/abs/2502.19417v1
- Date: Wed, 26 Feb 2025 18:58:41 GMT
- Title: Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
- Authors: Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, Adrian Li-Bell, Danny Driess, Lachy Groom, Sergey Levine, Chelsea Finn
- Abstract summary: Generalist robots must be able to process complex instructions, prompts, and even feedback during task execution. We describe a system that uses vision-language models in a hierarchical structure. We evaluate our system across three robotic platforms, including single-arm, dual-arm, and dual-arm mobile robots.
- Score: 76.1979254112106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalist robots that can perform a range of different tasks in open-world settings must be able to not only reason about the steps needed to accomplish their goals, but also process complex instructions, prompts, and even feedback during task execution. Intricate instructions (e.g., "Could you make me a vegetarian sandwich?" or "I don't like that one") require not just the ability to physically perform the individual steps, but the ability to situate complex commands and feedback in the physical world. In this work, we describe a system that uses vision-language models in a hierarchical structure, first reasoning over complex prompts and user feedback to deduce the most appropriate next step to fulfill the task, and then performing that step with low-level actions. In contrast to direct instruction following methods that can fulfill simple commands ("pick up the cup"), our system can reason through complex prompts and incorporate situated feedback during task execution ("that's not trash"). We evaluate our system across three robotic platforms, including single-arm, dual-arm, and dual-arm mobile robots, demonstrating its ability to handle tasks such as cleaning messy tables, making sandwiches, and grocery shopping.
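As a rough sketch of the hierarchical structure the abstract describes, a high-level vision-language model that reasons over the prompt and user feedback to pick the next step, and a low-level policy that executes that step as motor actions, the Python below shows one possible control loop. All names (HighLevelVLM, LowLevelPolicy, FakeRobot, run_episode) are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of a hierarchical vision-language control loop, as described
# in the abstract: a high-level model proposes the next language subtask and a
# low-level policy turns it into motor actions. All classes are hypothetical.
from dataclasses import dataclass, field


@dataclass
class HighLevelVLM:
    """Reasons over the image, user prompt, and feedback to pick the next step."""
    history: list = field(default_factory=list)

    def next_subtask(self, image, prompt, feedback):
        # A real system would query a VLM here; this stub only records context.
        context = f"prompt={prompt!r}, feedback={feedback!r}"
        self.history.append(context)
        return f"next step for {context}"


@dataclass
class LowLevelPolicy:
    """Maps an observation plus a language subtask to a short chunk of actions."""
    def act(self, image, subtask):
        # Placeholder: a real policy would output joint or end-effector commands.
        return [[0.0, 0.0, 0.0]]


def run_episode(robot, high, low, prompt, steps=5):
    """Alternate high-level reasoning with low-level execution, folding in feedback."""
    feedback = None
    for _ in range(steps):
        image = robot.get_image()
        subtask = high.next_subtask(image, prompt, feedback)  # e.g. "pick up the bread"
        for action in low.act(image, subtask):
            robot.send_action(action)
        feedback = robot.poll_user_feedback()  # e.g. "that's not trash"


class FakeRobot:
    """Stand-in robot interface so the sketch runs without hardware."""
    def get_image(self):
        return None

    def send_action(self, action):
        pass

    def poll_user_feedback(self):
        return None


if __name__ == "__main__":
    run_episode(FakeRobot(), HighLevelVLM(), LowLevelPolicy(),
                "Could you make me a vegetarian sandwich?", steps=2)
```

In this sketch, user feedback is folded back into the high-level model on the next iteration, so a correction such as "that's not trash" changes the chosen step rather than the low-level motor commands, mirroring the division of labor the abstract describes.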
Related papers
- Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models [21.72355258499675]
We present Manual2Skill, a novel framework that enables robots to perform complex assembly tasks guided by high-level manual instructions. Our approach leverages a Vision-Language Model (VLM) to extract structured information from instructional images. We demonstrate the effectiveness of Manual2Skill by successfully assembling several real-world IKEA furniture items.
arXiv Detail & Related papers (2025-02-14T11:25:24Z)
- COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models [49.24666980374751]
COHERENT is a novel LLM-based task planning framework for collaboration of heterogeneous multi-robot systems.
A Proposal-Execution-Feedback-Adjustment mechanism is designed to decompose and assign actions for individual robots.
The experimental results show that our work surpasses the previous methods by a large margin in terms of success rate and execution efficiency.
arXiv Detail & Related papers (2024-09-23T15:53:41Z)
- Enabling robots to follow abstract instructions and complete complex dynamic tasks [4.514939211420443]
We present a novel framework that combines Large Language Models, a curated Knowledge Base, and Integrated Force and Visual Feedback (IFVF).
Our approach interprets abstract instructions, performs long-horizon tasks, and handles various uncertainties.
Our findings are illustrated in an accompanying video and supported by an open-source GitHub repository.
arXiv Detail & Related papers (2024-06-17T05:55:35Z)
- NaturalVLM: Leveraging Fine-grained Natural Language for Affordance-Guided Visual Manipulation [21.02437461550044]
Many real-world tasks demand intricate multi-step reasoning.
We introduce a benchmark, NrVLM, comprising 15 distinct manipulation tasks.
We propose a novel learning framework that completes the manipulation task step-by-step according to the fine-grained instructions.
arXiv Detail & Related papers (2024-03-13T09:12:16Z)
- Verifiably Following Complex Robot Instructions with Foundation Models [16.564788361518197]
When instructing robots, people want to flexibly express constraints, refer to arbitrary landmarks, and verify robot behavior.
We propose Language Instruction grounding for Motion Planning (LIM), an approach that enables robots to verifiably follow expressive and complex open-ended instructions.
LIM constructs a symbolic instruction representation that reveals the robot's alignment with the instructor's intended goals.
arXiv Detail & Related papers (2024-02-18T08:05:54Z)
- Interactive Task Planning with Language Models [89.5839216871244]
An interactive robot framework accomplishes long-horizon task planning and can easily generalize to new goals and distinct tasks, even during execution. Recent large language model based approaches can allow for more open-ended planning but often require heavy prompt engineering or domain specific pretrained models. We propose a simple framework that achieves interactive task planning with language models by incorporating both high-level planning and low-level skill execution.
arXiv Detail & Related papers (2023-10-16T17:59:12Z)
- Using Both Demonstrations and Language Instructions to Efficiently Learn Robotic Tasks [21.65346551790888]
DeL-TaCo is a method for conditioning a robotic policy on task embeddings comprised of two components: a visual demonstration and a language instruction.
To our knowledge, this is the first work to show that simultaneously conditioning a multi-task robotic manipulation policy on both demonstration and language embeddings improves sample efficiency and generalization over conditioning on either modality alone.
arXiv Detail & Related papers (2022-10-10T08:06:58Z)
- VIMA: General Robot Manipulation with Multimodal Prompts [82.01214865117637]
We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts.
We develop a new simulation benchmark that consists of thousands of procedurally-generated tabletop tasks.
We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively.
arXiv Detail & Related papers (2022-10-06T17:50:11Z)
- Decomposed Prompting: A Modular Approach for Solving Complex Tasks [55.42850359286304]
We propose Decomposed Prompting to solve complex tasks by decomposing them (via prompting) into simpler sub-tasks.
This modular structure allows each prompt to be optimized for its specific sub-task.
We show that the flexibility and modularity of Decomposed Prompting allows it to outperform prior work on few-shot prompting.
arXiv Detail & Related papers (2022-10-05T17:28:20Z)
- ProgPrompt: Generating Situated Robot Task Plans using Large Language Models [68.57918965060787]
Large language models (LLMs) can be used to score potential next actions during task planning.
We present a programmatic LLM prompt structure that enables plan generation functional across situated environments.
arXiv Detail & Related papers (2022-09-22T20:29:49Z)
- DeComplex: Task planning from complex natural instructions by a collocating robot [3.158346511479111]
It is not trivial to execute human-intended tasks, as natural language expressions can have large linguistic variations.
Existing works assume either single task instruction is given to the robot at a time or there are multiple independent tasks in an instruction.
We propose a method to find the intended order of execution of multiple inter-dependent tasks given in a natural language instruction.
arXiv Detail & Related papers (2020-08-23T18:10:24Z)
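As a toy illustration of the ordering problem DeComplex targets, finding a valid execution order for inter-dependent sub-tasks extracted from a single instruction, the snippet below topologically sorts a hypothetical dependency graph. The sub-tasks, dependency edges, and use of Python's graphlib are illustrative assumptions, not the paper's method.

```python
# Toy example: order inter-dependent sub-tasks extracted from one instruction,
# e.g. "after clearing the table, put the cup in the sink and wipe the table".
# The sub-tasks and dependencies are hypothetical; topological sorting stands
# in for the paper's actual ordering method.
from graphlib import TopologicalSorter

# Map each sub-task to the sub-tasks that must finish before it can start.
dependencies = {
    "clear table": set(),
    "put cup in sink": {"clear table"},
    "wipe table": {"clear table"},
}

execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
# One valid order: ['clear table', 'put cup in sink', 'wipe table']
```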