Enabling robots to follow abstract instructions and complete complex dynamic tasks
- URL: http://arxiv.org/abs/2406.11231v1
- Date: Mon, 17 Jun 2024 05:55:35 GMT
- Title: Enabling robots to follow abstract instructions and complete complex dynamic tasks
- Authors: Ruaridh Mon-Williams, Gen Li, Ran Long, Wenqian Du, Chris Lucas,
- Abstract summary: We present a novel framework that combines Large Language Models, a curated Knowledge Base, and Integrated Force and Visual Feedback (IFVF)
Our approach interprets abstract instructions, performs long-horizon tasks, and handles various uncertainties.
Our findings are illustrated in an accompanying video and supported by an open-source GitHub repository.
- Score: 4.514939211420443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Completing complex tasks in unpredictable settings like home kitchens challenges robotic systems. These challenges include interpreting high-level human commands, such as "make me a hot beverage", and performing actions like pouring a precise amount of water into a moving mug. To address these challenges, we present a novel framework that combines Large Language Models (LLMs), a curated Knowledge Base, and Integrated Force and Visual Feedback (IFVF). Our approach interprets abstract instructions, performs long-horizon tasks, and handles various uncertainties. It utilises GPT-4 to analyse the user's query and surroundings, then generates code that accesses a curated database of functions during execution. It translates abstract instructions into actionable steps. Each step involves generating custom code by employing retrieval-augmented generation to pull IFVF-relevant examples from the Knowledge Base. IFVF allows the robot to respond to noise and disturbances during execution. We use coffee making and plate decoration to demonstrate our approach, including components ranging from pouring to drawer opening, each benefiting from distinct feedback types and methods. This advancement marks significant progress toward a scalable, efficient robotic framework for completing complex tasks in uncertain environments. Our findings are illustrated in an accompanying video and supported by an open-source GitHub repository (released upon paper acceptance).
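The pipeline the abstract describes — an LLM plans steps from an abstract instruction, each step retrieves a feedback-aware skill from a curated Knowledge Base, and execution is tagged with the relevant feedback modality — can be sketched as follows. This is a minimal illustrative mock-up, not the authors' released code: all names (`plan_steps`, `KnowledgeBase`, `Skill`) and the keyword-overlap retrieval are assumptions standing in for the GPT-4 call and the paper's retrieval-augmented lookup.

```python
# Hypothetical sketch of the abstract's pipeline. The LLM planner and
# retrieval mechanism are stubbed; names are illustrative, not the paper's API.
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    feedback: str        # "force", "visual", or "both" (IFVF)
    keywords: tuple

class KnowledgeBase:
    """Curated store of skills; keyword overlap stands in for the paper's
    retrieval-augmented lookup of IFVF-relevant examples."""
    def __init__(self, skills):
        self.skills = skills

    def retrieve(self, step: str) -> Skill:
        words = set(step.lower().split())
        return max(self.skills, key=lambda s: len(words & set(s.keywords)))

def plan_steps(instruction: str) -> list:
    """Stand-in for the GPT-4 planning call: map an abstract instruction
    to a fixed long-horizon plan."""
    plans = {
        "make me a hot beverage": [
            "open the drawer for a mug",
            "pour hot water into the mug",
        ],
    }
    return plans.get(instruction.lower(), [])

def execute(instruction: str, kb: KnowledgeBase) -> list:
    """Translate the instruction into steps, then pair each step with the
    retrieved skill and its feedback modality."""
    log = []
    for step in plan_steps(instruction):
        skill = kb.retrieve(step)
        log.append((step, skill.name, skill.feedback))
    return log

kb = KnowledgeBase([
    Skill("pour_liquid", "both", ("pour", "water", "mug")),
    Skill("open_drawer", "force", ("open", "drawer")),
])

for step, skill, fb in execute("Make me a hot beverage", kb):
    print(f"{step} -> {skill} (feedback: {fb})")
```

In the actual system, each retrieved example would seed code generation for that step, and the force/visual feedback would close the loop against disturbances (e.g. a moving mug) at execution time.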
Related papers
- DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control [53.80518003412016]
Building a general-purpose intelligent home-assistant agent skilled in diverse tasks by human commands is a long-term blueprint of embodied AI research.
We study primitive mobile manipulations for embodied agents, i.e. how to navigate and interact based on an instructed verb-noun pair.
We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls.
arXiv Detail & Related papers (2024-07-20T05:39:28Z) - ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning [74.58666091522198]
We present a framework for intuitive robot programming by non-experts.
We leverage natural language prompts and contextual information from the Robot Operating System (ROS)
Our system integrates large language models (LLMs), enabling non-experts to articulate task requirements to the system through a chat interface.
arXiv Detail & Related papers (2024-06-28T08:28:38Z) - Scaling Instructable Agents Across Many Simulated Worlds [71.1284502230496]
Our goal is to develop an agent that can accomplish anything a human can do in any simulated 3D environment.
Our approach focuses on language-driven generality while imposing minimal assumptions.
Our agents interact with environments in real-time using a generic, human-like interface.
arXiv Detail & Related papers (2024-03-13T17:50:32Z) - NaturalVLM: Leveraging Fine-grained Natural Language for Affordance-Guided Visual Manipulation [21.02437461550044]
Many real-world tasks demand intricate multi-step reasoning.
We introduce a benchmark, NrVLM, comprising 15 distinct manipulation tasks.
We propose a novel learning framework that completes the manipulation task step-by-step according to the fine-grained instructions.
arXiv Detail & Related papers (2024-03-13T09:12:16Z) - MOKA: Open-Vocabulary Robotic Manipulation through Mark-Based Visual Prompting [106.53784213239479]
We present MOKA (Marking Open-vocabulary Keypoint Affordances), an approach that employs vision language models to solve robotic manipulation tasks.
At the heart of our approach is a compact point-based representation of affordance and motion that bridges the VLM's predictions on RGB images and the robot's motions in the physical world.
We evaluate and analyze MOKA's performance on a variety of manipulation tasks specified by free-form language descriptions.
arXiv Detail & Related papers (2024-03-05T18:08:45Z) - Verifiably Following Complex Robot Instructions with Foundation Models [16.564788361518197]
People want to flexibly express constraints, refer to arbitrary landmarks, and verify behavior when instructing robots.
We propose Language Instruction grounding for Motion Planning (LIM), an approach that enables robots to verifiably follow expressive and complex open-ended instructions.
LIM constructs a symbolic instruction representation that reveals the robot's alignment with the instructor's intent.
arXiv Detail & Related papers (2024-02-18T08:05:54Z) - Fully Automated Task Management for Generation, Execution, and Evaluation: A Framework for Fetch-and-Carry Tasks with Natural Language Instructions in Continuous Space [1.2691047660244337]
This paper aims to develop a framework that enables a robot to execute tasks based on visual information.
We propose a framework for the full automation of the generation, execution, and evaluation of FCOG tasks.
In addition, we introduce an approach to solving the FCOG tasks by dividing them into four distinct subtasks.
arXiv Detail & Related papers (2023-11-07T15:38:09Z) - Octopus: Embodied Vision-Language Programmer from Environmental Feedback [59.772904419928054]
Large vision-language models (VLMs) have achieved substantial progress in multimodal perception and reasoning.
In this paper, we introduce Octopus, a novel VLM designed to proficiently decipher an agent's vision and textual task objectives.
Our design allows the agent to adeptly handle a wide spectrum of tasks, ranging from mundane daily chores in simulators to sophisticated interactions in complex video games.
arXiv Detail & Related papers (2023-10-12T17:59:58Z) - AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation [50.737355245505334]
We propose a novel framework for learning high-level cognitive capabilities in robot manipulation tasks.
The resulting dataset AlphaBlock consists of 35 comprehensive high-level tasks of multi-step text plans and paired observation.
arXiv Detail & Related papers (2023-05-30T09:54:20Z) - MOCA: A Modular Object-Centric Approach for Interactive Instruction Following [19.57344182656879]
We propose a modular architecture that decouples the task into visual perception and action policy.
We evaluate our method on the ALFRED benchmark and empirically validate that it outperforms prior arts.
arXiv Detail & Related papers (2020-12-06T07:59:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.