Composing Pick-and-Place Tasks By Grounding Language
- URL: http://arxiv.org/abs/2102.08094v1
- Date: Tue, 16 Feb 2021 11:29:09 GMT
- Title: Composing Pick-and-Place Tasks By Grounding Language
- Authors: Oier Mees, Wolfram Burgard
- Abstract summary: We present a robot system that follows unconstrained language instructions to pick and place arbitrary objects.
Our approach infers objects and their relationships from input images and language expressions.
Results obtained using a real-world PR2 robot demonstrate the effectiveness of our method.
- Score: 41.075844857146805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Controlling robots to perform tasks via natural language is one of the most
challenging topics in human-robot interaction. In this work, we present a robot
system that follows unconstrained language instructions to pick and place
arbitrary objects and effectively resolves ambiguities through dialogues. Our
approach infers objects and their relationships from input images and language
expressions and can place objects in accordance with the spatial relations
expressed by the user. Unlike previous approaches, we consider grounding not
only for the picking but also for the placement of everyday objects from
language. Specifically, by grounding objects and their spatial relations, we
allow specification of complex placement instructions, e.g. "place it behind
the middle red bowl". Our results obtained using a real-world PR2 robot
demonstrate the effectiveness of our method in understanding pick-and-place
language instructions and sequentially composing them to solve tabletop
manipulation tasks. Videos are available at
http://speechrobot.cs.uni-freiburg.de
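The core idea of the abstract, grounding a placement instruction into a spatial relation and a reference object before computing where to put the picked item, can be pictured with a small sketch. This is not the authors' implementation (which learns grounding from images and language); it is a toy, rule-based stand-in, and every function name, detection, and offset below is a hypothetical choice made only for illustration.

```python
# Toy sketch: decompose a placement instruction into a spatial relation and a
# referring expression, then turn a grounded relation into a table-plane target.
# The real system uses learned grounding networks; these rules are stand-ins.
from dataclasses import dataclass
from typing import Dict, Tuple

# Offsets (in metres) that convert a spatial relation into a placement target
# relative to the reference object's centroid on the table plane.
RELATION_OFFSETS: Dict[str, Tuple[float, float]] = {
    "left of": (-0.15, 0.0),
    "right of": (0.15, 0.0),
    "behind": (0.0, 0.15),
    "in front of": (0.0, -0.15),
}

@dataclass
class PlacementGoal:
    relation: str                    # grounded spatial relation
    reference_expr: str              # referring expression for the reference object
    target_xy: Tuple[float, float]   # placement coordinates on the table

def ground_placement(instruction: str,
                     detections: Dict[str, Tuple[float, float]]) -> PlacementGoal:
    """Map a placement instruction and detected object centroids to a goal.

    `detections` maps referring expressions (e.g. "middle red bowl") to
    table-plane centroids; in the real system these come from perception and
    grounding models, not from a dictionary.
    """
    text = instruction.lower()
    for relation, (dx, dy) in RELATION_OFFSETS.items():
        if relation in text:
            reference_expr = text.split(relation, 1)[1].strip(" .")
            ref_xy = detections[reference_expr.removeprefix("the ").strip()]
            return PlacementGoal(relation, reference_expr,
                                 (ref_xy[0] + dx, ref_xy[1] + dy))
    raise ValueError(f"No known spatial relation found in: {instruction!r}")

if __name__ == "__main__":
    scene = {"middle red bowl": (0.40, 0.10)}  # hypothetical detection output
    goal = ground_placement("place it behind the middle red bowl", scene)
    print(goal)  # target 0.15 m behind the bowl's centroid
```

In the actual system the referring expression and the relation are resolved by learned models against the camera image rather than looked up in a dictionary, and ambiguities are resolved through dialogue.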
Related papers
- Context-Aware Command Understanding for Tabletop Scenarios [1.7082212774297747]
This paper presents a novel hybrid algorithm designed to interpret natural human commands in tabletop scenarios.
By integrating multiple sources of information, including speech, gestures, and scene context, the system extracts actionable instructions for a robot.
We discuss the strengths and limitations of the system, with particular focus on how it handles multimodal command interpretation.
arXiv Detail & Related papers (2024-10-08T20:46:39Z)
- Object-Centric Instruction Augmentation for Robotic Manipulation [29.491990994901666]
We introduce the Object-Centric Instruction Augmentation (OCI) framework to augment highly semantic and information-dense language instructions with position cues.
We utilize a Multi-modal Large Language Model (MLLM) to weave knowledge of object locations into natural language instructions.
We demonstrate that robotic manipulator imitation policies trained with our enhanced instructions outperform those relying solely on traditional language instructions.
arXiv Detail & Related papers (2024-01-05T13:54:45Z)
- WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model [92.90127398282209]
This paper investigates the potential of integrating recent Large Language Models (LLMs) with an existing visual grounding and robotic grasping system.
We introduce WALL-E (Embodied Robotic WAiter load lifting with Large Language model) as an example of this integration.
We deploy this LLM-empowered system on the physical robot to provide a more user-friendly interface for the instruction-guided grasping task.
arXiv Detail & Related papers (2023-08-30T11:35:21Z)
- Open-World Object Manipulation using Pre-trained Vision-Language Models [72.87306011500084]
For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary to their sensory observations and actions.
We develop a simple approach, called MOO, which leverages a pre-trained vision-language model to extract object-identifying information.
In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments.
arXiv Detail & Related papers (2023-03-02T01:55:10Z)
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [119.29555551279155]
Large language models can encode a wealth of semantic knowledge about the world.
Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language.
We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions.
arXiv Detail & Related papers (2022-04-04T17:57:11Z)
- Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [80.29069988090912]
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
We propose to leverage offline robot datasets with crowd-sourced natural language labels.
We find that our approach outperforms both goal-image specifications and language conditioned imitation techniques by more than 25%.
arXiv Detail & Related papers (2021-09-02T17:42:13Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
- Spatial Reasoning from Natural Language Instructions for Robot Manipulation [0.5033155053523041]
We propose a two-stage pipelined architecture to perform spatial reasoning on the text input.
All objects in the scene are first localized, and then the natural-language instruction together with the localized coordinates is mapped to start and end coordinates.
The proposed method is used to pick and place playing cards with a robot arm (a minimal sketch of this two-stage structure follows the entry).
arXiv Detail & Related papers (2020-12-26T07:53:19Z)
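The localize-then-map pipeline summarized in the entry above can be pictured with a short, hedged sketch. Everything below (the function names, the stub detector, the placement rule) is a hypothetical illustration of the two-stage structure, not the paper's actual models.

```python
# Hedged sketch of the two-stage pipeline: stage 1 localizes the objects in
# the scene, stage 2 maps the natural-language instruction plus the localized
# coordinates to start (pick) and end (place) coordinates. The paper uses
# learned models for both stages; the stubs and the toy placement rule here
# are assumptions made purely for illustration.
from typing import Dict, Tuple

Coord = Tuple[float, float]

def localize_objects(image) -> Dict[str, Coord]:
    """Stage 1: detect and localize objects (stubbed with fixed outputs)."""
    return {"queen of hearts": (0.30, 0.05), "two of clubs": (0.55, -0.10)}

def map_instruction_to_coords(instruction: str,
                              objects: Dict[str, Coord]) -> Tuple[Coord, Coord]:
    """Stage 2: map the instruction and object coordinates to start/end."""
    text = instruction.lower()
    # Assume the first object mentioned is picked and the second is the reference.
    mentioned = sorted((text.index(name), name) for name in objects if name in text)
    picked, reference = mentioned[0][1], mentioned[1][1]
    start = objects[picked]
    # Toy rule: place the picked card 10 cm to the right of the reference card.
    end = (objects[reference][0] + 0.10, objects[reference][1])
    return start, end

if __name__ == "__main__":
    objects = localize_objects(image=None)
    start, end = map_instruction_to_coords(
        "put the queen of hearts next to the two of clubs", objects)
    print("pick at", start, "place at", end)
```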
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.