Chat with the Environment: Interactive Multimodal Perception Using Large Language Models
- URL: http://arxiv.org/abs/2303.08268v3
- Date: Wed, 11 Oct 2023 16:17:20 GMT
- Title: Chat with the Environment: Interactive Multimodal Perception Using Large Language Models
- Authors: Xufeng Zhao, Mengdi Li, Cornelius Weber, Muhammad Burhan Hafez, and Stefan Wermter
- Abstract summary: Large Language Models (LLMs) have shown remarkable reasoning ability in few-shot robotic planning.
Our study demonstrates that LLMs can provide high-level planning and reasoning skills and control interactive robot behavior in a multimodal environment.
- Score: 19.623070762485494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Programming robot behavior in a complex world faces challenges on multiple levels, from dexterous low-level skills to high-level planning and reasoning. Recent pre-trained Large Language Models (LLMs) have shown remarkable reasoning ability in few-shot robotic planning. However, it remains challenging to ground LLMs in multimodal sensory input and continuous action output, while enabling a robot to interact with its environment and acquire novel information as its policies unfold. We develop a robot interaction scenario with a partially observable state, which requires the robot to decide on a range of epistemic actions in order to sample sensory information across multiple modalities before it can execute the task correctly. We therefore propose the Matcha (Multimodal environment chatting) agent, an interactive perception framework with an LLM as its backbone, whose ability is exploited to instruct epistemic actions, to reason over the resulting multimodal sensations (vision, sound, haptics, proprioception), and to plan the entire task execution based on the interactively acquired information. Our study demonstrates that LLMs can provide high-level planning and reasoning skills and control interactive robot behavior in a multimodal environment, while multimodal modules supplying the context of the environmental state help ground the LLMs and extend their processing ability. The project website can be found at https://matcha-agent.github.io.
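To make the interaction pattern described in the abstract concrete, the following minimal sketch shows how an LLM-driven interactive perception loop of this kind could be wired up in Python. It is an illustration only, not the released Matcha implementation: the callables `query_llm`, `execute_action`, and `describe_perception`, as well as the `EXECUTE:` convention for returning the final plan, are assumptions made for the example.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def interactive_perception(
    task_instruction: str,
    query_llm: Callable[[List[Message]], str],       # wraps any chat-style LLM (assumed helper)
    execute_action: Callable[[str], object],         # performs one epistemic action on the robot (assumed helper)
    describe_perception: Callable[[object], str],    # multimodal modules verbalize the sensory feedback (assumed helper)
    max_steps: int = 10,
) -> str:
    """Let the LLM 'chat with the environment' until it commits to a plan."""
    dialogue: List[Message] = [
        {"role": "system",
         "content": ("You control a robot in a partially observable scene. "
                     "Propose one epistemic action at a time, e.g. 'knock on "
                     "the red block' or 'weigh the cup'. When you have enough "
                     "information, answer with 'EXECUTE: <plan>'.")},
        {"role": "user", "content": task_instruction},
    ]
    for _ in range(max_steps):
        reply = query_llm(dialogue)                   # LLM chooses the next action or the final plan
        dialogue.append({"role": "assistant", "content": reply})
        if reply.startswith("EXECUTE:"):              # enough information has been gathered
            return reply[len("EXECUTE:"):].strip()
        raw_feedback = execute_action(reply)          # robot carries out the epistemic action
        feedback_text = describe_perception(raw_feedback)  # e.g. "knocking produced a hollow sound"
        dialogue.append({"role": "user", "content": feedback_text})
    raise RuntimeError("No executable plan produced within the step budget.")
```

In this sketch the LLM never sees raw sensor data; each modality is first verbalized by its own module before being appended to the dialogue, which is what grounds the language model in the environmental state.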
Related papers
- Commonsense Reasoning for Legged Robot Adaptation with Vision-Language Models [81.55156507635286]
Legged robots are physically capable of navigating a wide variety of environments and overcoming a broad range of obstructions.
Current learning methods often struggle with generalization to the long tail of unexpected situations without heavy human supervision.
We propose a system, VLM-Predictive Control (VLM-PC), combining two key components that we find to be crucial for eliciting on-the-fly, adaptive behavior selection.
arXiv Detail & Related papers (2024-07-02T21:00:30Z)
- MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments [82.67236400004826]
We introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions.
Its MEM module enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities.
arXiv Detail & Related papers (2024-02-01T02:43:20Z)
- Large Language Models for Robotics: Opportunities, Challenges, and Perspectives [46.57277568357048]
Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains.
For embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception.
We propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions.
arXiv Detail & Related papers (2024-01-09T03:22:16Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
Such tasks pose a major challenge to a robot's ability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z)
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
arXiv Detail & Related papers (2023-07-12T07:40:48Z)
- Language to Rewards for Robotic Skill Synthesis [37.21434094015743]
We introduce a new paradigm that harnesses large language models (LLMs) to define reward parameters that can be optimized to accomplish a variety of robotic tasks.
Using reward as the intermediate interface generated by LLMs, we can effectively bridge the gap between high-level language instructions or corrections and low-level robot actions.
arXiv Detail & Related papers (2023-06-14T17:27:10Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- Inner Monologue: Embodied Reasoning through Planning with Language Models [81.07216635735571]
Large Language Models (LLMs) can be applied to domains beyond natural language processing.
LLMs planning in embodied environments need to consider not just which skills to perform, but also how and when to perform them.
We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios.
arXiv Detail & Related papers (2022-07-12T15:20:48Z)