Related papers: WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model

WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model

URL: http://arxiv.org/abs/2308.15962v2
Date: Thu, 31 Aug 2023 13:51:56 GMT
Title: WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model
Authors: Tianyu Wang, Yifan Li, Haitao Lin, Xiangyang Xue, Yanwei Fu
Abstract summary: This paper investigates the potential of integrating the most recent Large Language Models (LLMs) and existing visual grounding and robotic grasping system. We introduce the WALL-E (Embodied Robotic WAiter load lifting with Large Language model) as an example of this integration. We deploy this LLM-empowered system on the physical robot to provide a more user-friendly interface for the instruction-guided grasping task.
Score: 92.90127398282209
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Enabling robots to understand language instructions and react accordingly to visual perception has been a long-standing goal in the robotics research community. Achieving this goal requires cutting-edge advances in natural language processing, computer vision, and robotics engineering. Thus, this paper mainly investigates the potential of integrating the most recent Large Language Models (LLMs) and existing visual grounding and robotic grasping system to enhance the effectiveness of the human-robot interaction. We introduce the WALL-E (Embodied Robotic WAiter load lifting with Large Language model) as an example of this integration. The system utilizes the LLM of ChatGPT to summarize the preference object of the users as a target instruction via the multi-round interactive dialogue. The target instruction is then forwarded to a visual grounding system for object pose and size estimation, following which the robot grasps the object accordingly. We deploy this LLM-empowered system on the physical robot to provide a more user-friendly interface for the instruction-guided grasping task. The further experimental results on various real-world scenarios demonstrated the feasibility and efficacy of our proposed framework. See the project website at: https://star-uu-wang.github.io/WALL-E/

Related papers

TalkWithMachines: Enhancing Human-Robot Interaction for Interpretable Industrial Robotics Through Large/Vision Language Models [1.534667887016089]
The presented paper investigates recent advancements in Large Language Models (LLMs) and Vision Language Models (VLMs) This integration allows robots to understand and execute commands given in natural language and to perceive their environment through visual and/or descriptive inputs. Our paper outlines four LLM-assisted simulated robotic control, which explore (i) low-level control, (ii) the generation of language-based feedback that describes the robot's internal states, (iii) the use of visual information as additional input, and (iv) the use of robot structure information for generating task plans and feedback.
arXiv Detail & Related papers (2024-12-19T23:43:40Z)
$π_0$: A Vision-Language-Action Flow Model for General Robot Control [77.32743739202543]
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people, and its ability to acquire new skills via fine-tuning.
arXiv Detail & Related papers (2024-10-31T17:22:30Z)
Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models [53.22792173053473]
We introduce an interactive robotic manipulation framework called Polaris. Polaris integrates perception and interaction by utilizing GPT-4 alongside grounded vision models. We propose a novel Synthetic-to-Real (Syn2Real) pose estimation pipeline.
arXiv Detail & Related papers (2024-08-15T06:40:38Z)
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation [77.41969287400977]
This paper presents textbfRobotScript, a platform for a deployable robot manipulation pipeline powered by code generation. We also present a benchmark for a code generation benchmark for robot manipulation tasks in free-form natural language. We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms.
arXiv Detail & Related papers (2024-02-22T15:12:00Z)
Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks. We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
Large Language Models for Robotics: A Survey [40.76581696885846]
Large language models (LLMs) possess the ability to process and generate natural language, facilitating efficient interaction and collaboration with robots. This review aims to summarize the applications of LLMs in robotics, delving into their impact and contributions to key areas such as robot control, perception, decision-making, and path planning.
arXiv Detail & Related papers (2023-11-13T10:46:35Z)
Prompt a Robot to Walk with Large Language Models [18.214609570837403]
Large language models (LLMs) pre-trained on vast internet-scale data have showcased remarkable capabilities across diverse domains. We introduce a novel paradigm in which we use few-shot prompts collected from the physical environment. Experiments across various robots and environments validate that our method can effectively prompt a robot to walk.
arXiv Detail & Related papers (2023-09-18T17:50:17Z)
PROGrasp: Pragmatic Human-Robot Communication for Object Grasping [22.182690439449278]
Interactive Object Grasping (IOG) is the task of identifying and grasping the desired object via human-robot natural language interaction. Inspired by pragmatics, we introduce a new IOG task, Pragmatic-IOG, and the corresponding dataset, Intention-oriented Multi-modal Dialogue (IM-Dial) Prograsp performs Pragmatic-IOG by incorporating modules for visual grounding, question asking, object grasping, and most importantly, answer interpretation for pragmatic inference.
arXiv Detail & Related papers (2023-09-14T14:45:47Z)
Incremental Learning of Humanoid Robot Behavior from Natural Interaction and Large Language Models [23.945922720555146]
We propose a system to achieve incremental learning of complex behavior from natural interaction. We integrate the system in the robot cognitive architecture of the humanoid robot ARMAR-6.
arXiv Detail & Related papers (2023-09-08T13:29:05Z)
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control. Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.