ROSGPT_Vision: Commanding Robots Using Only Language Models' Prompts
- URL: http://arxiv.org/abs/2308.11236v2
- Date: Wed, 23 Aug 2023 08:31:16 GMT
- Title: ROSGPT_Vision: Commanding Robots Using Only Language Models' Prompts
- Authors: Bilel Benjdira, Anis Koubaa, Anas M. Ali
- Abstract summary: We argue that the next generation of robots can be commanded using only Language Models' prompts.
This paper names this new robotic design pattern Prompting Robotic Modalities (PRM).
This paper applies this PRM design pattern in building a new robotic framework named ROSGPT_Vision.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, we argue that the next generation of robots can be commanded
using only Language Models' prompts. Every prompt interrogates separately a
specific Robotic Modality via its Modality Language Model (MLM). A central Task
Modality mediates the whole communication to execute the robotic mission via a
Large Language Model (LLM). This paper names this new robotic design pattern
Prompting Robotic Modalities (PRM). Moreover, this paper applies
this PRM design pattern in building a new robotic framework named
ROSGPT_Vision. ROSGPT_Vision allows the execution of a robotic task using only
two prompts: a Visual and an LLM prompt. The Visual Prompt extracts, in natural
language, the visual semantic features related to the task under consideration
(Visual Robotic Modality). Meanwhile, the LLM Prompt regulates the robotic
reaction to the visual description (Task Modality). The framework automates all
the mechanisms behind these two prompts. The framework enables the robot to
address complex real-world scenarios by processing visual data, making informed
decisions, and carrying out actions automatically. The framework comprises one
generic vision module and two independent ROS nodes. As a test application, we
used ROSGPT_Vision to develop CarMate, an application that monitors driver
distraction on the road and issues real-time vocal notifications to the driver. We showed
how ROSGPT_Vision significantly reduced the development cost compared to
traditional methods. We demonstrated how to improve the quality of the
application by optimizing the prompting strategies, without delving into
technical details. ROSGPT_Vision is shared with the community (link:
https://github.com/bilel-bj/ROSGPT_Vision) to advance robotic research in this
direction and to build more robotic frameworks that implement the PRM design
pattern and enable controlling robots using only prompts.
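As a concrete reading of the abstract, the sketch below shows how a Task Modality node could apply an LLM Prompt to natural-language scene descriptions produced by a Visual Robotic Modality, in a CarMate-like driver-monitoring setting. This is a minimal, hypothetical sketch: the node and topic names (task_modality_node, /visual_description, /vocal_notification), the prompt texts, and the query_llm stub are illustrative assumptions and do not reflect the actual ROSGPT_Vision configuration or API.

```python
# Hypothetical sketch of the two-prompt (PRM) pattern in a ROS 2 node.
# An assumed visual-modality node applies the Visual Prompt to camera frames
# and publishes a natural-language description; this node applies the LLM
# Prompt to that description and publishes a vocal notification.
import rclpy
from rclpy.node import Node
from std_msgs.msg import String

# Illustrative prompts, not the ones shipped with ROSGPT_Vision.
VISUAL_PROMPT = "Describe the driver's attention and current activity."
LLM_PROMPT = (
    "You receive a description of the driver. If the driver appears "
    "distracted, reply with a short spoken warning; otherwise reply 'OK'."
)


class TaskModalityNode(Node):
    """Mediates between the visual description and the robotic reaction."""

    def __init__(self):
        super().__init__("task_modality_node")
        # Natural-language scene descriptions from the (assumed) visual modality node.
        self.create_subscription(String, "/visual_description", self.on_description, 10)
        # Reaction decided by the LLM, e.g. a warning to be spoken to the driver.
        self.alert_pub = self.create_publisher(String, "/vocal_notification", 10)

    def on_description(self, msg: String) -> None:
        reaction = self.query_llm(LLM_PROMPT, msg.data)
        if reaction != "OK":
            self.alert_pub.publish(String(data=reaction))

    def query_llm(self, system_prompt: str, description: str) -> str:
        # Placeholder for the LLM call; any chat-completion client could be
        # wired in here. The keyword check below only stands in for it.
        if "distracted" in description.lower():
            return "Please keep your eyes on the road."
        return "OK"


def main():
    rclpy.init()
    rclpy.spin(TaskModalityNode())


if __name__ == "__main__":
    main()
```

In the framework described by the abstract, both prompts are supplied to the generic vision module and the two ROS nodes, which automate the mechanisms sketched manually above.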
Related papers
- $π_0$: A Vision-Language-Action Flow Model for General Robot Control [77.32743739202543]
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge.
We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people, and its ability to acquire new skills via fine-tuning.
arXiv Detail & Related papers (2024-10-31T17:22:30Z) - ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning [74.58666091522198]
We present a framework for intuitive robot programming by non-experts.
We leverage natural language prompts and contextual information from the Robot Operating System (ROS).
Our system integrates large language models (LLMs), enabling non-experts to articulate task requirements to the system through a chat interface.
arXiv Detail & Related papers (2024-06-28T08:28:38Z) - LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments yield strong performance, demonstrating that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z) - RoboScript: Code Generation for Free-Form Manipulation Tasks across Real
and Simulation [77.41969287400977]
This paper presents RobotScript, a platform for a deployable robot manipulation pipeline powered by code generation.
We also present a benchmark for code generation for robot manipulation tasks expressed in free-form natural language.
We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms.
arXiv Detail & Related papers (2024-02-22T15:12:00Z) - QUAR-VLA: Vision-Language-Action Model for Quadruped Robots [37.952398683031895]
The central idea is to elevate the overall intelligence of the robot.
We propose QUAdruped Robotic Transformer (QUART), a family of VLA models to integrate visual information and instructions from diverse modalities as input.
Our approach leads to performant robotic policies and enables QUART to obtain a range of emergent capabilities.
arXiv Detail & Related papers (2023-12-22T06:15:03Z) - WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model [92.90127398282209]
This paper investigates the potential of integrating the most recent Large Language Models (LLMs) with existing visual grounding and robotic grasping systems.
We introduce the WALL-E (Embodied Robotic WAiter load lifting with Large Language model) as an example of this integration.
We deploy this LLM-empowered system on the physical robot to provide a more user-friendly interface for the instruction-guided grasping task.
arXiv Detail & Related papers (2023-08-30T11:35:21Z) - RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic
Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)