AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot
Manipulation
- URL: http://arxiv.org/abs/2305.18898v1
- Date: Tue, 30 May 2023 09:54:20 GMT
- Title: AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot
Manipulation
- Authors: Chuhao Jin, Wenhui Tan, Jiange Yang, Bei Liu, Ruihua Song, Limin Wang,
Jianlong Fu
- Abstract summary: We propose a novel framework for learning high-level cognitive capabilities in robot manipulation tasks.
The resulting dataset, AlphaBlock, consists of 35 comprehensive high-level tasks with multi-step text plans and paired observation sequences.
- Score: 50.737355245505334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel framework for learning high-level cognitive capabilities
in robot manipulation tasks, such as making a smiley face using building
blocks. These tasks often involve complex multi-step reasoning, presenting
significant challenges due to the limited paired data connecting human
instructions (e.g., making a smiley face) and robot actions (e.g., end-effector
movement). Existing approaches address this challenge with an open-loop
paradigm that decomposes high-level instructions into simple sub-task plans
and executes them step by step with low-level control models. However, these
approaches lack instant observations during multi-step reasoning, which leads
to sub-optimal results. To address this issue, we propose to automatically
collect a cognitive robot dataset using Large Language Models (LLMs). The
resulting dataset, AlphaBlock, consists of 35 comprehensive high-level tasks
with multi-step text plans and paired observation sequences. To enable
efficient data acquisition, we employ carefully designed multi-round prompts
that substantially reduce the need for extensive human involvement. We further
propose a closed-loop multi-modal embodied planning model that
autoregressively generates plans by taking image observations as input. To
facilitate effective learning, we leverage MiniGPT-4 with a frozen visual
encoder and LLM, and finetune an additional vision adapter and Q-Former to
enable the fine-grained spatial perception required for manipulation tasks.
Experiments verify the superiority of our approach over existing open- and
closed-loop methods, with success-rate gains of 21.4% and 14.5% over ChatGPT-
and GPT-4-based robot planners, respectively. Real-world demos are shown at
https://www.youtube.com/watch?v=ayAzID1_qQk .
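For concreteness, here is a minimal Python sketch of the closed-loop planning
loop described above (observe, generate the next text sub-plan, execute,
repeat). It is an illustrative assumption only: every name in it
(capture_observation, planner_generate_step, controller_execute, the "<done>"
stop token) is a hypothetical placeholder, not the authors' released code.

```python
"""Minimal sketch of a closed-loop embodied planner (hypothetical placeholders,
not the authors' released implementation)."""

from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class PlanState:
    """Running context for one task: the instruction and the steps executed so far."""
    instruction: str                     # e.g. "make a smiley face using blocks"
    executed_steps: List[str] = field(default_factory=list)


def capture_observation() -> Any:
    """Stub: return the current camera image of the tabletop workspace."""
    raise NotImplementedError


def planner_generate_step(image: Any, state: PlanState) -> str:
    """Stub: a multi-modal planner (frozen visual encoder and LLM, with a
    trainable vision adapter and Q-Former) autoregressively generates the
    next text sub-plan, conditioned on the image and the plan history."""
    raise NotImplementedError


def controller_execute(step: str) -> None:
    """Stub: a low-level control model maps a text sub-plan to end-effector movements."""
    raise NotImplementedError


def run_closed_loop(instruction: str, max_steps: int = 20) -> PlanState:
    """Closed-loop execution: re-observe the scene before every new sub-plan,
    unlike open-loop pipelines that decompose the whole task up front."""
    state = PlanState(instruction)
    for _ in range(max_steps):
        image = capture_observation()              # instant observation of the scene
        step = planner_generate_step(image, state)
        if step == "<done>":                       # hypothetical termination token
            break
        controller_execute(step)
        state.executed_steps.append(step)
    return state
```

Conditioning each new sub-plan on a fresh observation is what distinguishes the
closed-loop formulation from the open-loop baselines discussed in the abstract.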
Related papers
- COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models [49.24666980374751]
COHERENT is a novel LLM-based task planning framework for collaboration of heterogeneous multi-robot systems.
A Proposal-Execution-Feedback-Adjustment mechanism is designed to decompose and assign actions to individual robots.
The experimental results show that our work surpasses the previous methods by a large margin in terms of success rate and execution efficiency.
arXiv Detail & Related papers (2024-09-23T15:53:41Z)
- Affordance-Guided Reinforcement Learning via Visual Prompting [51.361977466993345]
Keypoint-based Affordance Guidance for Improvements (KAGI) is a method leveraging rewards shaped by vision-language models (VLMs) for autonomous RL.
On real-world manipulation tasks specified by natural language descriptions, KAGI improves the sample efficiency of autonomous RL and enables successful task completion in 20K online fine-tuning steps.
arXiv Detail & Related papers (2024-07-14T21:41:29Z)
- MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world.
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
- QUAR-VLA: Vision-Language-Action Model for Quadruped Robots [37.952398683031895]
The central idea is to elevate the overall intelligence of the robot.
We propose QUAdruped Robotic Transformer (QUART), a family of VLA models to integrate visual information and instructions from diverse modalities as input.
Our approach leads to performant robotic policies and enables QUART to obtain a range of emergent capabilities.
arXiv Detail & Related papers (2023-12-22T06:15:03Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z)
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
arXiv Detail & Related papers (2023-07-12T07:40:48Z)
- Chat with the Environment: Interactive Multimodal Perception Using Large Language Models [19.623070762485494]
Large Language Models (LLMs) have shown remarkable reasoning ability in few-shot robotic planning.
Our study demonstrates that LLMs can provide high-level planning and reasoning skills and control interactive robot behavior in a multimodal environment.
arXiv Detail & Related papers (2023-03-14T23:01:27Z)