NaturalVLM: Leveraging Fine-grained Natural Language for
Affordance-Guided Visual Manipulation
- URL: http://arxiv.org/abs/2403.08355v1
- Date: Wed, 13 Mar 2024 09:12:16 GMT
- Title: NaturalVLM: Leveraging Fine-grained Natural Language for
Affordance-Guided Visual Manipulation
- Authors: Ran Xu, Yan Shen, Xiaoqi Li, Ruihai Wu, Hao Dong
- Abstract summary: Many real-world tasks demand intricate multi-step reasoning.
We introduce a benchmark, NrVLM, comprising 15 distinct manipulation tasks.
We propose a novel learning framework that completes the manipulation task step-by-step according to the fine-grained instructions.
- Score: 21.02437461550044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Enabling home-assistant robots to perceive and manipulate a diverse range of
3D objects based on human language instructions is a pivotal challenge. Prior
research has predominantly focused on simplistic and task-oriented
instructions, e.g., "Slide the top drawer open". However, many real-world tasks
demand intricate multi-step reasoning, and without step-by-step human
instructions such tasks become extremely difficult for robots to carry out. To address these
challenges, we introduce a comprehensive benchmark, NrVLM, comprising 15
distinct manipulation tasks, containing over 4500 episodes meticulously
annotated with fine-grained language instructions. We split the long-term task
process into several steps, with each step having a natural language
instruction. Moreover, we propose a novel learning framework that completes the
manipulation task step-by-step according to the fine-grained instructions.
Specifically, we first identify the instruction to execute, taking into account
visual observations and the end-effector's current state. Subsequently, our
approach facilitates explicit learning through action-prompts and
perception-prompts to promote manipulation-aware cross-modality alignment.
Leveraging both visual observations and linguistic guidance, our model outputs
a sequence of actionable predictions for manipulation, including contact points
and end-effector poses. We evaluate our method and baselines using the proposed
benchmark NrVLM. The experimental results demonstrate the effectiveness of our
approach. For additional details, please refer to
https://sites.google.com/view/naturalvlm.
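The abstract only outlines the framework, but the control flow it implies can be sketched roughly as follows; every name and signature below is a hypothetical placeholder for illustration, not the authors' released code:

```python
# Illustrative control loop for the step-by-step, instruction-guided pipeline
# sketched in the abstract. All names and signatures are hypothetical
# placeholders, not the NrVLM implementation.
from dataclasses import dataclass
import numpy as np

@dataclass
class ActionPrediction:
    contact_point: np.ndarray       # (3,) point on the object to contact
    end_effector_pose: np.ndarray   # (7,) target position + quaternion

def select_instruction(instructions, observation, ee_state):
    """Stage 1: identify which fine-grained instruction to execute next,
    based on the current visual observation and end-effector state."""
    raise NotImplementedError  # stands in for the instruction-selection module

def predict_action(instruction, observation, ee_state) -> ActionPrediction:
    """Stage 2: fuse visual and linguistic inputs (the paper aligns the two
    modalities via action-prompts and perception-prompts) into an actionable
    prediction: a contact point and a target end-effector pose."""
    raise NotImplementedError  # stands in for the manipulation module

def run_episode(env, instructions, max_steps=50):
    observation, ee_state = env.reset()
    for _ in range(max_steps):
        step = select_instruction(instructions, observation, ee_state)
        if step is None:                     # every instruction completed
            break
        action = predict_action(step, observation, ee_state)
        observation, ee_state = env.step(action)
```

The design choice reflected here, as described in the abstract, is that instruction selection and action prediction are separate stages, so the model re-grounds itself in the current observation and end-effector state before each fine-grained step.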
Related papers
- Natural Language as Policies: Reasoning for Coordinate-Level Embodied Control with LLMs [7.746160514029531]
We demonstrate experimental results with LLMs that address robotics task planning problems.
Our approach acquires text descriptions of the task and scene objects, then formulates task planning through natural language reasoning.
Our approach is evaluated on a multi-modal prompt simulation benchmark.
arXiv Detail & Related papers (2024-03-20T17:58:12Z)
- Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
Such tasks challenge a robot's ability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z)
- $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models [89.64729024399634]
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions.
Such instructions typically have complex grammatical structures and often contain varied action descriptions.
How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging.
arXiv Detail & Related papers (2023-08-15T19:01:19Z)
- Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model [63.66204449776262]
Instruct2Act is a framework that maps multi-modal instructions to sequential actions for robotic manipulation tasks.
Our approach is adjustable and flexible in accommodating various instruction modalities and input types.
Our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks.
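As a generic illustration of mapping an instruction to a sequence of primitive actions through a language model (the primitives and prompt format below are invented for this sketch, not Instruct2Act's actual interface):

```python
# Toy illustration of instruction-to-action mapping with a language model.
# The primitive set and parsing scheme are assumptions for this sketch.
PRIMITIVES = {"pick", "place", "push", "open"}

PROMPT_TEMPLATE = (
    "You control a robot arm. Available primitives: pick(obj), "
    "place(obj, target), push(obj, direction), open(obj).\n"
    "Instruction: {instruction}\n"
    "Respond with one primitive call per line."
)

def plan_actions(instruction: str, llm) -> list[str]:
    """Query a language model (any callable str -> str) and keep only
    lines that invoke a known primitive."""
    response = llm(PROMPT_TEMPLATE.format(instruction=instruction))
    actions = []
    for line in response.splitlines():
        name = line.split("(", 1)[0].strip()
        if name in PRIMITIVES:
            actions.append(line.strip())
    return actions
```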
arXiv Detail & Related papers (2023-05-18T17:59:49Z)
- Using Both Demonstrations and Language Instructions to Efficiently Learn Robotic Tasks [21.65346551790888]
DeL-TaCo is a method for conditioning a robotic policy on task embeddings comprised of two components: a visual demonstration and a language instruction.
To our knowledge, this is the first work to show that simultaneously conditioning a multi-task robotic manipulation policy on both demonstration and language embeddings improves sample efficiency and generalization over conditioning on either modality alone.
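A minimal sketch of this dual conditioning, assuming a simple concatenation of the two embeddings (the fusion scheme and dimensions are illustrative, not DeL-TaCo's exact architecture):

```python
import torch
import torch.nn as nn

class DualConditionedPolicy(nn.Module):
    """Toy policy conditioned on a task embedding built from both a visual
    demonstration embedding and a language-instruction embedding."""
    def __init__(self, obs_dim=64, demo_dim=128, lang_dim=128, action_dim=7):
        super().__init__()
        self.task_proj = nn.Linear(demo_dim + lang_dim, 128)
        self.policy = nn.Sequential(
            nn.Linear(obs_dim + 128, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs, demo_emb, lang_emb):
        # Fuse the two task specifications; concatenation is an assumption here.
        task = torch.relu(self.task_proj(torch.cat([demo_emb, lang_emb], dim=-1)))
        return self.policy(torch.cat([obs, task], dim=-1))
```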
arXiv Detail & Related papers (2022-10-10T08:06:58Z)
- Instruction-driven history-aware policies for robotic manipulations [82.25511767738224]
We propose a unified transformer-based approach that takes into account multiple inputs.
In particular, our transformer architecture integrates (i) natural language instructions and (ii) multi-view scene observations.
We evaluate our method on the challenging RLBench benchmark and on a real-world robot.
arXiv Detail & Related papers (2022-09-11T16:28:25Z)
- Chain of Thought Imitation with Procedure Cloning [129.62135987416164]
We propose procedure cloning, which applies supervised sequence prediction to imitate the series of expert computations.
We show that imitating the intermediate computations of an expert's behavior enables procedure cloning to learn policies exhibiting significant generalization to unseen environment configurations.
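A toy sketch of the idea, assuming the expert's intermediate computations are tokenized into a discrete step vocabulary (the architecture below is illustrative, not the paper's model):

```python
import torch
import torch.nn as nn

class ProcedureCloner(nn.Module):
    """Toy model that predicts the expert's intermediate computation steps
    (with teacher forcing) before emitting the final action, instead of
    regressing the action directly as in plain behavioral cloning."""
    def __init__(self, state_dim, step_vocab, action_vocab, hidden=256):
        super().__init__()
        self.encoder = nn.Linear(state_dim, hidden)
        self.step_rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.step_head = nn.Linear(hidden, step_vocab)
        self.action_head = nn.Linear(hidden, action_vocab)

    def forward(self, state, step_embs):
        # state: (B, state_dim); step_embs: (B, T, hidden) embedded expert steps
        h0 = torch.tanh(self.encoder(state)).unsqueeze(0)   # (1, B, hidden)
        out, hN = self.step_rnn(step_embs, h0)
        step_logits = self.step_head(out)                    # supervise each step
        action_logits = self.action_head(hN.squeeze(0))      # supervise final action
        return step_logits, action_logits
```

Training would apply a cross-entropy loss to both the per-step predictions and the final action, so the policy is rewarded for reproducing the expert's procedure, not just its output.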
arXiv Detail & Related papers (2022-05-22T13:14:09Z)
- Visual-and-Language Navigation: A Survey and Taxonomy [1.0742675209112622]
This paper provides a comprehensive survey on Visual-and-Language Navigation (VLN) tasks.
According to when the instructions are given, the tasks can be divided into single-turn and multi-turn.
This taxonomy enables researchers to better grasp the key points of a specific task and identify directions for future research.
arXiv Detail & Related papers (2021-08-26T01:51:18Z)
- Episodic Transformer for Vision-and-Language Navigation [142.6236659368177]
This paper focuses on addressing two challenges: handling long sequence of subtasks, and understanding complex human instructions.
We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.
Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
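A rough sketch of this encoding scheme, with a single joint transformer over language tokens and the episode history (layer sizes and the single-encoder layout are assumptions, not E.T.'s exact architecture):

```python
import torch
import torch.nn as nn

class EpisodicEncoder(nn.Module):
    """Toy multimodal encoder: concatenates language-token embeddings with
    embeddings of the full history of observations and actions, then applies
    a transformer encoder over the joint sequence."""
    def __init__(self, d_model=256, n_heads=4, n_layers=2,
                 vocab=1000, n_actions=12, obs_dim=512):
        super().__init__()
        self.lang_emb = nn.Embedding(vocab, d_model)
        self.obs_proj = nn.Linear(obs_dim, d_model)
        self.act_emb = nn.Embedding(n_actions, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, lang_tokens, obs_feats, past_actions):
        # (B, L), (B, T, obs_dim), (B, T) -> joint sequence (B, L + 2T, d_model)
        seq = torch.cat([self.lang_emb(lang_tokens),
                         self.obs_proj(obs_feats),
                         self.act_emb(past_actions)], dim=1)
        h = self.encoder(seq)
        return self.action_head(h[:, -1])  # predict the next action
```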
arXiv Detail & Related papers (2021-05-13T17:51:46Z)
- Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning [32.82030512053361]
We propose the use of step-by-step human demonstrations in the form of natural language instructions and action trajectories.
We find that human demonstrations help solve the most complex tasks.
We also find that incorporating natural language allows the model to generalize to unseen tasks in a zero-shot setting.
arXiv Detail & Related papers (2020-11-01T14:39:46Z)