GROOT-2: Weakly Supervised Multi-Modal Instruction Following Agents
- URL: http://arxiv.org/abs/2412.10410v1
- Date: Sat, 07 Dec 2024 05:47:49 GMT
- Title: GROOT-2: Weakly Supervised Multi-Modal Instruction Following Agents
- Authors: Shaofei Cai, Bowei Zhang, Zihao Wang, Haowei Lin, Xiaojian Ma, Anji Liu, Yitao Liang
- Abstract summary: We introduce GROOT-2, a multimodal agent trained using a novel approach that combines weak supervision with latent variable models. GROOT-2's effectiveness is validated across four diverse environments, ranging from video games to robotic manipulation.
- Score: 25.195426389757355
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Developing agents that can follow multimodal instructions remains a fundamental challenge in robotics and AI. Although large-scale pre-training on unlabeled datasets (no language instruction) has enabled agents to learn diverse behaviors, these agents often struggle with following instructions. While augmenting the dataset with instruction labels can mitigate this issue, acquiring such high-quality annotations at scale is impractical. To address this issue, we frame the problem as a semi-supervised learning task and introduce GROOT-2, a multimodal instructable agent trained using a novel approach that combines weak supervision with latent variable models. Our method consists of two key components: constrained self-imitating, which utilizes large amounts of unlabeled demonstrations to enable the policy to learn diverse behaviors, and human intention alignment, which uses a smaller set of labeled demonstrations to ensure the latent space reflects human intentions. GROOT-2's effectiveness is validated across four diverse environments, ranging from video games to robotic manipulation, demonstrating its robust multimodal instruction-following capabilities.
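To make the two-part recipe in the abstract concrete, here is a minimal, hedged sketch of a weakly supervised training step for a latent-variable instruction-following policy. Everything below (module names, architectures, loss forms, and weights) is an illustrative assumption based only on the abstract; it is not the authors' implementation.

```python
# Hedged sketch (not the authors' code) of the two objectives in the abstract:
# constrained self-imitation on unlabeled demonstrations plus human intention
# alignment on a small labeled set. Architectures, dimensions, and loss weights
# are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, ACT_DIM, TEXT_DIM, LATENT_DIM = 32, 8, 64, 16

class TrajEncoder(nn.Module):
    """Encodes a demonstration into a Gaussian posterior over a latent 'intention' z."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(OBS_DIM, 64, batch_first=True)
        self.mu = nn.Linear(64, LATENT_DIM)
        self.logvar = nn.Linear(64, LATENT_DIM)

    def forward(self, obs):                       # obs: (B, T, OBS_DIM)
        _, h = self.rnn(obs)
        h = h.squeeze(0)
        return self.mu(h), self.logvar(h)

class LatentPolicy(nn.Module):
    """Predicts actions conditioned on observations and the latent intention z."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + LATENT_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, ACT_DIM))

    def forward(self, obs, z):                    # obs: (B, T, OBS_DIM), z: (B, LATENT_DIM)
        z = z.unsqueeze(1).expand(-1, obs.size(1), -1)
        return self.net(torch.cat([obs, z], dim=-1))

traj_enc, text_enc, policy = TrajEncoder(), nn.Linear(TEXT_DIM, LATENT_DIM), LatentPolicy()

def training_step(unlabeled, labeled, beta=0.01, lam=1.0):
    # 1) Constrained self-imitation: encode the unlabeled demonstration itself as the
    #    "instruction" latent, imitate its actions, and constrain the posterior toward
    #    a standard-normal prior (a KL term) so the latent space stays well-behaved.
    obs, act = unlabeled
    mu, logvar = traj_enc(obs)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()      # reparameterization
    bc = F.mse_loss(policy(obs, z), act)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()

    # 2) Human intention alignment: on labeled pairs, pull the instruction embedding
    #    toward the demonstration's latent, and imitate actions from that embedding.
    l_obs, l_act, text_emb = labeled
    z_demo, _ = traj_enc(l_obs)
    z_text = text_enc(text_emb)
    align = F.mse_loss(z_text, z_demo.detach())
    labeled_bc = F.mse_loss(policy(l_obs, z_text), l_act)

    return bc + beta * kl + labeled_bc + lam * align

# Toy usage with random tensors standing in for real demonstrations.
B, T = 4, 10
loss = training_step((torch.randn(B, T, OBS_DIM), torch.randn(B, T, ACT_DIM)),
                     (torch.randn(B, T, OBS_DIM), torch.randn(B, T, ACT_DIM),
                      torch.randn(B, TEXT_DIM)))
loss.backward()
```

The point being illustrated: the large unlabeled set supervises behavior through self-imitation in latent space, while the much smaller labeled set only has to shape that latent space so language instructions land near the demonstrations they describe.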
Related papers
- Can Language Models Follow Multiple Turns of Entangled Instructions? [4.44881011141635]
Real-world scenarios require consistency across multiple instructions over time, such as keeping secrets private, respecting personal preferences, and maintaining prioritization.
This work presents a systematic investigation of large language models' capabilities in handling multiple turns of instructions.
We construct MultiTurnInstruct with around 1.1K high-quality multi-turn conversations through a human-in-the-loop approach.
arXiv Detail & Related papers (2025-03-17T14:31:37Z) - VTAO-BiManip: Masked Visual-Tactile-Action Pre-training with Object Understanding for Bimanual Dexterous Manipulation [8.882764358932276]
Bimanual dexterous manipulation remains a significant challenge in robotics due to the high DoFs of each hand and the need to coordinate them.
Existing single-hand manipulation techniques often leverage human demonstrations to guide RL methods but fail to generalize to complex bimanual tasks involving multiple sub-skills.
We introduce VTAO-BiManip, a novel framework that combines visual-tactile-action pretraining with object understanding to facilitate curriculum RL and enable human-like bimanual manipulation.
arXiv Detail & Related papers (2025-01-07T08:14:53Z) - LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments demonstrate strong performance, with LLARVA comparing favorably to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z) - Large Language Models for Orchestrating Bimanual Robots [19.60907949776435]
We present LAnguage-model-based Bimanual ORchestration (LABOR) to analyze task configurations and devise coordination control policies.
We evaluate our method through simulated experiments involving two classes of long-horizon tasks using the NICOL humanoid robot.
arXiv Detail & Related papers (2024-04-02T15:08:35Z) - Multi-task real-robot data with gaze attention for dual-arm fine manipulation [4.717749411286867]
This paper introduces a dataset of diverse object manipulations that includes dual-arm tasks and/or tasks requiring fine manipulation.
We have generated a dataset with 224k episodes (150 hours, 1,104 language instructions), which includes dual-arm fine tasks such as bowl-moving, pencil-case opening, and banana-peeling.
The dataset includes visual attention signals, dual-action labels (a signal that separates actions into a robust reaching trajectory and precise interaction with objects), and language instructions to achieve robust and precise object manipulation.
arXiv Detail & Related papers (2024-01-15T11:20:34Z) - Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z) - Skill Disentanglement for Imitation Learning from Suboptimal Demonstrations [60.241144377865716]
We consider the imitation of sub-optimal demonstrations, with both a small clean demonstration set and a large noisy set.
We propose a method that evaluates and imitates at the sub-demonstration level, encoding action primitives of varying quality into different skills.
arXiv Detail & Related papers (2023-06-13T17:24:37Z) - Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model [63.66204449776262]
Instruct2Act is a framework that maps multi-modal instructions to sequential actions for robotic manipulation tasks.
Our approach is adjustable and flexible in accommodating various instruction modalities and input types.
Our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks.
arXiv Detail & Related papers (2023-05-18T17:59:49Z) - Learning Robot Manipulation from Cross-Morphology Demonstration [0.9615284569035419]
Some Learning from Demonstrations (LfD) methods handle small mismatches in the action spaces of the teacher and student.
Here we address the case where the teacher's morphology is substantially different from that of the student.
Our framework, Morphological Adaptation in Imitation Learning (MAIL), bridges this gap, allowing us to train an agent from demonstrations by other agents with significantly different morphologies.
arXiv Detail & Related papers (2023-04-07T20:21:47Z) - CLAS: Coordinating Multi-Robot Manipulation with Central Latent Action Spaces [9.578169216444813]
This paper proposes an approach to coordinating multi-robot manipulation through learned latent action spaces that are shared across different agents.
We validate our method in simulated multi-robot manipulation tasks and demonstrate improvement over previous baselines in terms of sample efficiency and learning performance.
arXiv Detail & Related papers (2022-11-28T23:20:47Z) - Learning Transferable Adversarial Robust Representations via Multi-view Consistency [57.73073964318167]
We propose a novel meta-adversarial multi-view representation learning framework with dual encoders.
We demonstrate the effectiveness of our framework on few-shot learning tasks from unseen domains.
arXiv Detail & Related papers (2022-10-19T11:48:01Z) - Using Both Demonstrations and Language Instructions to Efficiently Learn Robotic Tasks [21.65346551790888]
DeL-TaCo is a method for conditioning a robotic policy on task embeddings composed of two components: a visual demonstration and a language instruction.
To our knowledge, this is the first work to show that simultaneously conditioning a multi-task robotic manipulation policy on both demonstration and language embeddings improves sample efficiency and generalization over conditioning on either modality alone (a minimal sketch of such dual conditioning appears after this list).
arXiv Detail & Related papers (2022-10-10T08:06:58Z) - Learning Neuro-Symbolic Skills for Bilevel Planning [63.388694268198655]
Decision-making is challenging in robotics environments with continuous object-centric states, continuous actions, long horizons, and sparse feedback.
Hierarchical approaches, such as task and motion planning (TAMP), address these challenges by decomposing decision-making into two or more levels of abstraction.
Our main contribution is a method for learning parameterized policies in combination with operators and samplers.
arXiv Detail & Related papers (2022-06-21T19:01:19Z) - Learning Multi-Arm Manipulation Through Collaborative Teleoperation [63.35924708783826]
Imitation Learning (IL) is a powerful paradigm to teach robots to perform manipulation tasks.
Many real-world tasks require multiple arms, such as lifting a heavy object or assembling a desk.
We present Multi-Arm RoboTurk (MART), a multi-user data collection platform that allows multiple remote users to simultaneously teleoperate a set of robotic arms.
arXiv Detail & Related papers (2020-12-12T05:43:43Z)
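The DeL-TaCo entry above mentions conditioning a single manipulation policy on both a demonstration embedding and a language embedding, a pattern echoed by GROOT-2's multimodal instruction following. Below is a minimal, hedged sketch of that dual-conditioning pattern; the class name, dimensions, and averaging-based fusion are illustrative assumptions, not the architecture of any paper listed here.

```python
# Hypothetical dual-conditioning sketch: a policy that accepts a demonstration
# embedding, a language embedding, or both, and fuses them into one task vector.
import torch
import torch.nn as nn

class DualConditionedPolicy(nn.Module):
    def __init__(self, obs_dim=64, demo_dim=128, lang_dim=128, task_dim=64, act_dim=7):
        super().__init__()
        self.demo_proj = nn.Linear(demo_dim, task_dim)   # project demo embedding to task space
        self.lang_proj = nn.Linear(lang_dim, task_dim)   # project language embedding to task space
        self.head = nn.Sequential(
            nn.Linear(obs_dim + task_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs, demo_emb=None, lang_emb=None):
        # Fuse whichever conditioning signals are available; averaging the two
        # projections is just one simple choice of fusion.
        parts = []
        if demo_emb is not None:
            parts.append(self.demo_proj(demo_emb))
        if lang_emb is not None:
            parts.append(self.lang_proj(lang_emb))
        assert parts, "need at least one conditioning signal"
        task = torch.stack(parts).mean(0)
        return self.head(torch.cat([obs, task], dim=-1))

# Usage with random placeholder embeddings standing in for real encoders.
policy = DualConditionedPolicy()
action = policy(torch.randn(1, 64),
                demo_emb=torch.randn(1, 128),
                lang_emb=torch.randn(1, 128))
```

Because either conditioning input may be omitted at inference time, the same policy can be driven by a demonstration alone, an instruction alone, or both together.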
This list is automatically generated from the titles and abstracts of the papers in this site.