Breaking Down the Task: A Unit-Grained Hybrid Training Framework for
Vision and Language Decision Making
- URL: http://arxiv.org/abs/2307.08016v1
- Date: Sun, 16 Jul 2023 11:54:16 GMT
- Title: Breaking Down the Task: A Unit-Grained Hybrid Training Framework for
Vision and Language Decision Making
- Authors: Ruipu Luo, Jiwen Zhang, Zhongyu Wei
- Abstract summary: Vision language decision making (VLDM) is a challenging multimodal task.
From an environment perspective, we find that task episodes can be divided into fine-grained units.
We propose a novel hybrid-training framework that enables active exploration in the environment and reduces exposure bias.
- Score: 19.87916700767421
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision language decision making (VLDM) is a challenging multimodal task. The
agent has to understand complex human instructions and complete compositional
tasks involving environment navigation and object manipulation. However, the
long action sequences involved in VLDM make the task difficult to learn. From
an environment perspective, we find that task episodes can be divided into
fine-grained units, each containing a navigation phase and an interaction
phase. Since the environment within a unit stays unchanged, we propose a novel
hybrid-training framework that enables active exploration in the environment
and reduces exposure bias. The framework leverages the unit-grained
configurations and is model-agnostic. Specifically, we design a
Unit-Transformer (UT) with an intrinsic recurrent state that maintains a
unit-scale cross-modal memory. Through extensive experiments on the TEACh
benchmark, we demonstrate that our proposed framework outperforms existing
state-of-the-art methods on all evaluation metrics. Overall, our work
introduces a novel approach to tackling the VLDM task by breaking it down into
smaller, manageable units and utilizing a hybrid-training framework. By doing
so, we provide a more flexible and effective solution for multimodal decision
making.
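The abstract does not include implementation details, but the core idea of a Unit-Transformer with an intrinsic recurrent state can be sketched minimally: a per-unit cross-modal encoder whose only state carried between units is a recurrent memory vector. All class names, dimensions, and the toy data below are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class UnitTransformer(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=2, num_actions=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Intrinsic recurrent state: a unit-scale memory carried across units.
        self.memory_cell = nn.GRUCell(d_model, d_model)
        self.action_head = nn.Linear(d_model, num_actions)

    def forward(self, vision_tokens, text_tokens, memory):
        # Fuse both modalities together with the running memory token.
        tokens = torch.cat([memory.unsqueeze(1), vision_tokens, text_tokens], dim=1)
        pooled = self.encoder(tokens).mean(dim=1)
        new_memory = self.memory_cell(pooled, memory)  # update unit-scale memory
        return self.action_head(pooled), new_memory

# One episode = a sequence of units (a navigation phase plus an interaction
# phase); the recurrent memory is the only state passed between units.
model = UnitTransformer()
memory = torch.zeros(1, 256)
episode_units = [(torch.randn(1, 5, 256), torch.randn(1, 7, 256))
                 for _ in range(3)]  # toy per-unit vision/text features
for vision_tokens, text_tokens in episode_units:
    action_logits, memory = model(vision_tokens, text_tokens, memory)
```

Under the hybrid-training scheme described in the abstract, such a model could mix teacher forcing on expert unit trajectories with on-policy exploration inside a unit, which is one way the mismatch behind exposure bias can be reduced.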
Related papers
- DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control [53.80518003412016]
Building a general-purpose intelligent home-assistant agent skilled in diverse tasks specified by human commands is a long-term blueprint of embodied AI research.
We study primitive mobile manipulations for embodied agents, i.e., how to navigate and interact based on an instructed verb-noun pair.
We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls.
arXiv Detail & Related papers (2024-07-20T05:39:28Z)
- TDAG: A Multi-Agent Framework based on Dynamic Task Decomposition and Agent Generation [45.028795422801764]
We propose a multi-agent framework based on dynamic Task Decomposition and Agent Generation (TDAG).
This framework dynamically decomposes complex tasks into smaller subtasks and assigns each to a specifically generated subagent.
ItineraryBench is designed to assess agents' abilities in memory, planning, and tool usage across tasks of varying complexity.
arXiv Detail & Related papers (2024-02-15T18:27:37Z)
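The TDAG entry above combines dynamic decomposition with per-subtask agent generation. A minimal sketch of that control flow, with all helper names hypothetical and the LLM calls stubbed out, might look like this:

```python
from dataclasses import dataclass

@dataclass
class Subagent:
    role: str

    def solve(self, subtask: str) -> str:
        # Stub: a generated subagent would carry a specialized prompt/toolset.
        return f"[{self.role}] solved: {subtask}"

def decompose(task: str) -> list:
    # Stub for the LLM call that splits the task; TDAG decomposes
    # dynamically, i.e. it can re-plan as intermediate results arrive.
    return [f"{task} / step {i}" for i in range(1, 4)]

def run(task: str) -> list:
    results = []
    for subtask in decompose(task):
        # A subagent is generated specifically for each subtask.
        agent = Subagent(role=f"specialist: {subtask}")
        results.append(agent.solve(subtask))
    return results

print(run("plan a multi-city trip"))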
- Intrinsic Language-Guided Exploration for Complex Long-Horizon Robotic Manipulation Tasks [12.27904219271791]
Current reinforcement learning algorithms struggle in sparse and complex environments.
We propose the Intrinsically Guided Exploration from Large Language Models (IGE-LLMs) framework.
arXiv Detail & Related papers (2023-09-28T11:14:52Z)
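The IGE-LLMs entry above pairs a sparse extrinsic reward with LLM-derived intrinsic guidance. A hedged sketch of that reward shaping, with `llm_action_score` standing in for a real LLM query, could be:

```python
def llm_action_score(state_desc: str, action: str) -> float:
    # Stub: a real implementation would prompt an LLM to rate how promising
    # `action` looks in `state_desc`, normalized to [0, 1].
    return 0.5

def shaped_reward(extrinsic: float, state_desc: str, action: str,
                  weight: float = 0.1) -> float:
    # The intrinsic term guides exploration where extrinsic reward is sparse;
    # it is typically down-weighted or annealed as learning progresses.
    return extrinsic + weight * llm_action_score(state_desc, action)

print(shaped_reward(0.0, "gripper above drawer handle", "grasp handle"))
```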
- Unified Human-Scene Interaction via Prompted Chain-of-Contacts [61.87652569413429]
Human-Scene Interaction (HSI) is a vital component of fields like embodied AI and virtual reality.
This paper presents a unified HSI framework, UniHSI, which supports unified control of diverse interactions through language commands.
arXiv Detail & Related papers (2023-09-14T17:59:49Z)
- Efficient Skill Acquisition for Complex Manipulation Tasks in Obstructed Environments [18.348489257164356]
We propose a system for efficient skill acquisition that leverages an object-centric generative model (OCGM) for versatile goal identification.
OCGM enables one-shot target object identification and re-identification in new scenes, allowing motion planning (MP) to guide the robot to the target object while avoiding obstacles.
arXiv Detail & Related papers (2023-03-06T18:49:59Z)
- Automatic Goal Generation using Dynamical Distance Learning [5.797847756967884]
Reinforcement Learning (RL) agents can learn to solve complex sequential decision making tasks by interacting with the environment.
In the field of multi-goal RL, where agents are required to reach multiple goals to solve complex tasks, improving sample efficiency can be especially challenging.
We propose a method for automatic goal generation using a dynamical distance function (DDF) in a self-supervised fashion.
arXiv Detail & Related papers (2021-11-07T16:23:56Z)
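The dynamical distance idea above splits into two pieces: self-supervised training of a distance predictor on the agent's own trajectories, and goal selection at a target distance. All names, dimensions, and the toy data below are assumptions for illustration:

```python
import random
import torch
import torch.nn as nn

# Distance predictor over concatenated (state, goal) pairs; 4-D toy states.
ddf = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(ddf.parameters(), lr=1e-3)

def train_ddf(trajectory, steps=200):
    # Self-supervised labels: for states s_i, s_j on the same trajectory,
    # the dynamical distance is simply the timestep gap j - i.
    T = trajectory.shape[0]
    for _ in range(steps):
        i, j = sorted(random.sample(range(T), 2))
        pair = torch.cat([trajectory[i], trajectory[j]])
        loss = ((ddf(pair) - float(j - i)) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

def propose_goal(state, candidates, target_dist=10.0):
    # Choose the candidate whose predicted distance is closest to the target:
    # far enough to be informative, close enough to be reachable.
    dists = torch.stack([ddf(torch.cat([state, c])) for c in candidates])
    return candidates[(dists.squeeze() - target_dist).abs().argmin()]

traj = torch.randn(50, 4)          # toy trajectory of 4-D states
train_ddf(traj)
goal = propose_goal(traj[0], traj[10:40])
```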
- Multitask Adaptation by Retrospective Exploration with Learned World Models [77.34726150561087]
We propose a meta-learned addressing model called RAMa that provides training samples for the model-based RL (MBRL) agent, taken from task-agnostic storage.
The model is trained to maximize the agent's expected performance by selecting promising trajectories solving prior tasks from the storage.
arXiv Detail & Related papers (2021-10-25T20:02:57Z)
- Learning Multi-Objective Curricula for Deep Reinforcement Learning [55.27879754113767]
Various automatic curriculum learning (ACL) methods have been proposed to improve the sample efficiency and final performance of deep reinforcement learning (DRL).
In this paper, we propose a unified automatic curriculum learning framework to create multi-objective but coherent curricula.
In addition to existing hand-designed curricula paradigms, we further design a flexible memory mechanism to learn an abstract curriculum.
arXiv Detail & Related papers (2021-10-06T19:30:25Z)
- CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning [138.40338621974954]
CausalWorld is a benchmark for causal structure and transfer learning in a robotic manipulation environment.
Tasks consist of constructing 3D shapes from a given set of blocks - inspired by how children learn to build complex structures.
arXiv Detail & Related papers (2020-10-08T23:01:13Z)
- Meta Reinforcement Learning with Autonomous Inference of Subtask Dependencies [57.27944046925876]
We propose and address a novel few-shot RL problem, where a task is characterized by a subtask graph.
Instead of directly learning a meta-policy, we develop a Meta-learner with Subtask Graph Inference.
Our experimental results on two grid-world domains and StarCraft II environments show that the proposed method is able to accurately infer the latent task parameter.
arXiv Detail & Related papers (2020-01-01T17:34:00Z)