Read to Play (R2-Play): Decision Transformer with Multimodal Game Instruction
- URL: http://arxiv.org/abs/2402.04154v6
- Date: Wed, 5 Jun 2024 07:27:50 GMT
- Title: Read to Play (R2-Play): Decision Transformer with Multimodal Game Instruction
- Authors: Yonggang Jin, Ge Zhang, Hao Zhao, Tianyu Zheng, Jarvi Guo, Liuyu Xiang, Shawn Yue, Stephen W. Huang, Zhaofeng He, Jie Fu,
- Abstract summary: This paper explores enhanced forms of task guidance for agents, enabling them to comprehend gameplay instructions.
We construct a set of multimodal game instructions to incorporate instruction tuning into a decision transformer.
- Score: 22.31940101833938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developing a generalist agent is a longstanding objective in artificial intelligence. Previous efforts utilizing extensive offline datasets from various tasks demonstrate remarkable performance in multitasking scenarios within Reinforcement Learning. However, these works encounter challenges in extending their capabilities to new tasks. Recent approaches integrate textual guidance or visual trajectory into decision networks to provide task-specific contextual cues, representing a promising direction. However, it is observed that relying solely on textual guidance or visual trajectory is insufficient for accurately conveying the contextual information of tasks. This paper explores enhanced forms of task guidance for agents, enabling them to comprehend gameplay instructions, thereby facilitating a "read-to-play" capability. Drawing inspiration from the success of multimodal instruction tuning in visual tasks, we treat the visual-based RL task as a long-horizon vision task and construct a set of multimodal game instructions to incorporate instruction tuning into a decision transformer. Experimental results demonstrate that incorporating multimodal game instructions significantly enhances the decision transformer's multitasking and generalization capabilities.
Related papers
- Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z) - Multitask Vision-Language Prompt Tuning [103.5967011236282]
We propose multitask vision-language prompt tuning (MV)
MV incorporates cross-task knowledge into prompt tuning for vision-language models.
Results in 20 vision tasks demonstrate that the proposed approach outperforms all single-task baseline prompt tuning methods.
arXiv Detail & Related papers (2022-11-21T18:41:44Z) - Instruction-Following Agents with Multimodal Transformer [95.70039658112873]
We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments.
Our method consists of a multimodal transformer that encodes visual observations and language instructions.
We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings.
arXiv Detail & Related papers (2022-10-24T17:46:47Z) - Multimedia Generative Script Learning for Task Planning [58.73725388387305]
We propose a new task, Multimedia Generative Script Learning, to generate subsequent steps by tracking historical states in both text and vision modalities.
This task is challenging in three aspects: the multimedia challenge of capturing the visual states in images, the induction challenge of performing unseen tasks, and the diversity challenge of covering different information in individual steps.
Experiment results demonstrate that our approach significantly outperforms strong baselines.
arXiv Detail & Related papers (2022-08-25T19:04:28Z) - Fast Inference and Transfer of Compositional Task Structures for
Few-shot Task Generalization [101.72755769194677]
We formulate it as a few-shot reinforcement learning problem where a task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z) - Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering [43.07139534653485]
We present Answer-Me, a task-aware multi-task framework.
We pre-train a vision-language joint model, which is multi-task as well.
Results show state-of-the-art performance, zero-shot generalization, robustness to forgetting, and competitive single-task results.
arXiv Detail & Related papers (2022-05-02T14:53:13Z) - Visual-and-Language Navigation: A Survey and Taxonomy [1.0742675209112622]
This paper provides a comprehensive survey on Visual-and-Language Navigation (VLN) tasks.
According to when the instructions are given, the tasks can be divided into single-turn and multi-turn.
This taxonomy enable researchers to better grasp the key point of a specific task and identify directions for future research.
arXiv Detail & Related papers (2021-08-26T01:51:18Z) - Episodic Transformer for Vision-and-Language Navigation [142.6236659368177]
This paper focuses on addressing two challenges: handling long sequence of subtasks, and understanding complex human instructions.
We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.
Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
arXiv Detail & Related papers (2021-05-13T17:51:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.