Related papers: Read to Play (R2-Play): Decision Transformer with Multimodal Game Instruction

URL: http://arxiv.org/abs/2402.04154v6
Date: Wed, 5 Jun 2024 07:27:50 GMT
Title: Read to Play (R2-Play): Decision Transformer with Multimodal Game Instruction
Authors: Yonggang Jin, Ge Zhang, Hao Zhao, Tianyu Zheng, Jarvi Guo, Liuyu Xiang, Shawn Yue, Stephen W. Huang, Zhaofeng He, Jie Fu
Abstract summary: This paper explores enhanced forms of task guidance for agents, enabling them to comprehend gameplay instructions. We construct a set of multimodal game instructions to incorporate instruction tuning into a decision transformer.
Score: 22.31940101833938
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Developing a generalist agent is a longstanding objective in artificial intelligence. Previous efforts utilizing extensive offline datasets from various tasks demonstrate remarkable performance in multitasking scenarios within Reinforcement Learning. However, these works encounter challenges in extending their capabilities to new tasks. Recent approaches integrate textual guidance or visual trajectory into decision networks to provide task-specific contextual cues, representing a promising direction. However, it is observed that relying solely on textual guidance or visual trajectory is insufficient for accurately conveying the contextual information of tasks. This paper explores enhanced forms of task guidance for agents, enabling them to comprehend gameplay instructions, thereby facilitating a "read-to-play" capability. Drawing inspiration from the success of multimodal instruction tuning in visual tasks, we treat the visual-based RL task as a long-horizon vision task and construct a set of multimodal game instructions to incorporate instruction tuning into a decision transformer. Experimental results demonstrate that incorporating multimodal game instructions significantly enhances the decision transformer's multitasking and generalization capabilities.

Related papers

TalkFashion: Intelligent Virtual Try-On Assistant Based on Multimodal Large Language Model [19.347698118395673]
This paper addresses how to achieve multifunctional virtual try-on guided solely by text instructions. We propose TalkFashion, an intelligent try-on assistant that leverages the powerful comprehension capabilities of large language models. With the help of multi-modal models, this approach achieves fully automated local editing, enhancing the flexibility of editing tasks.
arXiv Detail & Related papers (2025-07-08T08:51:56Z)
Is Visual in-Context Learning for Compositional Medical Tasks within Reach? [68.56630652862293]
In this paper, we explore the potential of visual in-context learning to enable a single model to handle multiple tasks. We introduce a novel method for training in-context learners using a synthetic compositional task generation engine.
arXiv Detail & Related papers (2025-07-01T15:32:23Z)
InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models [11.913271486031201]
We develop a Context-aware instructional task assistant with multi-modal large language models (InsTALL). InsTALL responds in real time to user queries related to the task at hand. We show InsTALL achieves state-of-the-art performance across proposed sub-tasks considered for multimodal activity understanding.
arXiv Detail & Related papers (2025-01-21T15:55:06Z)
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment [58.94611347128066]
Task Preference Optimization (TPO) is a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks. By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance. Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models.
arXiv Detail & Related papers (2024-12-26T18:56:05Z)
Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts. This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals. We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z)
Multitask Vision-Language Prompt Tuning [103.5967011236282]
We propose multitask vision-language prompt tuning (MV). MV incorporates cross-task knowledge into prompt tuning for vision-language models. Results in 20 vision tasks demonstrate that the proposed approach outperforms all single-task baseline prompt tuning methods.
arXiv Detail & Related papers (2022-11-21T18:41:44Z)
Instruction-Following Agents with Multimodal Transformer [95.70039658112873]
We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments. Our method consists of a multimodal transformer that encodes visual observations and language instructions. We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings.
arXiv Detail & Related papers (2022-10-24T17:46:47Z)
Multimedia Generative Script Learning for Task Planning [58.73725388387305]
We propose a new task, Multimedia Generative Script Learning, to generate subsequent steps by tracking historical states in both text and vision modalities. This task is challenging in three aspects: the multimedia challenge of capturing the visual states in images, the induction challenge of performing unseen tasks, and the diversity challenge of covering different information in individual steps. Experiment results demonstrate that our approach significantly outperforms strong baselines.
arXiv Detail & Related papers (2022-08-25T19:04:28Z)
Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate it as a few-shot reinforcement learning problem where a task is characterized by a subtask graph. Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks. Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z)
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering [43.07139534653485]
We present Answer-Me, a task-aware multi-task framework. We pre-train a vision-language joint model, which is multi-task as well. Results show state-of-the-art performance, zero-shot generalization, robustness to forgetting, and competitive single-task results.
arXiv Detail & Related papers (2022-05-02T14:53:13Z)
Visual-and-Language Navigation: A Survey and Taxonomy [1.0742675209112622]
This paper provides a comprehensive survey on Visual-and-Language Navigation (VLN) tasks. According to when the instructions are given, the tasks can be divided into single-turn and multi-turn. This taxonomy enables researchers to better grasp the key point of a specific task and identify directions for future research.
arXiv Detail & Related papers (2021-08-26T01:51:18Z)
Episodic Transformer for Vision-and-Language Navigation [142.6236659368177]
This paper focuses on addressing two challenges: handling long sequences of subtasks, and understanding complex human instructions. We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions. Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
arXiv Detail & Related papers (2021-05-13T17:51:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.