GenRL: Multimodal-foundation world models for generalization in embodied agents
- URL: http://arxiv.org/abs/2406.18043v2
- Date: Wed, 30 Oct 2024 20:16:18 GMT
- Title: GenRL: Multimodal-foundation world models for generalization in embodied agents
- Authors: Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Aaron Courville, Sai Rajeswar
- Abstract summary: Reinforcement learning (RL) is hard to scale up as it requires a complex reward design for each task.
Current foundation vision-language models (VLMs) require fine-tuning or other adaptations to be adopted in embodied contexts.
Lack of multimodal data in such domains represents an obstacle to developing foundation models for embodied applications.
- Score: 12.263162194821787
- Abstract: Learning generalist embodied agents, able to solve a multitude of tasks in different domains, is a long-standing problem. Reinforcement learning (RL) is hard to scale up as it requires a complex reward design for each task. In contrast, language can specify tasks in a more natural way. Current foundation vision-language models (VLMs) generally require fine-tuning or other adaptations to be adopted in embodied contexts, due to the significant domain gap. However, the lack of multimodal data in such domains represents an obstacle to developing foundation models for embodied applications. In this work, we overcome these problems by presenting multimodal-foundation world models, able to connect and align the representation of foundation VLMs with the latent space of generative world models for RL, without any language annotations. The resulting agent learning framework, GenRL, allows one to specify tasks through vision and/or language prompts, ground them in the embodied domain's dynamics, and learn the corresponding behaviors in imagination. As assessed through large-scale multi-task benchmarking in locomotion and manipulation domains, GenRL enables multi-task generalization from language and visual prompts. Furthermore, by introducing a data-free policy learning strategy, our approach lays the groundwork for foundational policy learning using generative world models. Website, code and data: https://mazpie.github.io/genrl/
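The alignment idea described in the abstract can be sketched in a few lines. This is an illustrative toy only, not the paper's implementation: the linear connector `W`, the embedding dimensions, and the cosine-similarity reward are all assumptions standing in for GenRL's learned components (which are trained without language annotations).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: VLM embedding size and world-model latent size.
VLM_DIM, LATENT_DIM = 512, 32

# A random linear "connector" standing in for the learned aligner that maps
# foundation-VLM embeddings into the world model's latent space.
W = rng.normal(scale=VLM_DIM ** -0.5, size=(VLM_DIM, LATENT_DIM))

def connect(vlm_embedding: np.ndarray) -> np.ndarray:
    """Map a VLM prompt embedding to a unit-norm world-model latent target."""
    z = vlm_embedding @ W
    return z / np.linalg.norm(z)

def latent_reward(imagined_latent: np.ndarray, target: np.ndarray) -> float:
    """Cosine similarity between an imagined latent and the prompt target,
    used here as a stand-in task reward for behavior learning in imagination."""
    a = imagined_latent / np.linalg.norm(imagined_latent)
    return float(a @ target)

# A prompt embedding (e.g. from a VLM text encoder) defines a latent target;
# imagined rollout latents closer to the target receive higher reward.
prompt_embedding = rng.normal(size=VLM_DIM)
target = connect(prompt_embedding)
imagined = target + 0.1 * rng.normal(size=LATENT_DIM)
print(latent_reward(imagined, target))
```

In this sketch a policy optimized against `latent_reward` inside the world model's imagination would be driven toward states whose latents match the prompt, which is the high-level loop the abstract describes.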
Related papers
- From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons [85.99268361356832]
We introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA).
GEA is a single unified model capable of grounding itself across varied domains through a multi-embodiment action tokenizer.
Our findings reveal the importance of training with cross-domain data and online RL for building generalist agents.
arXiv Detail & Related papers (2024-12-11T15:06:25Z)
- From Goal-Conditioned to Language-Conditioned Agents via Vision-Language Models [7.704773649029078]
Vision-language models (VLMs) have tremendous potential for grounding language.
This paper introduces a novel decomposition of the problem of building language-conditioned agents (LCAs).
We also explore several enhancements to the speed and quality of VLM-based LCAs.
arXiv Detail & Related papers (2024-09-24T12:24:07Z)
- Probing Multimodal Large Language Models for Global and Local Semantic Representations [57.25949445963422]
We study which layers of Multimodal Large Language Models contribute most to encoding global image information.
In this study, we find that the intermediate layers of models can encode more global semantic information.
We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information.
arXiv Detail & Related papers (2024-02-27T08:27:15Z) - GroundingGPT:Language Enhanced Multi-modal Grounding Model [15.44099961048236]
We propose GroundingGPT, a language enhanced multi-modal grounding model.
Our proposed model excels at tasks demanding a detailed understanding of local information within the input.
It demonstrates precise identification and localization of specific regions in images or moments in videos.
arXiv Detail & Related papers (2024-01-11T17:41:57Z) - AllSpark: A Multimodal Spatio-Temporal General Intelligence Model with Ten Modalities via Language as a Reference Framework [21.10693332367192]
We present AllSpark, a multimodal spatio-temporal general artificial intelligence model.
Our model integrates ten different modalities into a unified framework.
Experiments indicate that the incorporation of language enables AllSpark to excel in few-shot classification tasks.
arXiv Detail & Related papers (2023-12-31T17:21:02Z) - LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language
Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs).
Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
arXiv Detail & Related papers (2023-11-30T03:59:31Z) - Kosmos-2: Grounding Multimodal Large Language Models to the World [107.27280175398089]
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM).
It enables new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world.
Code and pretrained models are available at https://aka.ms/kosmos-2.
arXiv Detail & Related papers (2023-06-26T16:32:47Z) - Pre-training Contextualized World Models with In-the-wild Videos for
Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z) - PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.