Related papers: Multimodal foundation world models for generalist embodied agents

Multimodal foundation world models for generalist embodied agents

URL: http://arxiv.org/abs/2406.18043v1
Date: Wed, 26 Jun 2024 03:41:48 GMT
Title: Multimodal foundation world models for generalist embodied agents
Authors: Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Aaron Courville, Sai Rajeswar,
Abstract summary: Reinforcement learning (RL) is hard to scale up as it requires a complex reward design for each task. Current foundation vision-language models (VLMs) generally require fine-tuning or other adaptations to be functional. We present multimodal foundation world models, able to connect and align the representation of foundation VLMs with the latent space of generative world models for RL. The resulting agent learning framework, GenRL, allows one to specify tasks through vision locomotion and/or language prompts, ground them in the embodied domain's dynamics, and learns the corresponding behaviors in imagination.
Score: 12.263162194821787
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Learning generalist embodied agents, able to solve multitudes of tasks in different domains is a long-standing problem. Reinforcement learning (RL) is hard to scale up as it requires a complex reward design for each task. In contrast, language can specify tasks in a more natural way. Current foundation vision-language models (VLMs) generally require fine-tuning or other adaptations to be functional, due to the significant domain gap. However, the lack of multimodal data in such domains represents an obstacle toward developing foundation models for embodied applications. In this work, we overcome these problems by presenting multimodal foundation world models, able to connect and align the representation of foundation VLMs with the latent space of generative world models for RL, without any language annotations. The resulting agent learning framework, GenRL, allows one to specify tasks through vision and/or language prompts, ground them in the embodied domain's dynamics, and learns the corresponding behaviors in imagination. As assessed through large-scale multi-task benchmarking, GenRL exhibits strong multi-task generalization performance in several locomotion and manipulation domains. Furthermore, by introducing a data-free RL strategy, it lays the groundwork for foundation model-based RL for generalist embodied agents.

Related papers

From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons [85.99268361356832]
We introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA) GEA is a single unified model capable of grounding itself across varied domains through a multi-embodiment action tokenizer. Our findings reveal the importance of training with cross-domain data and online RL for building generalist agents.
arXiv Detail & Related papers (2024-12-11T15:06:25Z)
Bridging Environments and Language with Rendering Functions and Vision-Language Models [7.704773649029078]
Vision-language models (VLMs) have tremendous potential for grounding language. This paper introduces a novel decomposition of the problem of building language-conditioned agents (LCAs) We also explore several enhancements to the speed and quality of VLM-based LCAs.
arXiv Detail & Related papers (2024-09-24T12:24:07Z)
VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks [6.731844884087068]
We propose VolDoGer: Vision-Language dataset for Domain Generalization. This dataset addresses three vision-language tasks: image captioning, visual question answering, and visual entailment. We extend LLM-based data annotation techniques to vision-language tasks, thereby alleviating the burden of recruiting human annotators.
arXiv Detail & Related papers (2024-07-29T08:38:46Z)
Scalable Language Model with Generalized Continual Learning [58.700439919096155]
The Joint Adaptive Re-ization (JARe) is integrated with Dynamic Task-related Knowledge Retrieval (DTKR) to enable adaptive adjustment of language models based on specific downstream tasks. Our method demonstrates state-of-the-art performance on diverse backbones and benchmarks, achieving effective continual learning in both full-set and few-shot scenarios with minimal forgetting.
arXiv Detail & Related papers (2024-04-11T04:22:15Z)
Probing Multimodal Large Language Models for Global and Local Semantic Representations [57.25949445963422]
We study which layers of Multimodal Large Language Models make the most effort to the global image information. In this study, we find that the intermediate layers of models can encode more global semantic information. We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information.
arXiv Detail & Related papers (2024-02-27T08:27:15Z)
GroundingGPT:Language Enhanced Multi-modal Grounding Model [15.44099961048236]
We propose GroundingGPT, a language enhanced multi-modal grounding model. Our proposed model excels at tasks demanding a detailed understanding of local information within the input. It demonstrates precise identification and localization of specific regions in images or moments in videos.
arXiv Detail & Related papers (2024-01-11T17:41:57Z)
AllSpark: A Multimodal Spatio-Temporal General Intelligence Model with Ten Modalities via Language as a Reference Framework [21.10693332367192]
We present AllSpark, a multimodal-temporal general artificial intelligence model. Our model integrates ten different modalities into a unified framework. Experiments indicate that the incorporation of language enables AllSpark to excel in few-shot classification tasks.
arXiv Detail & Related papers (2023-12-31T17:21:02Z)
LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs) Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
arXiv Detail & Related papers (2023-11-30T03:59:31Z)
Kosmos-2: Grounding Multimodal Large Language Models to the World [107.27280175398089]
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM) It enables new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Code and pretrained models are available at https://aka.ms/kosmos-2.
arXiv Detail & Related papers (2023-06-26T16:32:47Z)
Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks. We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling. Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks. Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
On the cross-lingual transferability of multilingual prototypical models across NLU tasks [2.44288434255221]
Supervised deep learning-based approaches have been applied to task-oriented dialog and have proven to be effective for limited domain and language applications. In practice, these approaches suffer from the drawbacks of domain-driven design and under-resourced languages. This article proposes to investigate the cross-lingual transferability of using synergistically few-shot learning with prototypical neural networks and multilingual Transformers-based models.
arXiv Detail & Related papers (2022-07-19T09:55:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.