From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons
- URL: http://arxiv.org/abs/2412.08442v1
- Date: Wed, 11 Dec 2024 15:06:25 GMT
- Title: From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons
- Authors: Andrew Szot, Bogdan Mazoure, Omar Attia, Aleksei Timofeev, Harsh Agrawal, Devon Hjelm, Zhe Gan, Zsolt Kira, Alexander Toshev
- Abstract summary: We introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA).
GEA is a single unified model capable of grounding itself across varied domains through a multi-embodiment action tokenizer.
Our findings reveal the importance of training with cross-domain data and online RL for building generalist agents.
- Abstract: We examine the capability of Multimodal Large Language Models (MLLMs) to tackle diverse domains that extend beyond the traditional language and vision tasks these models are typically trained on. Specifically, our focus lies in areas such as Embodied AI, Games, UI Control, and Planning. To this end, we introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA). GEA is a single unified model capable of grounding itself across these varied domains through a multi-embodiment action tokenizer. GEA is trained with supervised learning on a large dataset of embodied experiences and with online RL in interactive simulators. We explore the data and algorithmic choices necessary to develop such a model. Our findings reveal the importance of training with cross-domain data and online RL for building generalist agents. The final GEA model achieves strong generalization performance to unseen tasks across diverse benchmarks compared to other generalist models and benchmark-specific approaches.
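The abstract does not detail how the multi-embodiment action tokenizer works; below is a minimal, hypothetical sketch of one common design, in which continuous actions from any embodiment are discretized into a shared vocabulary appended after the MLLM's text tokens. All names and constants (ActionTokenizer, N_BINS, TEXT_VOCAB_SIZE) are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Hypothetical sketch: continuous actions from any embodiment are mapped
# into a shared discrete vocabulary reserved beyond the MLLM's text tokens.
N_BINS = 256             # assumed discrete bins per action dimension
TEXT_VOCAB_SIZE = 32000  # assumed size of the base MLLM vocabulary

class ActionTokenizer:
    def __init__(self, low: np.ndarray, high: np.ndarray):
        # Per-dimension action bounds for one embodiment.
        self.low, self.high = low, high

    def encode(self, action: np.ndarray) -> list[int]:
        # Clip, normalize to [0, 1], and discretize each dimension.
        x = np.clip(action, self.low, self.high)
        frac = (x - self.low) / (self.high - self.low)
        bins = np.minimum((frac * N_BINS).astype(int), N_BINS - 1)
        # Offset into the vocabulary region reserved for action tokens.
        return [TEXT_VOCAB_SIZE + int(b) for b in bins]

    def decode(self, tokens: list[int]) -> np.ndarray:
        bins = np.array(tokens) - TEXT_VOCAB_SIZE
        frac = (bins + 0.5) / N_BINS  # map back to bin centers
        return self.low + frac * (self.high - self.low)

# Example: a 7-DoF manipulator with per-joint actions in [-1, 1].
tok = ActionTokenizer(low=-np.ones(7), high=np.ones(7))
tokens = tok.encode(np.array([0.1, -0.5, 0.0, 0.9, -1.0, 0.3, 0.7]))
```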
Related papers
- LFME: A Simple Framework for Learning from Multiple Experts in Domain Generalization [61.16890890570814]
Domain generalization (DG) methods aim to maintain good performance in an unseen target domain by using training data from multiple source domains.
This work introduces a simple yet effective framework, dubbed Learning From Multiple Experts (LFME), that aims to make the target model an expert in all source domains to improve DG.
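The summary leaves the training objective unspecified; a minimal sketch of one plausible learning-from-experts objective, cross-entropy plus logit distillation from each example's source-domain expert, is given below. The distillation form and weight `alpha` are assumptions, not the paper's reported formulation.

```python
import torch.nn.functional as F

def lfme_style_loss(target_logits, expert_logits, labels, alpha=1.0):
    """Hypothetical sketch: the target model fits the ground-truth labels
    while matching the logits of the expert trained on each example's
    source domain."""
    ce = F.cross_entropy(target_logits, labels)
    kd = F.kl_div(
        F.log_softmax(target_logits, dim=-1),
        F.softmax(expert_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    return ce + alpha * kd
```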
arXiv Detail & Related papers (2024-10-22T13:44:10Z)
- Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement [41.7426496795769]
We propose Meta Decision Transformer (Meta-DT) to achieve efficient generalization in offline meta-RL.
We pretrain a context-aware world model to learn a compact task representation, and inject it as a contextual condition to guide task-oriented sequence generation.
We show that Meta-DT exhibits superior few-shot and zero-shot generalization compared to strong baselines.
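As a rough illustration of the conditioning described above, the sketch below prepends a compact task representation z (produced elsewhere by a pretrained context-aware world model) to the trajectory token sequence of a decision transformer. All dimensions and layer choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TaskConditionedDT(nn.Module):
    """Minimal sketch of Meta-DT-style conditioning, not the paper's
    exact architecture."""
    def __init__(self, d_model=128, z_dim=16, n_layers=3, n_heads=4, act_dim=8):
        super().__init__()
        self.z_proj = nn.Linear(z_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, act_dim)  # assumed action dim

    def forward(self, seq_tokens, z):
        # seq_tokens: (B, T, d_model) embedded trajectory tokens
        # z:          (B, z_dim) compact task representation
        cond = self.z_proj(z).unsqueeze(1)        # (B, 1, d_model)
        x = torch.cat([cond, seq_tokens], dim=1)  # prepend the condition
        causal = nn.Transformer.generate_square_subsequent_mask(
            x.size(1)).to(x.device)
        h = self.backbone(x, mask=causal)
        return self.action_head(h[:, 1:])         # per-timestep predictions
```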
arXiv Detail & Related papers (2024-10-15T09:51:30Z)
- Learning to Generalize Unseen Domains via Multi-Source Meta Learning for Text Classification [71.08024880298613]
We study multi-source domain generalization for text classification.
We propose a framework to use multiple seen domains to train a model that can achieve high accuracy in an unseen domain.
arXiv Detail & Related papers (2024-09-20T07:46:21Z)
- VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks [6.731844884087068]
We propose VolDoGer: Vision-Language dataset for Domain Generalization.
This dataset addresses three vision-language tasks: image captioning, visual question answering, and visual entailment.
We extend LLM-based data annotation techniques to vision-language tasks, thereby alleviating the burden of recruiting human annotators.
arXiv Detail & Related papers (2024-07-29T08:38:46Z)
- GenRL: Multimodal-foundation world models for generalization in embodied agents [12.263162194821787]
Reinforcement learning (RL) is hard to scale up as it requires a complex reward design for each task.
Current foundation vision-language models (VLMs) require fine-tuning or other adaptations to be adopted in embodied contexts.
Lack of multimodal data in such domains represents an obstacle to developing foundation models for embodied applications.
arXiv Detail & Related papers (2024-06-26T03:41:48Z)
- Grounding Multimodal Large Language Models in Actions [65.88208317380793]
We study how to best ground an MLLM into different embodiments and their associated action spaces.
For continuous actions, we show that a learned tokenization allows for sufficient modeling precision.
For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance.
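The entry above reports that a learned tokenization gives sufficient precision for continuous actions; a minimal sketch of one such learned tokenizer, a small vector-quantization codebook over action vectors, follows. This is a generic illustration (codebook size and loss are assumptions), not necessarily the cited paper's design.

```python
import torch
import torch.nn as nn

class VQActionTokenizer(nn.Module):
    """Illustrative learned action tokenizer: continuous actions are
    assigned to their nearest codebook entry; the codebook is trained
    to reconstruct the actions it quantizes."""
    def __init__(self, action_dim=7, codebook_size=512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, action_dim)

    def encode(self, actions):                         # (B, D) -> (B,)
        d = torch.cdist(actions, self.codebook.weight)  # pairwise distances
        return d.argmin(dim=-1)                        # nearest-code indices

    def decode(self, tokens):                          # (B,) -> (B, D)
        return self.codebook(tokens)

    def loss(self, actions):
        recon = self.decode(self.encode(actions))
        # Straight-through estimation is omitted for brevity; gradients
        # pull codebook entries toward the actions they quantize.
        return ((recon - actions.detach()) ** 2).mean()
```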
arXiv Detail & Related papers (2024-06-12T06:12:04Z)
- SilverSight: A Multi-Task Chinese Financial Large Language Model Based on Adaptive Semantic Space Learning [4.540505713937026]
This study introduces an Adaptive Semantic Space Learning (ASSL) framework to enhance the performance and selection efficacy of multi-expert models.
Our research findings demonstrate that our framework can achieve results close to those obtained with full data training using only 10% of the data, while also exhibiting strong generalization capabilities.
arXiv Detail & Related papers (2024-04-07T13:02:21Z)
- Unveiling the Generalization Power of Fine-Tuned Large Language Models [81.70754292058258]
We investigate whether fine-tuning affects the generalization ability intrinsic to Large Language Models (LLMs).
Our main findings reveal that models fine-tuned on generation and classification tasks exhibit dissimilar behaviors in generalizing to different domains and tasks.
We observe that integrating the in-context learning strategy during fine-tuning on generation tasks can enhance the model's generalization ability.
arXiv Detail & Related papers (2024-03-14T08:18:59Z)
- An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction.
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
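As a rough illustration of how such a paradigm can unify several pre-training strategies, the sketch below combines a masked-image reconstruction loss, a language-modeling loss, and a next-action prediction loss into one weighted objective. The interface (masked_image_loss, etc.) and the weights are hypothetical, not the paper's API.

```python
def unified_pretrain_loss(model, batch, w_mae=1.0, w_lm=1.0, w_act=1.0):
    """Hypothetical multi-objective pre-training step: one shared backbone
    optimized on a weighted sum of three self-supervised objectives."""
    l_mae = model.masked_image_loss(batch["images"])    # visual MAE term
    l_lm = model.language_modeling_loss(batch["text"])  # next-token term
    l_act = model.next_action_loss(batch["obs"], batch["actions"])
    return w_mae * l_mae + w_lm * l_lm + w_act * l_act
```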
arXiv Detail & Related papers (2024-02-08T18:58:02Z)
- Exploiting Style Transfer-based Task Augmentation for Cross-Domain Few-Shot Learning [4.678020383205135]
In cross-domain few-shot learning, the model trained on source domains struggles to generalize to the target domain.
We propose Task Augmented Meta-Learning (TAML) to conduct style transfer-based task augmentation.
The proposed TAML increases the stylistic diversity of training tasks and helps train a model with stronger domain generalization ability.
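One common way to implement style transfer-based augmentation is AdaIN-style statistics mixing, sketched below: content features are re-normalized with the channel-wise statistics of features from another task, yielding a style-shifted training task. This is a generic example, not necessarily TAML's exact operator.

```python
import torch

def adain_style_mix(content_feats, style_feats, eps=1e-5):
    """Replace the channel-wise mean/std of content features with those
    of style features. Shapes: (B, C, H, W)."""
    c_mu = content_feats.mean(dim=(2, 3), keepdim=True)
    c_std = content_feats.std(dim=(2, 3), keepdim=True) + eps
    s_mu = style_feats.mean(dim=(2, 3), keepdim=True)
    s_std = style_feats.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content_feats - c_mu) / c_std + s_mu
```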
arXiv Detail & Related papers (2023-01-19T07:32:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.