From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons
- URL: http://arxiv.org/abs/2412.08442v1
- Date: Wed, 11 Dec 2024 15:06:25 GMT
- Title: From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons
- Authors: Andrew Szot, Bogdan Mazoure, Omar Attia, Aleksei Timofeev, Harsh Agrawal, Devon Hjelm, Zhe Gan, Zsolt Kira, Alexander Toshev
- Abstract summary: We introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA).
GEA is a single unified model capable of grounding itself across varied domains through a multi-embodiment action tokenizer.
Our findings reveal the importance of training with cross-domain data and online RL for building generalist agents.
- Abstract: We examine the capability of Multimodal Large Language Models (MLLMs) to tackle diverse domains that extend beyond the traditional language and vision tasks these models are typically trained on. Specifically, our focus lies in areas such as Embodied AI, Games, UI Control, and Planning. To this end, we introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA). GEA is a single unified model capable of grounding itself across these varied domains through a multi-embodiment action tokenizer. GEA is trained with supervised learning on a large dataset of embodied experiences and with online RL in interactive simulators. We explore the data and algorithmic choices necessary to develop such a model. Our findings reveal the importance of training with cross-domain data and online RL for building generalist agents. The final GEA model achieves strong generalization performance to unseen tasks across diverse benchmarks compared to other generalist models and benchmark-specific approaches.
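The abstract does not detail how the multi-embodiment action tokenizer works; below is a minimal, hypothetical sketch of one common design, in which continuous actions from any embodiment are discretized into a shared vocabulary appended after the MLLM's text tokens. All names and constants (ActionTokenizer, N_BINS, TEXT_VOCAB_SIZE) are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Hypothetical sketch: continuous actions from any embodiment are mapped
# into a shared discrete vocabulary reserved beyond the MLLM's text tokens.
N_BINS = 256             # assumed discrete bins per action dimension
TEXT_VOCAB_SIZE = 32000  # assumed size of the base MLLM vocabulary

class ActionTokenizer:
    def __init__(self, low: np.ndarray, high: np.ndarray):
        # Per-dimension action bounds for one embodiment.
        self.low, self.high = low, high

    def encode(self, action: np.ndarray) -> list[int]:
        # Clip, normalize to [0, 1], and discretize each dimension.
        x = np.clip(action, self.low, self.high)
        frac = (x - self.low) / (self.high - self.low)
        bins = np.minimum((frac * N_BINS).astype(int), N_BINS - 1)
        # Offset into the vocabulary region reserved for action tokens.
        return [TEXT_VOCAB_SIZE + int(b) for b in bins]

    def decode(self, tokens: list[int]) -> np.ndarray:
        bins = np.array(tokens) - TEXT_VOCAB_SIZE
        frac = (bins + 0.5) / N_BINS  # map back to bin centers
        return self.low + frac * (self.high - self.low)

# Example: a 7-DoF manipulator with per-joint actions in [-1, 1].
tok = ActionTokenizer(low=-np.ones(7), high=np.ones(7))
tokens = tok.encode(np.array([0.1, -0.5, 0.0, 0.9, -1.0, 0.3, 0.7]))
```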
Related papers
- LFME: A Simple Framework for Learning from Multiple Experts in Domain Generalization [61.16890890570814]
Domain generalization (DG) methods aim to maintain good performance in an unseen target domain by using training data from multiple source domains.
This work introduces a simple yet effective framework, dubbed Learning From Multiple Experts (LFME), that aims to make the target model an expert in all source domains to improve DG.
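The summary leaves the training objective unspecified; a minimal sketch of one plausible learning-from-experts objective, cross-entropy plus logit distillation from each example's source-domain expert, is given below. The distillation form and weight `alpha` are assumptions, not the paper's reported formulation.

```python
import torch.nn.functional as F

def lfme_style_loss(target_logits, expert_logits, labels, alpha=1.0):
    """Hypothetical sketch: the target model fits the ground-truth labels
    while matching the logits of the expert trained on each example's
    source domain."""
    ce = F.cross_entropy(target_logits, labels)
    kd = F.kl_div(
        F.log_softmax(target_logits, dim=-1),
        F.softmax(expert_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    return ce + alpha * kd
```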
arXiv Detail & Related papers (2024-10-22T13:44:10Z)
- Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement [41.7426496795769]
We propose Meta Decision Transformer (Meta-DT) to achieve efficient generalization in offline meta-RL.
We pretrain a context-aware world model to learn a compact task representation, and inject it as a contextual condition to guide task-oriented sequence generation.
We show that Meta-DT exhibits superior few-shot and zero-shot generalization compared to strong baselines.
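As a rough illustration of the conditioning described above, the sketch below prepends a compact task representation z (produced elsewhere by a pretrained context-aware world model) to the trajectory token sequence of a decision transformer. All dimensions and layer choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TaskConditionedDT(nn.Module):
    """Minimal sketch of Meta-DT-style conditioning, not the paper's
    exact architecture."""
    def __init__(self, d_model=128, z_dim=16, n_layers=3, n_heads=4, act_dim=8):
        super().__init__()
        self.z_proj = nn.Linear(z_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, act_dim)  # assumed action dim

    def forward(self, seq_tokens, z):
        # seq_tokens: (B, T, d_model) embedded trajectory tokens
        # z:          (B, z_dim) compact task representation
        cond = self.z_proj(z).unsqueeze(1)        # (B, 1, d_model)
        x = torch.cat([cond, seq_tokens], dim=1)  # prepend the condition
        causal = nn.Transformer.generate_square_subsequent_mask(
            x.size(1)).to(x.device)
        h = self.backbone(x, mask=causal)
        return self.action_head(h[:, 1:])         # per-timestep predictions
```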
arXiv Detail & Related papers (2024-10-15T09:51:30Z)
- Learning to Generalize Unseen Domains via Multi-Source Meta Learning for Text Classification [71.08024880298613]
We study multi-source domain generalization for text classification.
We propose a framework to use multiple seen domains to train a model that can achieve high accuracy in an unseen domain.
arXiv Detail & Related papers (2024-09-20T07:46:21Z)
- VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks [6.731844884087068]
We propose VolDoGer: Vision-Language dataset for Domain Generalization.
This dataset addresses three vision-language tasks: image captioning, visual question answering, and visual entailment.
We extend LLM-based data annotation techniques to vision-language tasks, thereby alleviating the burden of recruiting human annotators.
arXiv Detail & Related papers (2024-07-29T08:38:46Z)
- GenRL: Multimodal-foundation world models for generalization in embodied agents [12.263162194821787]
Reinforcement learning (RL) is hard to scale up as it requires a complex reward design for each task.
Current foundation vision-language models (VLMs) require fine-tuning or other adaptations to be adopted in embodied contexts.
Lack of multimodal data in such domains represents an obstacle to developing foundation models for embodied applications.
arXiv Detail & Related papers (2024-06-26T03:41:48Z)
- Grounding Multimodal Large Language Models in Actions [65.88208317380793]
We study how to best ground an MLLM into different embodiments and their associated action spaces.
For continuous actions, we show that a learned tokenization allows for sufficient modeling precision.
For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance.
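The entry above reports that a learned tokenization gives sufficient precision for continuous actions; a minimal sketch of one such learned tokenizer, a small vector-quantization codebook over action vectors, follows. This is a generic illustration (codebook size and loss are assumptions), not necessarily the cited paper's design.

```python
import torch
import torch.nn as nn

class VQActionTokenizer(nn.Module):
    """Illustrative learned action tokenizer: continuous actions are
    assigned to their nearest codebook entry; the codebook is trained
    to reconstruct the actions it quantizes."""
    def __init__(self, action_dim=7, codebook_size=512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, action_dim)

    def encode(self, actions):                         # (B, D) -> (B,)
        d = torch.cdist(actions, self.codebook.weight)  # pairwise distances
        return d.argmin(dim=-1)                        # nearest-code indices

    def decode(self, tokens):                          # (B,) -> (B, D)
        return self.codebook(tokens)

    def loss(self, actions):
        recon = self.decode(self.encode(actions))
        # Straight-through estimation is omitted for brevity; gradients
        # pull codebook entries toward the actions they quantize.
        return ((recon - actions.detach()) ** 2).mean()
```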
arXiv Detail & Related papers (2024-06-12T06:12:04Z)
- SilverSight: A Multi-Task Chinese Financial Large Language Model Based on Adaptive Semantic Space Learning [4.540505713937026]
This study introduces an Adaptive Semantic Space Learning (ASSL) framework to enhance the performance and selection efficacy of multi-expert models.
Our research findings demonstrate that our framework can achieve results close to those obtained with full data training using only 10% of the data, while also exhibiting strong generalization capabilities.
arXiv Detail & Related papers (2024-04-07T13:02:21Z)
- Unveiling the Generalization Power of Fine-Tuned Large Language Models [81.70754292058258]
We investigate whether fine-tuning affects the generalization ability intrinsic to Large Language Models (LLMs).
Our main findings reveal that models fine-tuned on generation and classification tasks exhibit dissimilar behaviors in generalizing to different domains and tasks.
We observe that integrating the in-context learning strategy during fine-tuning on generation tasks can enhance the model's generalization ability.
arXiv Detail & Related papers (2024-03-14T08:18:59Z)
- An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction.
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
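As a rough illustration of how such a paradigm can unify several pre-training strategies, the sketch below combines a masked-image reconstruction loss, a language-modeling loss, and a next-action prediction loss into one weighted objective. The interface (masked_image_loss, etc.) and the weights are hypothetical, not the paper's API.

```python
def unified_pretrain_loss(model, batch, w_mae=1.0, w_lm=1.0, w_act=1.0):
    """Hypothetical multi-objective pre-training step: one shared backbone
    optimized on a weighted sum of three self-supervised objectives."""
    l_mae = model.masked_image_loss(batch["images"])    # visual MAE term
    l_lm = model.language_modeling_loss(batch["text"])  # next-token term
    l_act = model.next_action_loss(batch["obs"], batch["actions"])
    return w_mae * l_mae + w_lm * l_lm + w_act * l_act
```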
arXiv Detail & Related papers (2024-02-08T18:58:02Z)
- Exploiting Style Transfer-based Task Augmentation for Cross-Domain Few-Shot Learning [4.678020383205135]
In cross-domain few-shot learning, the model trained on source domains struggles to generalize to the target domain.
We propose Task Augmented Meta-Learning (TAML) to conduct style transfer-based task augmentation.
The proposed TAML increases the stylistic diversity of training tasks and helps train a model with stronger domain generalization ability.
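One common way to implement style transfer-based augmentation is AdaIN-style statistics mixing, sketched below: content features are re-normalized with the channel-wise statistics of features from another task, yielding a style-shifted training task. This is a generic example, not necessarily TAML's exact operator.

```python
import torch

def adain_style_mix(content_feats, style_feats, eps=1e-5):
    """Replace the channel-wise mean/std of content features with those
    of style features. Shapes: (B, C, H, W)."""
    c_mu = content_feats.mean(dim=(2, 3), keepdim=True)
    c_std = content_feats.std(dim=(2, 3), keepdim=True) + eps
    s_mu = style_feats.mean(dim=(2, 3), keepdim=True)
    s_std = style_feats.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content_feats - c_mu) / c_std + s_mu
```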
arXiv Detail & Related papers (2023-01-19T07:32:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.