Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks
- URL: http://arxiv.org/abs/2408.03615v2
- Date: Mon, 21 Oct 2024 11:05:27 GMT
- Title: Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks
- Authors: Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, Liqiang Nie
- Abstract summary: We propose a Hybrid Multimodal Memory module to address the above challenges.
It transforms knowledge into a Hierarchical Directed Knowledge Graph that allows agents to explicitly represent and learn world knowledge.
It also summarises historical information into an Abstracted Multimodal Experience Pool that provides agents with rich references for in-context learning.
On top of the Hybrid Multimodal Memory module, a multimodal agent, Optimus-1, is constructed with a dedicated Knowledge-guided Planner and an Experience-Driven Reflector.
- Score: 50.13429055093534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building a general-purpose agent is a long-standing vision in the field of artificial intelligence. Existing agents have made remarkable progress in many domains, yet they still struggle to complete long-horizon tasks in an open world. We attribute this to the lack of necessary world knowledge and multimodal experience that can guide agents through a variety of long-horizon tasks. In this paper, we propose a Hybrid Multimodal Memory module to address the above challenges. It 1) transforms knowledge into a Hierarchical Directed Knowledge Graph that allows agents to explicitly represent and learn world knowledge, and 2) summarises historical information into an Abstracted Multimodal Experience Pool that provides agents with rich references for in-context learning. On top of the Hybrid Multimodal Memory module, a multimodal agent, Optimus-1, is constructed with a dedicated Knowledge-guided Planner and an Experience-Driven Reflector, contributing to better planning and reflection in the face of long-horizon tasks in Minecraft. Extensive experimental results show that Optimus-1 significantly outperforms all existing agents on challenging long-horizon task benchmarks, and exhibits near human-level performance on many tasks. In addition, we introduce various Multimodal Large Language Models (MLLMs) as the backbone of Optimus-1. Experimental results show that Optimus-1 exhibits strong generalization with the help of the Hybrid Multimodal Memory module, outperforming the GPT-4V baseline on many tasks.
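To make the memory design concrete, here is a minimal illustrative sketch in Python of the two stores the abstract describes: a directed knowledge graph an agent can query for the prerequisites of a goal, and an experience pool that returns summaries of similar past tasks as in-context references. All class and method names (KnowledgeGraph, ExperiencePool, retrieve, etc.) are hypothetical and not taken from the paper; the actual Optimus-1 memory is multimodal and considerably more elaborate.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class KnowledgeGraph:
    """Hypothetical directed knowledge graph: edges point from a prerequisite to the items it enables."""
    edges: Dict[str, List[str]] = field(default_factory=dict)

    def add_rule(self, prerequisite: str, target: str) -> None:
        self.edges.setdefault(prerequisite, []).append(target)

    def prerequisites_of(self, target: str) -> List[str]:
        # Walk the graph backwards and collect everything needed before `target`.
        direct = [src for src, dsts in self.edges.items() if target in dsts]
        result: List[str] = []
        for p in direct:
            result.extend(self.prerequisites_of(p))
            result.append(p)
        return list(dict.fromkeys(result))  # de-duplicate, keep order

@dataclass
class ExperiencePool:
    """Hypothetical experience pool: stores (task, outcome summary) pairs for in-context retrieval."""
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, task: str, summary: str) -> None:
        self.entries.append((task, summary))

    def retrieve(self, task: str, k: int = 2) -> List[str]:
        # Toy lexical-overlap retrieval; a real system would use learned multimodal embeddings.
        def overlap(a: str, b: str) -> int:
            return len(set(a.lower().split()) & set(b.lower().split()))
        ranked = sorted(self.entries, key=lambda e: overlap(task, e[0]), reverse=True)
        return [summary for _, summary in ranked[:k]]

if __name__ == "__main__":
    kg = KnowledgeGraph()
    kg.add_rule("log", "planks")
    kg.add_rule("planks", "crafting_table")
    kg.add_rule("planks", "stick")
    kg.add_rule("stick", "wooden_pickaxe")
    kg.add_rule("crafting_table", "wooden_pickaxe")
    print(kg.prerequisites_of("wooden_pickaxe"))  # ['log', 'planks', 'stick', 'crafting_table']

    pool = ExperiencePool()
    pool.add("craft a wooden pickaxe", "gathered logs, crafted planks and sticks before the pickaxe")
    pool.add("mine iron ore", "needed a stone pickaxe before mining")
    print(pool.retrieve("craft a stone pickaxe", k=1))
```

In a planner-reflector loop of the kind the abstract outlines, the prerequisites returned by the graph and the summaries returned by the pool would simply be prepended to the planner's prompt, so that planning is conditioned on both explicit world knowledge and prior experience.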
Related papers
- Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills [57.740236400672046]
We propose a Hierarchical Multimodal Skills (HMS) module to tackle the issue of insufficient knowledge.
It progressively abstracts trajectories into execution skills, core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning.
To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm.
arXiv Detail & Related papers (2025-06-12T06:21:19Z)
- Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts [54.21319853862452]
We present Optimus-3, a general-purpose agent for Minecraft.
We propose a knowledge-enhanced data generation pipeline to provide scalable and high-quality training data for agent development.
We develop a Multimodal Reasoning-Augmented Reinforcement Learning approach to enhance the agent's reasoning ability for visual diversity.
arXiv Detail & Related papers (2025-06-12T05:29:40Z)
- OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis [70.39500621448383]
The open-world mobile manipulation task remains a challenge due to the need for generalization to open-ended instructions and environments.
We propose a novel multi-modal agent architecture that maintains multi-view scene frames and agent states for decision-making and controls the robot by function calling.
We highlight our fine-tuned OWMM-VLM as the first dedicated foundation model for mobile manipulators with global scene understanding, robot state tracking, and multi-modal action generation in a unified model.
arXiv Detail & Related papers (2025-06-04T17:57:44Z)
- AI2MMUM: AI-AI Oriented Multi-Modal Universal Model Leveraging Telecom Domain Large Model [8.404195378257178]
We propose a scalable, task-aware artificial intelligence-air interface multi-modal universal model (AI2MMUM).
To enhance task adaptability, task instructions consist of fixed task keywords and learnable, implicit prefix prompts.
Lightweight task-specific heads are designed to directly output task objectives.
arXiv Detail & Related papers (2025-05-15T06:32:59Z)
- Multi-Mission Tool Bench: Assessing the Robustness of LLM based Agents through Related and Dynamic Missions [12.218102495632937]
Large language models (LLMs) demonstrate strong potential as agents for tool invocation due to their advanced comprehension and planning capabilities.
We propose the Multi-Mission Tool Bench. In the benchmark, each test case comprises multiple interrelated missions.
We also propose a novel method to evaluate the accuracy and efficiency of agent decisions with dynamic decision trees.
arXiv Detail & Related papers (2025-04-03T14:21:33Z)
- LaMMA-P: Generalizable Multi-Agent Long-Horizon Task Allocation and Planning with LM-Driven PDDL Planner [9.044939946653002]
Language models (LMs) possess a strong capability to comprehend natural language, making them effective in translating human instructions into detailed plans for simple robot tasks.
We propose a Language Model-Driven Multi-Agent PDDL Planner (LaMMA-P), a novel multi-agent task planning framework.
LaMMA-P integrates the strengths of LMs' reasoning capability with a traditional search planner to achieve high success rates and efficiency.
arXiv Detail & Related papers (2024-09-30T17:58:18Z)
- Hybrid Training for Enhanced Multi-task Generalization in Multi-agent Reinforcement Learning [7.6201940008534175]
HyGen is a novel hybrid MARL framework, which integrates online and offline learning to ensure both multi-task generalization and training efficiency.
We empirically demonstrate that our framework effectively extracts and refines general skills, yielding impressive generalization to unseen tasks.
arXiv Detail & Related papers (2024-08-24T12:37:03Z)
- MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models [57.091523832149655]
We propose a mixture of multimodal experts (MoME) to mitigate task interference and obtain a generalist MLLM.
Our MoME is composed of two key components: a mixture of vision experts (MoVE) and a mixture of language experts (MoLE); a generic expert-routing sketch appears after this list.
arXiv Detail & Related papers (2024-07-17T16:31:38Z)
- Synergistic Multi-Agent Framework with Trajectory Learning for Knowledge-Intensive Tasks [44.42989163847349]
Large Language Models (LLMs) have led to significant breakthroughs in various natural language processing tasks.
However, generating factually consistent responses in knowledge-intensive scenarios remains a challenge.
This paper introduces SMART, a novel multi-agent framework that leverages external knowledge to enhance the interpretability and factual consistency of LLM-generated responses.
arXiv Detail & Related papers (2024-07-13T13:58:24Z)
- Do We Really Need a Complex Agent System? Distill Embodied Agent into a Single Model [15.558269067931374]
We propose STEVE-2, a hierarchical knowledge distillation framework for open-ended embodied tasks.
After distillation, embodied agents can complete complex, open-ended tasks without additional expert guidance.
arXiv Detail & Related papers (2024-04-06T12:51:00Z)
- Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
We present a systematic analysis of MM-VUFMs specifically designed for road scenes.
Our objective is to provide a comprehensive overview of common practices, referring to task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques.
We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
arXiv Detail & Related papers (2024-02-05T12:47:09Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models [38.77967315158286]
We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions).
We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences.
JARVIS-1 is the most general agent in Minecraft to date, capable of completing over 200 different tasks using a control and observation space similar to that of humans.
arXiv Detail & Related papers (2023-11-10T11:17:58Z)
- Agent Lumos: Unified and Modular Training for Open-Source Language Agents [89.78556964988852]
We introduce LUMOS, one of the first frameworks for training open-source LLM-based agents.
LUMOS features a learnable, unified, and modular architecture with a planning module that learns high-level subgoal generation.
We collect large-scale, unified, and high-quality training annotations derived from diverse ground-truth reasoning rationales.
arXiv Detail & Related papers (2023-11-09T00:30:13Z)
- UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers [108.92194081987967]
We make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing a single architecture to fit different tasks.
Unlike previous RNN-based models, we utilize a transformer-based model to generate a flexible policy.
The proposed model, named Universal Policy Decoupling Transformer (UPDeT), further relaxes the action restriction and makes the decision process of multi-agent tasks more explainable.
arXiv Detail & Related papers (2021-01-20T07:24:24Z)
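As referenced in the MoME entry above, the mixture-of-experts pattern routes each input through several specialized sub-networks whose outputs are combined by a learned gate. Below is a minimal, generic sketch of soft expert routing in Python with NumPy; it is not MoME's actual MoVE/MoLE design, and all names (SoftMoE, gate, experts) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class SoftMoE:
    """Generic soft mixture-of-experts layer (illustrative only)."""

    def __init__(self, dim: int, num_experts: int):
        # Each "expert" is a single linear map here; real systems use full MLPs or adapters.
        self.experts = [rng.standard_normal((dim, dim)) * 0.02 for _ in range(num_experts)]
        self.gate = rng.standard_normal((dim, num_experts)) * 0.02  # routing weights

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x: (batch, dim). The gate decides how much each expert contributes per example.
        weights = softmax(x @ self.gate)                       # (batch, num_experts)
        expert_out = np.stack([x @ w for w in self.experts])   # (num_experts, batch, dim)
        return np.einsum("be,ebd->bd", weights, expert_out)    # weighted sum over experts

layer = SoftMoE(dim=16, num_experts=3)
print(layer(rng.standard_normal((4, 16))).shape)  # (4, 16)
```

A sparse variant would keep only the top-k gate weights per example; the soft version above keeps the example short.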
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.