Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts
- URL: http://arxiv.org/abs/2506.10357v1
- Date: Thu, 12 Jun 2025 05:29:40 GMT
- Title: Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts
- Authors: Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Weili Guan, Dongmei Jiang, Liqiang Nie
- Abstract summary: We present Optimus-3, a general-purpose agent for Minecraft. We propose a knowledge-enhanced data generation pipeline to provide scalable and high-quality training data for agent development. We develop a Multimodal Reasoning-Augmented Reinforcement Learning approach to enhance the agent's reasoning ability under the visual diversity of Minecraft.
- Score: 54.21319853862452
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, agents based on multimodal large language models (MLLMs) have achieved remarkable progress across various domains. However, building a generalist agent with capabilities such as perception, planning, action, grounding, and reflection in open-world environments like Minecraft remains challenging: insufficient domain-specific data, interference among heterogeneous tasks, and visual diversity in open-world settings. In this paper, we address these challenges through three key contributions. 1) We propose a knowledge-enhanced data generation pipeline to provide scalable and high-quality training data for agent development. 2) To mitigate interference among heterogeneous tasks, we introduce a Mixture-of-Experts (MoE) architecture with task-level routing. 3) We develop a Multimodal Reasoning-Augmented Reinforcement Learning approach to enhance the agent's reasoning ability under the visual diversity of Minecraft. Built upon these innovations, we present Optimus-3, a general-purpose agent for Minecraft. Extensive experimental results demonstrate that Optimus-3 surpasses both generalist multimodal large language models and existing state-of-the-art agents across a wide range of tasks in the Minecraft environment. Project page: https://cybertronagent.github.io/Optimus-3.github.io/
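Contribution 2) is the most architectural of the three, so a small sketch helps. Below is a minimal PyTorch illustration of task-level MoE routing, where every token of a sample is dispatched to the same expert chosen by the sample's task label (unlike per-token routing in standard MoE layers); the task list, shared expert, and dimensions are illustrative assumptions, not Optimus-3's released design.

```python
# Minimal sketch of task-level MoE routing (assumed design, for
# illustration only): the entire sample is routed to one expert
# selected by its task label, so heterogeneous tasks never update
# each other's task-specific weights.
import torch
import torch.nn as nn

TASKS = ["planning", "perception", "grounding", "reflection", "action"]

class TaskRoutedMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # One feed-forward expert per task, plus a shared expert.
        self.experts = nn.ModuleDict({
            t: nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                             nn.Linear(d_ff, d_model))
            for t in TASKS
        })
        self.shared = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                    nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        # x: (batch, seq_len, d_model); routing depends only on `task`.
        return self.shared(x) + self.experts[task](x)

layer = TaskRoutedMoE(d_model=512, d_ff=1024)
out = layer(torch.randn(2, 16, 512), task="planning")
```

Because the router is conditioned on the task label rather than on token content, gradients from one task never flow into another task's expert, which is one plausible reading of how task-level routing mitigates interference.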
Related papers
- Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy [50.13429055093534]
Optimus-2 is a novel Minecraft agent that incorporates a Multimodal Large Language Model (MLLM) for high-level planning. We introduce a high-quality Minecraft Goal-Observation-Action (MGOA) dataset, which contains 25,000 videos across 8 atomic tasks. Optimus-2 exhibits superior performance across atomic tasks, long-horizon tasks, and open-ended instruction tasks in Minecraft.
arXiv Detail & Related papers (2025-02-27T09:18:04Z)
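As a rough, hypothetical illustration of what a goal-observation-action conditioned policy could look like, the sketch below predicts per-step action logits from a goal embedding, a window of observation features, and previous actions; every module choice here (GRU core, projections, action-vocabulary size) is an assumption for illustration, not Optimus-2's actual policy.

```python
# Hypothetical GOA-conditioned policy head (all details assumed):
# next-action logits from goal + observations + previous actions.
import torch
import torch.nn as nn

class GOAPolicy(nn.Module):
    def __init__(self, d: int = 512, n_actions: int = 512):
        super().__init__()
        self.goal_proj = nn.Linear(d, d)   # goal embedding from a planner
        self.obs_proj = nn.Linear(d, d)    # per-frame visual features
        self.act_embed = nn.Embedding(n_actions, d)
        self.core = nn.GRU(3 * d, d, batch_first=True)
        self.head = nn.Linear(d, n_actions)

    def forward(self, goal, obs_seq, prev_actions):
        # goal: (B, d); obs_seq: (B, T, d); prev_actions: (B, T) int64
        g = self.goal_proj(goal).unsqueeze(1).expand(-1, obs_seq.size(1), -1)
        x = torch.cat([g, self.obs_proj(obs_seq),
                       self.act_embed(prev_actions)], dim=-1)
        h, _ = self.core(x)
        return self.head(h)                # (B, T, n_actions) logits

policy = GOAPolicy()
logits = policy(torch.randn(2, 512), torch.randn(2, 10, 512),
                torch.randint(0, 512, (2, 10)))
```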
- TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft [40.419794780178044]
We present TeamCraft, a multi-modal multi-agent benchmark built on top of the open-world video game Minecraft. The benchmark features 55,000 task variants specified by multi-modal prompts, procedurally-generated expert demonstrations for imitation learning, and carefully designed protocols to evaluate model generalization capabilities. Our results indicate that existing models continue to face significant challenges in generalizing to novel goals, scenes, and unseen numbers of agents.
arXiv Detail & Related papers (2024-12-06T18:41:16Z)
- Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks [50.13429055093534]
We propose a Hybrid Multimodal Memory module to address the challenges agents face in long-horizon tasks.
It transforms knowledge into a Hierarchical Directed Knowledge Graph that allows agents to explicitly represent and learn world knowledge.
It also summarizes historical information into an Abstracted Multimodal Experience Pool that provides agents with rich references for in-context learning.
On top of the Hybrid Multimodal Memory module, the multimodal agent Optimus-1 is constructed with a dedicated Knowledge-guided Planner and an Experience-Driven Reflector.
arXiv Detail & Related papers (2024-08-07T08:16:32Z)
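The Hierarchical Directed Knowledge Graph invites a concrete illustration: items as nodes, directed edges as crafting prerequisites, and plans read off by post-order traversal. The recipes below are real game recipes, but the schema and the `plan` helper are a sketch, not Optimus-1's actual representation.

```python
# Toy directed knowledge graph over Minecraft crafting prerequisites.
# The graph schema is an illustrative assumption; the recipes are real.
GRAPH = {
    "wooden_pickaxe": [("planks", 3), ("stick", 2), ("crafting_table", 1)],
    "crafting_table": [("planks", 4)],
    "stick": [("planks", 2)],
    "planks": [("log", 1)],
    "log": [],  # leaf node: gathered directly from the environment
}

def plan(goal, order=None, seen=None):
    """Post-order DFS: prerequisites come before the items that need them."""
    order = [] if order is None else order
    seen = set() if seen is None else seen
    for dep, _qty in GRAPH[goal]:
        if dep not in seen:
            plan(dep, order, seen)
    seen.add(goal)
    order.append(goal)
    return order

print(plan("wooden_pickaxe"))
# ['log', 'planks', 'stick', 'crafting_table', 'wooden_pickaxe']
```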
- Odyssey: Empowering Minecraft Agents with Open-World Skills [26.537984734738764]
We introduce Odyssey, a new framework that empowers Large Language Model (LLM)-based agents with open-world skills to explore the vast Minecraft world. Odyssey comprises three key parts: (1) an interactive agent with an open-world skill library that consists of 40 primitive skills and 183 compositional skills; (2) a fine-tuned LLaMA-3 model trained on a large question-answering dataset with 390k+ instruction entries derived from the Minecraft Wiki; and (3) a new agent capability benchmark.
arXiv Detail & Related papers (2024-07-22T02:06:59Z)
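To make the primitive/compositional split concrete, here is a toy registry in the same spirit: primitives are atomic callables and compositional skills are ordered lists of primitive names. The API and skill names are illustrative assumptions, not Odyssey's released library.

```python
# Toy skill library: primitive skills are callables, compositional
# skills are sequences of primitive names. Names and API are assumed.
from typing import Callable, Dict, List

PRIMITIVES: Dict[str, Callable[[], None]] = {}
COMPOSITES: Dict[str, List[str]] = {}

def primitive(fn: Callable[[], None]) -> Callable[[], None]:
    """Register an atomic skill under its function name."""
    PRIMITIVES[fn.__name__] = fn
    return fn

@primitive
def chop_tree() -> None:
    print("chopping the nearest tree")

@primitive
def craft_planks() -> None:
    print("crafting planks from logs")

# A compositional skill is an ordered sequence of registered skills.
COMPOSITES["gather_wood_materials"] = ["chop_tree", "craft_planks"]

def run(skill: str) -> None:
    if skill in PRIMITIVES:
        PRIMITIVES[skill]()
    else:
        for step in COMPOSITES[skill]:
            run(step)

run("gather_wood_materials")
```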
- LEGENT: Open Platform for Embodied Agents [60.71847900126832]
We introduce LEGENT, an open, scalable platform for developing embodied agents using Large Language Models (LLMs) and Large Multimodal Models (LMMs).
LEGENT offers a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface.
In experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks.
arXiv Detail & Related papers (2024-04-28T16:50:12Z)
- Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory [97.87093169454431]
Ghost in the Minecraft (GITM) is a novel framework that integrates Large Language Models (LLMs) with text-based knowledge and memory.
We develop a set of structured actions and leverage LLMs to generate action plans for the agents to execute.
The resulting LLM-based agent markedly surpasses previous methods, achieving a remarkable improvement of +47.5% in success rate.
arXiv Detail & Related papers (2023-05-25T17:59:49Z)
- MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge [70.47759528596711]
We introduce MineDojo, a new framework built on the popular Minecraft game.
We propose a novel agent learning algorithm that leverages large pre-trained video-language models as a learned reward function.
Our agent is able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward.
arXiv Detail & Related papers (2022-06-17T15:53:05Z)
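MineDojo's learned-reward idea (its MineCLIP model scores how well recent frames match the task prompt) can be sketched as a similarity between video and text embeddings. The stand-in encoders and dimensions below are assumptions for illustration, not the released model.

```python
# Sketch of a video-language model as a learned reward: cosine
# similarity between frame-window and prompt embeddings. Encoders
# here are stand-ins, not MineDojo's released MineCLIP weights.
import torch
import torch.nn as nn

class VideoLanguageReward(nn.Module):
    def __init__(self, d: int = 512, window: int = 16):
        super().__init__()
        self.video_enc = nn.Linear(window * d, d)  # stand-in video encoder
        self.text_enc = nn.Embedding(1000, d)      # stand-in text encoder

    def forward(self, frames: torch.Tensor, prompt_ids: torch.Tensor):
        # frames: (B, window, d) pre-extracted features; prompt_ids: (B, L)
        v = self.video_enc(frames.flatten(1))
        t = self.text_enc(prompt_ids).mean(dim=1)
        v = v / v.norm(dim=-1, keepdim=True)
        t = t / t.norm(dim=-1, keepdim=True)
        return (v * t).sum(dim=-1)  # cosine similarity as a dense reward

reward_model = VideoLanguageReward()
r = reward_model(torch.randn(2, 16, 512), torch.randint(0, 1000, (2, 7)))
# r stands in for a hand-designed dense shaping reward during RL training.
```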