Voyager: An Open-Ended Embodied Agent with Large Language Models
- URL: http://arxiv.org/abs/2305.16291v2
- Date: Thu, 19 Oct 2023 16:27:03 GMT
- Title: Voyager: An Open-Ended Embodied Agent with Large Language Models
- Authors: Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao,
Yuke Zhu, Linxi Fan, Anima Anandkumar
- Abstract summary: Voyager is the first embodied lifelong learning agent in Minecraft.
It continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention.
Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch.
- Score: 103.76509266014165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Voyager, the first LLM-powered embodied lifelong learning agent
in Minecraft that continuously explores the world, acquires diverse skills, and
makes novel discoveries without human intervention. Voyager consists of three
key components: 1) an automatic curriculum that maximizes exploration, 2) an
ever-growing skill library of executable code for storing and retrieving
complex behaviors, and 3) a new iterative prompting mechanism that incorporates
environment feedback, execution errors, and self-verification for program
improvement. Voyager interacts with GPT-4 via blackbox queries, which bypasses
the need for model parameter fine-tuning. The skills developed by Voyager are
temporally extended, interpretable, and compositional, which compounds the
agent's abilities rapidly and alleviates catastrophic forgetting. Empirically,
Voyager shows strong in-context lifelong learning capability and exhibits
exceptional proficiency in playing Minecraft. It obtains 3.3x more unique
items, travels 2.3x longer distances, and unlocks key tech tree milestones up
to 15.3x faster than prior SOTA. Voyager is able to utilize the learned skill
library in a new Minecraft world to solve novel tasks from scratch, while other
techniques struggle to generalize. We open-source our full codebase and prompts
at https://voyager.minedojo.org/.
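The three components named in the abstract (automatic curriculum, skill library, iterative prompting) compose into a single outer loop. Below is a minimal, hypothetical Python sketch of that loop, not Voyager's actual codebase: `llm`, `execute`, `Skill`, and `SkillLibrary` are illustrative placeholders standing in for the blackbox GPT-4 queries, the Minecraft code executor, and the skill storage/retrieval the paper describes.

```python
# Minimal sketch of a Voyager-style lifelong learning loop (illustrative only).
# llm() and execute() are stubs for GPT-4 blackbox queries and the in-game
# executor; none of these names come from the actual Voyager codebase.
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    description: str
    code: str  # an executable program, e.g. a generated Minecraft control script

@dataclass
class SkillLibrary:
    skills: list[Skill] = field(default_factory=list)

    def add(self, skill: Skill) -> None:
        self.skills.append(skill)

    def retrieve(self, task: str, k: int = 5) -> list[Skill]:
        # The paper retrieves skills by similarity to stored skill descriptions;
        # a simple keyword-overlap score stands in for embeddings here.
        overlap = lambda s: len(set(task.lower().split()) & set(s.description.lower().split()))
        return sorted(self.skills, key=overlap, reverse=True)[:k]

def llm(prompt: str) -> str:
    """Placeholder for a blackbox GPT-4 query (no parameter fine-tuning)."""
    raise NotImplementedError

def execute(code: str) -> tuple[bool, str, str]:
    """Placeholder executor: returns (success, environment_feedback, error_trace)."""
    raise NotImplementedError

def lifelong_learning_loop(library: SkillLibrary, max_tasks: int, max_retries: int = 4) -> None:
    agent_state = "initial spawn, empty inventory"
    for _ in range(max_tasks):
        # 1) Automatic curriculum: ask the LLM for the next task that maximizes exploration.
        task = llm(f"Agent state: {agent_state}\nPropose the next exploration task.")

        # 2) Skill library: retrieve relevant stored skills to condition code generation.
        context = "\n".join(s.code for s in library.retrieve(task))

        # 3) Iterative prompting: refine the program using environment feedback,
        #    execution errors, and self-verification.
        feedback, errors = "", ""
        for _ in range(max_retries):
            code = llm(
                f"Task: {task}\nRelevant skills:\n{context}\n"
                f"Previous feedback: {feedback}\nPrevious errors: {errors}\n"
                "Write a program that completes the task."
            )
            success, feedback, errors = execute(code)
            verified = success and "yes" in llm(
                f"Task: {task}\nFeedback: {feedback}\nWas the task completed? yes/no"
            ).lower()
            if verified:
                # 4) Store the verified program as a new, reusable, compositional skill.
                library.add(Skill(task, llm(f"Describe this program:\n{code}"), code))
                agent_state = feedback
                break
```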
Related papers
- VoyagerVision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems [50.97354139604596]
This paper proposes VoyagerVision, a model capable of creating structures within Minecraft using screenshots as a form of visual feedback.
VoyagerVision was successful in half of all attempts in flat worlds, with most failures arising in more complex structures.
arXiv Detail & Related papers (2025-06-29T14:16:11Z)
- Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills [57.740236400672046]
We propose a Hierarchical Multimodal Skills (HMS) module to tackle the issue of insufficient knowledge.
It progressively abstracts trajectories into execution skills, core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning.
To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm.
arXiv Detail & Related papers (2025-06-12T06:21:19Z)
- Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts [54.21319853862452]
We present Optimus-3, a general-purpose agent for Minecraft.
We propose a knowledge-enhanced data generation pipeline to provide scalable and high-quality training data for agent development.
We develop a Multimodal Reasoning-Augmented Reinforcement Learning approach to enhance the agent's reasoning ability for visual diversity.
arXiv Detail & Related papers (2025-06-12T05:29:40Z)
- Voyager: Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation [66.95956271144982]
We present Voyager, a novel video diffusion framework that generates world-consistent 3D point-cloud sequences from a single image.
Unlike existing approaches, Voyager achieves end-to-end scene generation and reconstruction with inherent consistency across frames.
arXiv Detail & Related papers (2025-06-04T17:59:04Z)
- MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning [3.187381965457262]
CollabVoyager is a novel framework that enhances Voyager with lifelong collaborative learning through explicit perspective-taking.
CollabVoyager introduces three key innovations: (1) theory of mind representations linking percepts, beliefs, desires, and actions; (2) natural language communication between agents; and (3) semantic memory of task and environment knowledge.
In mixed-expertise Minecraft experiments, CollabVoyager agents outperform Voyager counterparts, significantly improving task completion rates by 66.6% (+39.4%) for collecting one block of dirt and 70.8% (+20.8%) for
arXiv Detail & Related papers (2024-11-20T02:10:44Z)
- O1 Replication Journey: A Strategic Progress Report -- Part 1 [52.062216849476776]
This paper introduces a pioneering approach to artificial intelligence research, embodied in our O1 Replication Journey.
Our methodology addresses critical challenges in modern AI research, including the insularity of prolonged team-based projects.
We propose the journey learning paradigm, which encourages models to learn not just shortcuts, but the complete exploration process.
arXiv Detail & Related papers (2024-10-08T15:13:01Z)
- Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks [50.13429055093534]
We propose a Hybrid Multimodal Memory module to address the above challenges.
It transforms knowledge into a Hierarchical Directed Knowledge Graph that allows agents to explicitly represent and learn world knowledge.
It also summarises historical information into an Abstracted Multimodal Experience Pool that provides agents with rich references for in-context learning.
On top of the Hybrid Multimodal Memory module, a multimodal agent, Optimus-1, is constructed with a dedicated Knowledge-guided Planner and Experience-Driven Reflector.
arXiv Detail & Related papers (2024-08-07T08:16:32Z)
- Odyssey: Empowering Minecraft Agents with Open-World Skills [26.537984734738764]
We introduce Odyssey, a new framework that empowers Large Language Model (LLM)-based agents with open-world skills to explore the vast Minecraft world.
Odyssey comprises three key parts: (1) An interactive agent with an open-world skill library that consists of 40 primitive skills and 183 compositional skills; (2) A fine-tuned LLaMA-3 model trained on a large question-answering dataset with 390k+ instruction entries derived from the Minecraft Wiki; and (3) A new agent capability benchmark.
arXiv Detail & Related papers (2024-07-22T02:06:59Z)
- See and Think: Embodied Agent in Virtual Environment [12.801720916220823]
Large language models (LLMs) have achieved impressive progress on several open-world tasks.
This paper proposes STEVE, a comprehensive and visionary embodied agent in the Minecraft virtual environment.
arXiv Detail & Related papers (2023-11-26T06:38:16Z)
- JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models [38.77967315158286]
We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions)
We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences.
JARVIS-1 is the most general agent in Minecraft to date, capable of completing over 200 different tasks using a control and observation space similar to humans.
arXiv Detail & Related papers (2023-11-10T11:17:58Z)
- Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory [97.87093169454431]
Ghost in the Minecraft (GITM) is a novel framework that integrates Large Language Models (LLMs) with text-based knowledge and memory.
We develop a set of structured actions and leverage LLMs to generate action plans for the agents to execute.
The resulting LLM-based agent markedly surpasses previous methods, achieving a remarkable improvement of +47.5% in success rate.
arXiv Detail & Related papers (2023-05-25T17:59:49Z)
- Lana: A Language-Capable Navigator for Instruction Following and Generation [70.76686546473994]
LANA is a language-capable navigation agent which is able to execute human-written navigation commands and provide route descriptions to humans.
We empirically verify that, compared with recent advanced task-specific solutions, LANA attains better performance on both instruction following and route description.
In addition, endowed with language generation capability, LANA can explain its behaviors to humans and assist them with wayfinding.
arXiv Detail & Related papers (2023-03-15T07:21:28Z)
- MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge [70.47759528596711]
We introduce MineDojo, a new framework built on the popular Minecraft game.
We propose a novel agent learning algorithm that leverages large pre-trained video-language models as a learned reward function.
Our agent is able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward.
arXiv Detail & Related papers (2022-06-17T15:53:05Z)