JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal
Language Models
- URL: http://arxiv.org/abs/2311.05997v3
- Date: Thu, 30 Nov 2023 07:39:48 GMT
- Title: JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal
Language Models
- Authors: Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei
Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma,
Yitao Liang
- Abstract summary: We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control.
We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences.
JARVIS-1 is the most general agent in Minecraft to date, capable of completing over 200 different tasks using a control and observation space similar to that of humans.
- Score: 38.77967315158286
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Achieving human-like planning and control with multimodal observations in an
open world is a key milestone for more functional generalist agents. Existing
approaches can handle certain long-horizon tasks in an open world. However,
they still struggle when the number of open-world tasks is potentially infinite,
and they lack the capability to progressively enhance task completion as
game time progresses. We introduce JARVIS-1, an open-world agent that can
perceive multimodal input (visual observations and human instructions),
generate sophisticated plans, and perform embodied control, all within the
popular yet challenging open-world Minecraft universe. Specifically, we develop
JARVIS-1 on top of pre-trained multimodal language models, which map visual
observations and textual instructions to plans. These plans are ultimately
dispatched to goal-conditioned controllers. We outfit JARVIS-1 with a
multimodal memory, which facilitates planning using both pre-trained knowledge
and its actual game survival experiences. JARVIS-1 is the most general agent in
Minecraft to date, capable of completing over 200 different tasks using a
control and observation space similar to that of humans. These tasks range from
short-horizon tasks, e.g., "chopping trees", to long-horizon tasks, e.g.,
"obtaining a diamond pickaxe". JARVIS-1 performs exceptionally well in
short-horizon tasks, achieving nearly perfect performance. In the classic
long-term task of $\texttt{ObtainDiamondPickaxe}$, JARVIS-1 achieves 5 times the
reliability of current state-of-the-art agents and can successfully
complete longer-horizon and more challenging tasks. The project page is
available at https://craftjarvis.org/JARVIS-1
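
The abstract describes a plan-then-execute architecture: a pre-trained multimodal language model maps the instruction and visual observation to a plan, each sub-goal is dispatched to a goal-conditioned controller, and a multimodal memory supplies past successful experiences as planning references. The following is a minimal sketch of that loop; the names (MultimodalMemory, plan_with_mllm, controller.execute) are hypothetical placeholders for illustration, not JARVIS-1's actual interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class MultimodalMemory:
    """Stores (task, plan) records from past successful episodes."""
    entries: list = field(default_factory=list)

    def add(self, task: str, plan: list) -> None:
        self.entries.append({"task": task, "plan": plan})

    def retrieve(self, task: str, k: int = 3) -> list:
        # Naive keyword overlap; the paper's memory is keyed on multimodal
        # (text + visual) situations, which is out of scope for this sketch.
        scored = [(sum(w in e["task"] for w in task.split()), e) for e in self.entries]
        return [e for s, e in sorted(scored, key=lambda x: -x[0]) if s > 0][:k]


def plan_with_mllm(task: str, observation, exemplars: list) -> list:
    """Hypothetical planner call: a pre-trained multimodal LLM maps the
    instruction, the current visual observation, and retrieved exemplars
    to a sequence of sub-goals, e.g. ["log", "planks", "crafting_table"]."""
    raise NotImplementedError  # stands in for the MLLM planning prompt


def run_episode(task: str, env, controller, memory: MultimodalMemory) -> bool:
    obs = env.reset()
    exemplars = memory.retrieve(task)            # experience-augmented planning
    plan = plan_with_mllm(task, obs, exemplars)  # instruction + vision -> plan
    for goal in plan:
        obs, success = controller.execute(goal, obs)  # goal-conditioned control
        if not success:
            return False                         # (re-planning omitted for brevity)
    memory.add(task, plan)                       # memory grows with game time
    return True
```

Under this reading, the growing memory is what lets task completion improve as game time progresses: later episodes retrieve earlier successes as in-context references for the planner.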
Related papers
- Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills [57.740236400672046]
We propose a Hierarchical Multimodal Skills (HMS) module to tackle the issue of insufficient knowledge.
It progressively abstracts trajectories into execution skills, core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning.
To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm.
arXiv Detail & Related papers (2025-06-12T06:21:19Z)
- Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts [54.21319853862452]
We present Optimus-3, a general-purpose agent for Minecraft.
We propose a knowledge-enhanced data generation pipeline to provide scalable and high-quality training data for agent development.
We develop a Multimodal Reasoning-Augmented Reinforcement Learning approach to enhance the agent's reasoning ability for visual diversity.
arXiv Detail & Related papers (2025-06-12T05:29:40Z)
- Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy [50.13429055093534]
Optimus-2 is a novel Minecraft agent that incorporates a Multimodal Large Language Model (MLLM) for high-level planning.
We introduce a high-quality Minecraft Goal-Observation-Action (MGOA) dataset, which contains 25,000 videos across 8 atomic tasks.
Optimus-2 exhibits superior performance across atomic tasks, long-horizon tasks, and open-ended instruction tasks in Minecraft.
arXiv Detail & Related papers (2025-02-27T09:18:04Z)
- ReLEP: A Novel Framework for Real-world Long-horizon Embodied Planning [7.668848364013772]
We present ReLEP, a framework for Real-world Long-horizon Embodied Planning.
At its core lies a fine-tuned large vision language model that formulates plans as sequences of skill functions.
ReLEP can accomplish a wide range of daily tasks and outperforms other state-of-the-art baseline methods.
arXiv Detail & Related papers (2024-09-24T01:47:23Z)
- Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks [50.13429055093534]
We propose a Hybrid Multimodal Memory module to address the above challenges.
It transforms knowledge into a Hierarchical Directed Knowledge Graph that allows agents to explicitly represent and learn world knowledge.
It also summarizes historical information into an Abstracted Multimodal Experience Pool that provides agents with rich references for in-context learning.
On top of the Hybrid Multimodal Memory module, a multimodal agent, Optimus-1, is constructed with a dedicated Knowledge-guided Planner and Experience-Driven Reflector.
arXiv Detail & Related papers (2024-08-07T08:16:32Z)
- Odyssey: Empowering Minecraft Agents with Open-World Skills [26.537984734738764]
We introduce Odyssey, a new framework that empowers Large Language Model (LLM)-based agents with open-world skills to explore the vast Minecraft world.
Odyssey comprises three key parts: (1) An interactive agent with an open-world skill library that consists of 40 primitive skills and 183 compositional skills; (2) A fine-tuned LLaMA-3 model trained on a large question-answering dataset with 390k+ instruction entries derived from the Minecraft Wiki; and (3) A new agent capability benchmark.
arXiv Detail & Related papers (2024-07-22T02:06:59Z)
- MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception [53.20509532671891]
MP5 is an open-ended multimodal embodied system built upon the challenging Minecraft simulator.
It can decompose feasible sub-objectives, design sophisticated situation-aware plans, and perform embodied action control.
arXiv Detail & Related papers (2023-12-12T17:55:45Z)
- See and Think: Embodied Agent in Virtual Environment [12.801720916220823]
Large language models (LLMs) have achieved impressive progress on several open-world tasks.
This paper proposes STEVE, a comprehensive and visionary embodied agent in the Minecraft virtual environment.
arXiv Detail & Related papers (2023-11-26T06:38:16Z)
- Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory [97.87093169454431]
Ghost in the Minecraft (GITM) is a novel framework that integrates Large Language Models (LLMs) with text-based knowledge and memory.
We develop a set of structured actions and leverage LLMs to generate action plans for the agents to execute.
The resulting LLM-based agent markedly surpasses previous methods, achieving a remarkable improvement of +47.5% in success rate.
arXiv Detail & Related papers (2023-05-25T17:59:49Z)
- Voyager: An Open-Ended Embodied Agent with Large Language Models [103.76509266014165]
Voyager is the first embodied lifelong learning agent in Minecraft.
It continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention.
Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch.
arXiv Detail & Related papers (2023-05-25T17:46:38Z)
- MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge [70.47759528596711]
We introduce MineDojo, a new framework built on the popular Minecraft game.
We propose a novel agent learning algorithm that leverages large pre-trained video-language models as a learned reward function.
Our agent is able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward.
arXiv Detail & Related papers (2022-06-17T15:53:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.