See and Think: Embodied Agent in Virtual Environment
- URL: http://arxiv.org/abs/2311.15209v3
- Date: Tue, 9 Jul 2024 05:01:06 GMT
- Title: See and Think: Embodied Agent in Virtual Environment
- Authors: Zhonghan Zhao, Wenhao Chai, Xuan Wang, Li Boyi, Shengyu Hao, Shidong Cao, Tian Ye, Gaoang Wang
- Abstract summary: Large language models (LLMs) have achieved impressive progress on several open-world tasks.
This paper proposes STEVE, a comprehensive and visionary embodied agent in the Minecraft virtual environment.
- Score: 12.801720916220823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have achieved impressive progress on several open-world tasks. Recently, using LLMs to build embodied agents has been a hotspot. This paper proposes STEVE, a comprehensive and visionary embodied agent in the Minecraft virtual environment. STEVE comprises three key components: vision perception, language instruction, and code action. Vision perception involves interpreting visual information in the environment, which is then integrated into the LLM component together with the agent state and task instruction. Language instruction is responsible for iterative reasoning and for decomposing complex tasks into manageable guidelines. Code action generates executable skill actions based on retrieval from the skill database, enabling the agent to interact effectively within the Minecraft environment. We also collect the STEVE-21K dataset, which includes 600+ vision-environment pairs, 20K knowledge question-answering pairs, and 200+ skill-code pairs. We evaluate performance on continuous block search, knowledge question answering, and tech tree mastery. Extensive experiments show that STEVE unlocks key tech trees up to 1.5x faster and completes block search tasks up to 2.5x quicker.
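The abstract's three components can be read as a perceive-plan-act loop. The sketch below is a minimal illustration of that loop under assumed names: every class and function here (SkillDatabase, vision_perception, language_instruction, code_action, the llm and env handles, and the word-overlap retrieval) is a hypothetical stand-in, not the authors' actual implementation; the paper's own vision module and skill-retrieval method would replace these placeholders.

```python
# Minimal sketch of a STEVE-style perceive -> plan -> act loop.
# All names below are illustrative assumptions, not the authors' API.
from dataclasses import dataclass, field


@dataclass
class AgentState:
    inventory: dict = field(default_factory=dict)
    position: tuple = (0, 0, 0)


class SkillDatabase:
    """Maps skill names to executable code snippets (hypothetical)."""

    def __init__(self, skill_code_pairs):
        self.skills = skill_code_pairs  # e.g. {"mine_log": "<code>", ...}

    def retrieve(self, subgoal: str) -> str:
        # Placeholder retrieval: pick the skill whose name shares the most
        # words with the subgoal. A real system would likely use embeddings.
        def overlap(name):
            return len(set(name.split("_")) & set(subgoal.lower().split()))
        best = max(self.skills, key=overlap)
        return self.skills[best]


def vision_perception(frame) -> str:
    """Interpret the current visual observation into a text description."""
    return "oak trees ahead, night approaching"  # stand-in for a vision model


def language_instruction(llm, task: str, observation: str, state: AgentState):
    """Iteratively reason and decompose the task into manageable subgoals."""
    prompt = (f"Task: {task}\nObservation: {observation}\n"
              f"Inventory: {state.inventory}\nList the next subgoals:")
    return llm(prompt)  # expected to return a list of subgoal strings


def code_action(skill_db: SkillDatabase, subgoals, env):
    """Turn each subgoal into an executable skill action and run it."""
    for subgoal in subgoals:
        env.execute(skill_db.retrieve(subgoal))
```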
Related papers
- ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting [24.56720920528011]
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges.
A key issue is the difficulty in smoothly connecting individual entities in low-level observations with abstract concepts required for planning.
We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models.
arXiv Detail & Related papers (2024-10-23T13:26:59Z)
- Odyssey: Empowering Minecraft Agents with Open-World Skills [26.537984734738764]
We introduce Odyssey, a new framework that empowers Large Language Model (LLM)-based agents with open-world skills to explore the vast Minecraft world.
Odyssey comprises three key parts: (1) An interactive agent with an open-world skill library that consists of 40 primitive skills and 183 compositional skills; (2) A fine-tuned LLaMA-3 model trained on a large question-answering dataset with 390k+ instruction entries derived from the Minecraft Wiki; and (3) A new agent capability benchmark.
arXiv Detail & Related papers (2024-07-22T02:06:59Z)
- DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control [53.80518003412016]
Building a general-purpose intelligent home-assistant agent skilled in diverse tasks by human commands is a long-term blueprint of embodied AI research.
We study primitive mobile manipulations for embodied agents, i.e. how to navigate and interact based on an instructed verb-noun pair.
We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls.
arXiv Detail & Related papers (2024-07-20T05:39:28Z)
- kNN-ICL: Compositional Task-Oriented Parsing Generalization with Nearest Neighbor In-Context Learning [50.40636157214161]
Task-Oriented Parsing (TOP) enables conversational assistants to interpret user commands expressed in natural language.
LLMs have achieved impressive performance at generating computer programs from natural language prompts.
This paper focuses on harnessing the capabilities of LLMs for semantic parsing tasks.
arXiv Detail & Related papers (2023-12-17T17:26:50Z)
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present the Language-Assisted Multi-Modal (LAMM) instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark covering a wide range of 2D and 3D vision tasks.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
- Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory [97.87093169454431]
Ghost in the Minecraft (GITM) is a novel framework that integrates Large Language Models (LLMs) with text-based knowledge and memory.
We develop a set of structured actions and leverage LLMs to generate action plans for the agents to execute.
The resulting LLM-based agent markedly surpasses previous methods, achieving a remarkable improvement of +47.5% in success rate.
arXiv Detail & Related papers (2023-05-25T17:59:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.