MCU: A Task-centric Framework for Open-ended Agent Evaluation in
Minecraft
- URL: http://arxiv.org/abs/2310.08367v1
- Date: Thu, 12 Oct 2023 14:38:25 GMT
- Title: MCU: A Task-centric Framework for Open-ended Agent Evaluation in
Minecraft
- Authors: Haowei Lin, Zihao Wang, Jianzhu Ma, Yitao Liang
- Abstract summary: This paper introduces a task-centric framework named MCU for Minecraft agent evaluation.
Within the MCU framework, each task is measured with six distinct difficulty scores.
We show that MCU has high expressivity, covering all tasks used in recent literature on Minecraft agents.
- Score: 28.585449904964033
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To pursue the goal of creating an open-ended agent in Minecraft, an
open-ended game environment with unlimited possibilities, this paper introduces
a task-centric framework named MCU for Minecraft agent evaluation. The MCU
framework leverages the concept of atom tasks as fundamental building blocks,
enabling the generation of diverse or even arbitrary tasks. Within the MCU
framework, each task is measured with six distinct difficulty scores (time
consumption, operational effort, planning complexity, intricacy, creativity,
novelty). These scores offer a multi-dimensional assessment of a task from
different angles, and thus can reveal an agent's capability on specific facets.
The difficulty scores also serve as features of each task, creating a
meaningful task space and unveiling the relationships between tasks. For efficient
evaluation of Minecraft agents employing the MCU framework, we maintain a
unified benchmark, namely SkillForge, which comprises representative tasks with
diverse categories and difficulty distribution. We also provide convenient
filters for users to select tasks to assess specific capabilities of agents. We
show that MCU has high expressivity, covering all tasks used in recent
literature on Minecraft agents, and that it underscores the need for
advancements in areas such as creativity, precise control, and
out-of-distribution generalization toward the goal of open-ended Minecraft
agent development.
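As a minimal sketch of how a task representation built on the six difficulty scores and a benchmark filter might look in code; the class, field, and function names below are hypothetical illustrations and are not taken from the MCU or SkillForge codebase:

```python
# Hypothetical sketch only: field and function names mirror the six difficulty
# dimensions named in the abstract, not the actual MCU implementation.
from dataclasses import dataclass, asdict
from typing import List
import math


@dataclass
class MCUTask:
    """A Minecraft atom task annotated with the six MCU difficulty scores."""
    name: str
    time_consumption: float      # expected time cost of the task
    operational_effort: float    # low-level control burden
    planning_complexity: float   # length/branching of the required plan
    intricacy: float             # amount of interacting sub-goals
    creativity: float            # open-endedness of acceptable solutions
    novelty: float               # distance from commonly seen tasks

    def feature_vector(self) -> List[float]:
        """The difficulty scores double as the task's coordinates in task space."""
        return [
            self.time_consumption, self.operational_effort,
            self.planning_complexity, self.intricacy,
            self.creativity, self.novelty,
        ]


def task_distance(a: MCUTask, b: MCUTask) -> float:
    """Euclidean distance between two tasks in the six-dimensional difficulty space."""
    return math.dist(a.feature_vector(), b.feature_vector())


def filter_tasks(benchmark: List[MCUTask], dimension: str, threshold: float) -> List[MCUTask]:
    """Select tasks that stress one capability, e.g. high-creativity tasks."""
    return [t for t in benchmark if asdict(t)[dimension] >= threshold]


if __name__ == "__main__":
    # Toy stand-in for a SkillForge-style benchmark.
    benchmark = [
        MCUTask("mine_iron_ore", 0.4, 0.5, 0.3, 0.2, 0.1, 0.2),
        MCUTask("build_a_treehouse", 0.7, 0.6, 0.7, 0.8, 0.9, 0.6),
    ]
    print(task_distance(benchmark[0], benchmark[1]))
    print([t.name for t in filter_tasks(benchmark, "creativity", 0.5)])
```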
Related papers
- HarmonicEval: Multi-modal, Multi-task, Multi-criteria Automatic Evaluation Using a Vision Language Model [42.62148712511799]
Vision-language models (VLMs) have shown impressive abilities in text and image understanding.
Existing metrics for evaluating the text generated by VLMs focus exclusively on overall quality.
We propose HarmonicEval, a reference-free evaluation metric that aggregates criterion-wise scores to produce the overall score in a bottom-up manner.
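A rough illustration of bottom-up, criterion-wise aggregation follows; the criteria names and the weighted-mean rule are assumptions for the sake of example, not the exact HarmonicEval formulation:

```python
# Hedged sketch: combine per-criterion scores into one overall score.
# The weighted arithmetic mean below is an assumed aggregation rule.
from typing import Dict


def aggregate_overall(criterion_scores: Dict[str, float],
                      weights: Dict[str, float]) -> float:
    """Aggregate criterion-wise scores bottom-up into a single overall score."""
    total_weight = sum(weights[c] for c in criterion_scores)
    return sum(weights[c] * s for c, s in criterion_scores.items()) / total_weight


scores = {"fluency": 4.0, "factuality": 3.0, "relevance": 5.0}   # hypothetical criteria
weights = {"fluency": 1.0, "factuality": 2.0, "relevance": 1.0}  # hypothetical weights
print(aggregate_overall(scores, weights))  # 3.75
```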
arXiv Detail & Related papers (2024-12-19T08:03:16Z)
- Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks [112.7791602217381]
We present Dynamic-SUPERB Phase-2, an open benchmark for the comprehensive evaluation of instruction-based universal speech models.
Building upon the first generation, this second version incorporates 125 new tasks, expanding the benchmark to a total of 180 tasks.
Evaluation results indicate that none of the models performed well universally.
arXiv Detail & Related papers (2024-11-08T06:33:22Z)
- MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
The Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains: Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech [107.81472531864195]
Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions.
We present Dynamic-SUPERB, a benchmark for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion.
arXiv Detail & Related papers (2023-09-18T06:43:30Z)
- MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
- Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models [83.63242931107638]
We propose four characteristics of generally intelligent agents.
We argue that active engagement with objects in the real world delivers more robust signals for forming conceptual representations.
We conclude by outlining promising future research directions in the field of artificial general intelligence.
arXiv Detail & Related papers (2023-07-07T13:58:16Z)
- Feature-Attending Recurrent Modules for Generalization in Reinforcement Learning [27.736730414205137]
"Feature- Recurrent Modules" (FARM) is an architecture for learning state representations that relies on simple, broadly applicable inductive biases for spatial and temporal regularities.
FARM learns a state representation that is distributed across multiple modules that each attend to capturing features with an expressive feature attention mechanism.
We show that this improves an RL agents ability to generalize across object-centric tasks.
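A minimal, hypothetical sketch of the general idea of feature attention over a spatial feature map follows; the shapes, the softmax-weighted pooling, and all names are assumptions for illustration, not the published FARM implementation:

```python
# Hedged sketch: several independent feature-attention modules, each pooling a
# spatial feature map into a vector; the agent's state is their concatenation.
import numpy as np

rng = np.random.default_rng(0)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class FeatureAttentionModule:
    """One of several modules; each attends to spatial features independently."""

    def __init__(self, feat_dim: int, key_dim: int):
        self.query = rng.normal(size=key_dim)             # learned query (random here)
        self.proj = rng.normal(size=(feat_dim, key_dim))  # feature -> key projection

    def __call__(self, features: np.ndarray) -> np.ndarray:
        # features: (num_locations, feat_dim), e.g. a flattened CNN feature map
        keys = features @ self.proj          # (num_locations, key_dim)
        attn = softmax(keys @ self.query)    # (num_locations,) attention weights
        return attn @ features               # attention-pooled vector, (feat_dim,)


modules = [FeatureAttentionModule(feat_dim=16, key_dim=8) for _ in range(4)]
feature_map = rng.normal(size=(49, 16))      # e.g. a 7x7 spatial grid of features
state = np.concatenate([m(feature_map) for m in modules])
print(state.shape)  # (64,)
```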
arXiv Detail & Related papers (2021-12-15T12:48:12Z)
- Procedural Generalization by Planning with Self-Supervised World Models [10.119257232716834]
We measure the generalization ability of model-based agents in comparison to their model-free counterparts.
We identify three factors of procedural generalization -- planning, self-supervised representation learning, and procedural data diversity.
We find that these factors do not always provide the same benefits for task generalization.
arXiv Detail & Related papers (2021-11-02T13:32:21Z)