A LLM Benchmark based on the Minecraft Builder Dialog Agent Task
- URL: http://arxiv.org/abs/2407.12734v1
- Date: Wed, 17 Jul 2024 16:52:23 GMT
- Title: A LLM Benchmark based on the Minecraft Builder Dialog Agent Task
- Authors: Chris Madge, Massimo Poesio
- Abstract summary: This work proposes adapting the Minecraft builder task into an LLM benchmark suitable for evaluating LLM ability in spatially oriented tasks.
We believe this approach allows us to probe specific strengths and weaknesses of different agents, and to test the ability of LLMs in the challenging areas of spatial reasoning and vector-based math.
- Score: 5.555936227537389
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work we propose adapting the Minecraft builder task into an LLM benchmark suitable for evaluating LLM ability in spatially oriented tasks, and for informing builder agent design. Previous works have proposed corpora with structures of varying complexity and human-written instructions. We instead attempt to provide a comprehensive synthetic benchmark for testing builder agents over a series of distinct tasks that comprise common building operations. We believe this approach allows us to probe specific strengths and weaknesses of different agents, and to test the ability of LLMs in the challenging areas of spatial reasoning and vector-based math.
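To make the benchmark format concrete, here is a minimal sketch of what one synthetic building-operation item and its scoring might look like. The `Block` class, the `row_task` generator, and the block-set F1 metric are illustrative assumptions, not the authors' released task format.

```python
# Hypothetical sketch of a synthetic builder-benchmark item; the field
# names and scoring below are assumptions for illustration only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    x: int
    y: int
    z: int
    colour: str

def row_task(start: Block, length: int) -> tuple[str, set[Block]]:
    """Generate one 'place a row' item: an instruction plus gold blocks."""
    instruction = (
        f"Place a row of {length} {start.colour} blocks along the x-axis, "
        f"starting at ({start.x}, {start.y}, {start.z})."
    )
    gold = {Block(start.x + i, start.y, start.z, start.colour)
            for i in range(length)}
    return instruction, gold

def score(predicted: set[Block], gold: set[Block]) -> float:
    """F1 over placed blocks; 1.0 means a perfect build."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

instruction, gold = row_task(Block(0, 0, 0, "red"), 3)
# An LLM's answer, parsed into block placements, is scored against gold:
print(score({Block(0, 0, 0, "red"), Block(1, 0, 0, "red"),
             Block(2, 0, 0, "red")}, gold))  # 1.0
```

A battery of such generators (rows, towers, diagonals, rotations) would probe the spatial and vector-arithmetic abilities the abstract targets, one operation at a time.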
Related papers
- Multi-Mission Tool Bench: Assessing the Robustness of LLM based Agents through Related and Dynamic Missions [12.218102495632937]
Large language models (LLMs) demonstrate strong potential as agents for tool invocation due to their advanced comprehension and planning capabilities.
We propose the Multi-Mission Tool Bench. In the benchmark, each test case comprises multiple interrelated missions.
We also propose a novel method to evaluate the accuracy and efficiency of agent decisions with dynamic decision trees.
arXiv Detail & Related papers (2025-04-03T14:21:33Z)
- APT: Architectural Planning and Text-to-Blueprint Construction Using Large Language Models for Open-World Agents [8.479128275067742]
We present an advanced Large Language Model (LLM)-driven framework that enables autonomous agents to construct complex structures in Minecraft.
By employing chain-of-thought decomposition along with multimodal inputs, the framework generates detailed architectural layouts and blueprints.
Our agent incorporates both memory and reflection modules to facilitate lifelong learning, adaptive refinement, and error correction throughout the building process.
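As a rough illustration of the blueprint idea, the snippet below encodes a structure as layered block grids that an agent could execute bottom-up; the text legend and the `to_placements` helper are hypothetical, not APT's actual output format.

```python
# Illustrative only: a minimal layered text blueprint of the kind an
# LLM might emit, not APT's actual blueprint schema.
blueprint = [
    # layer 0 (ground): 3x3 stone floor
    ["SSS",
     "SSS",
     "SSS"],
    # layer 1: stone walls with air cells left open
    ["SSS",
     "S.S",
     "S.S"],
]

LEGEND = {"S": "stone", ".": "air"}

def to_placements(layers):
    """Expand a layered blueprint into (x, y, z, material) actions."""
    for y, layer in enumerate(layers):          # y = build height
        for z, row in enumerate(layer):
            for x, cell in enumerate(row):
                if LEGEND[cell] != "air":
                    yield (x, y, z, LEGEND[cell])

print(list(to_placements(blueprint))[:3])
```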
arXiv Detail & Related papers (2024-11-26T09:31:28Z)
- WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks [85.95607119635102]
Large language models (LLMs) can mimic human-like intelligence.
WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents.
arXiv Detail & Related papers (2024-07-07T07:15:49Z)
- Retrieval-Augmented Code Generation for Situated Action Generation: A Case Study on Minecraft [18.256529559741075]
In the Minecraft Collaborative Building Task, two players collaborate: an Architect (A) provides instructions to a Builder (B) to assemble a specified structure using 3D blocks.
We investigate the use of large language models (LLMs) to predict the sequence of actions taken by the Builder.
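A minimal sketch of how builder-action prediction might be evaluated follows, assuming a simple place/remove action grammar and comparing net world states rather than raw action order; both assumptions are ours, not the paper's protocol.

```python
# Hedged sketch: the action grammar and net-state comparison are
# illustrative assumptions, not the paper's exact evaluation.
Action = tuple[str, int, int, int, str]  # (op, x, y, z, colour)

def apply(actions: list[Action]) -> dict[tuple[int, int, int], str]:
    """Replay place/remove actions into a final world state."""
    world: dict[tuple[int, int, int], str] = {}
    for op, x, y, z, colour in actions:
        if op == "place":
            world[(x, y, z)] = colour
        elif op == "remove":
            world.pop((x, y, z), None)
    return world

gold = [("place", 0, 0, 0, "blue"), ("place", 0, 1, 0, "blue")]
pred = [("place", 0, 1, 0, "blue"), ("place", 0, 0, 0, "blue")]
# Comparing net world states tolerates differently ordered but
# equivalent action sequences:
print(apply(pred) == apply(gold))  # True
```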
arXiv Detail & Related papers (2024-06-25T13:43:24Z)
- A Survey of Useful LLM Evaluation [20.048914787813263]
Two-stage framework: from "core ability" to "agent"
In the "core ability" stage, we discussed the reasoning ability, societal impact, and domain knowledge of LLMs.
In the "agent" stage, we demonstrated embodied action, planning, and tool learning in LLM agent applications.
arXiv Detail & Related papers (2024-06-03T02:20:03Z)
- PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking them at the sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z)
- Large Language Models as Minecraft Agents [6.563602649100242]
We examine the use of Large Language Models (LLMs) in the challenging setting of acting as a Minecraft agent.
We introduce clarification questions and examine the challenges and opportunities for improvement.
arXiv Detail & Related papers (2024-02-13T11:37:30Z)
- Small LLMs Are Weak Tool Learners: A Multi-LLM Agent [73.54562551341454]
Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs.
We propose a novel approach that decomposes the aforementioned capabilities into a planner, caller, and summarizer.
This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability.
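A minimal sketch of the planner/caller/summarizer decomposition follows; the `run_agent` wiring and the text-in/text-out `LLM` stub are hypothetical, not the paper's API, but they show how each role could be served by a separate, smaller model.

```python
# Illustrative split into three roles; any text-in/text-out model can
# fill each slot, so the three LLMs are independently swappable.
from typing import Callable

LLM = Callable[[str], str]  # any text-in/text-out model

def run_agent(task: str, planner: LLM, caller: LLM, summarizer: LLM) -> str:
    plan = planner(f"Break this task into tool-use steps:\n{task}")
    tool_results = []
    for step in plan.splitlines():
        if step.strip():
            # The caller turns one plan step into a concrete tool invocation.
            tool_results.append(caller(f"Invoke a tool for: {step}"))
    # The summarizer composes the final answer from raw tool outputs.
    return summarizer("Answer the task from these results:\n"
                      + "\n".join(tool_results))

# Trivial stub models stand in for three independently trained LLMs:
echo = lambda prompt: prompt.splitlines()[-1]
print(run_agent("Find the weather in Paris", echo, echo, echo))
```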
arXiv Detail & Related papers (2024-01-14T16:17:07Z)
- TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation.
Specifically, task decomposition, tool selection, and parameter prediction are assessed.
Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
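As a hedged illustration of the three dimensions assessed above, the snippet below scores a prediction against gold annotations with simple set- and exact-match metrics; the field names and metrics are assumptions, not TaskBench's released scorer.

```python
# Assumed annotation schema and metrics, for illustration only.
gold = {
    "subtasks": {"fetch image", "apply filter"},
    "tools": {"image_downloader", "image_editor"},
    "params": {"image_editor": {"filter": "sepia"}},
}
pred = {
    "subtasks": {"fetch image", "apply filter"},
    "tools": {"image_downloader", "image_editor"},
    "params": {"image_editor": {"filter": "sepia"}},
}

def jaccard(a: set, b: set) -> float:
    """Set overlap in [0, 1]; 1.0 when prediction matches gold exactly."""
    return len(a & b) / len(a | b) if a | b else 1.0

scores = {
    "decomposition": jaccard(pred["subtasks"], gold["subtasks"]),
    "tool_selection": jaccard(pred["tools"], gold["tools"]),
    "parameter_prediction": float(pred["params"] == gold["params"]),
}
print(scores)  # all 1.0 for this perfect prediction
```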
arXiv Detail & Related papers (2023-11-30T18:02:44Z)
- BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents [103.28404907655542]
Large language models (LLMs) have led to the emerging exploration of LLM-augmented Autonomous Agents (LAAs).
This paper provides a comprehensive comparison of LAAs in terms of both agent architectures and LLM backbones.
We propose BOLAA, a new strategy to orchestrate multiple LAAs such that each labor LAA focuses on one type of action, with a controller managing the communication among the agents.
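The routing idea might look like the following sketch, where a controller dispatches each action to the labor agent specialised for its type; the agent names and dispatch rule are illustrative, not the paper's implementation.

```python
# Illustrative controller/labor-agent routing in the spirit of BOLAA;
# the registered agents and dispatch rule are assumptions.
from typing import Callable

labor_agents: dict[str, Callable[[str], str]] = {
    # each labor agent handles exactly one action type
    "search": lambda q: f"[search results for: {q}]",
    "click":  lambda q: f"[clicked element: {q}]",
    "buy":    lambda q: f"[purchase executed: {q}]",
}

def controller(action_type: str, payload: str) -> str:
    """Route one action to the specialised labor agent for its type."""
    agent = labor_agents.get(action_type)
    if agent is None:
        return f"[no labor agent registered for '{action_type}']"
    return agent(payload)

print(controller("search", "noise-cancelling headphones"))
```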
arXiv Detail & Related papers (2023-08-11T06:37:54Z)
- AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks.
We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)
- TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage [28.554981886052953]
Large Language Models (LLMs) have emerged as powerful tools for various real-world applications.
Despite their prowess, the intrinsic generative abilities of LLMs may prove insufficient for handling complex tasks.
This paper proposes a structured framework tailored for LLM-based AI Agents.
arXiv Detail & Related papers (2023-08-07T09:22:03Z)