Clembench: Using Game Play to Evaluate Chat-Optimized Language Models as
Conversational Agents
- URL: http://arxiv.org/abs/2305.13455v3
- Date: Thu, 23 Nov 2023 15:47:52 GMT
- Title: Clembench: Using Game Play to Evaluate Chat-Optimized Language Models as
Conversational Agents
- Authors: Kranti Chalamalasetti and Jana Götze and Sherzod Hakimov and Brielen
Madureira and Philipp Sadler and David Schlangen
- Abstract summary: Recent work has proposed a methodology for the systematic evaluation of "Situated Language Understanding Agents".
This paper explores: Can Large Language Models be evaluated meaningfully by exposing them to constrained game-like settings?
As a proof of concept, this paper investigates five interaction settings, showing that current chat-optimised LLMs are, to an extent, capable of following game-play instructions.
- Score: 20.202525145391093
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has proposed a methodology for the systematic evaluation of
"Situated Language Understanding Agents" (agents that operate in rich linguistic
and non-linguistic contexts) by testing them in carefully constructed
interactive settings. Other recent work has argued that Large Language Models
(LLMs), if suitably set up, can be understood as (simulators of) such agents. A
connection suggests itself, which this paper explores: Can LLMs be evaluated
meaningfully by exposing them to constrained game-like settings that are built
to challenge specific capabilities? As a proof of concept, this paper
investigates five interaction settings, showing that current chat-optimised
LLMs are, to an extent, capable of following game-play instructions. Both this
capability and the quality of the game play, measured by how well the
objectives of the different games are met, follow the development cycle, with
newer models performing better. The metrics, even for the comparatively simple
example games, are far from saturated, suggesting that the proposed
instrument will retain its diagnostic value. Our general framework for
implementing and evaluating games with LLMs is available at
https://github.com/clembench.
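The methodology described above (a mediator exposing players to a constrained game, checking instruction compliance, and scoring objective attainment) can be illustrated with a minimal sketch. This is an illustrative assumption of how such a loop might look, not the actual clembench API; the `Player` stubs, function names, and metrics are invented for the example.

```python
# Minimal sketch of a game-master loop in the spirit of the described
# methodology: a mediator runs a simple word-guessing game between two
# (stubbed) players, checks rule/format compliance, and scores the episode.
# NOTE: purely illustrative; this is NOT the clembench framework's API.
from typing import Callable, List

Player = Callable[[List[str]], str]  # maps dialogue history to a reply

def play_guessing_game(describer: Player, guesser: Player,
                       target: str, max_turns: int = 3) -> dict:
    """Run a toy word-guessing episode and return simple metrics."""
    history: List[str] = []
    violations = 0
    for turn in range(max_turns):
        clue = describer(history)
        # Rule check: the describer must never utter the target word.
        if target.lower() in clue.lower():
            violations += 1
        history.append(f"CLUE: {clue}")
        guess = guesser(history)
        # Format check: replies must follow the instructed "GUESS:" prefix.
        if not guess.startswith("GUESS:"):
            violations += 1
            continue
        history.append(guess)
        if guess.removeprefix("GUESS:").strip().lower() == target.lower():
            return {"success": True, "turns": turn + 1,
                    "rule_violations": violations}
    return {"success": False, "turns": max_turns,
            "rule_violations": violations}

# Stub players standing in for chat-optimised LLMs.
def toy_describer(history: List[str]) -> str:
    return "a yellow fruit that monkeys like"

def toy_guesser(history: List[str]) -> str:
    return "GUESS: banana"

result = play_guessing_game(toy_describer, toy_guesser, target="banana")
# result -> {"success": True, "turns": 1, "rule_violations": 0}
```

In an actual evaluation the stub functions would wrap calls to the models under test, and the per-episode metrics would be aggregated into instruction-following and game-quality scores.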
Related papers
- Evaluating and Enhancing LLMs Agent based on Theory of Mind in Guandan: A Multi-Player Cooperative Game under Imperfect Information [36.11862095329315]
Large language models (LLMs) have shown success in handling simple games with imperfect information.
This study investigates the applicability of knowledge acquired by open-source and API-based LLMs to sophisticated text-based games.
arXiv Detail & Related papers (2024-08-05T15:36:46Z)
- clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents [19.989503513817095]
Large Language Models can be prompted to "self-play" conversational games that probe certain capabilities.
We take one of the proposed frameworks for setting up such game-play environments, and test its usefulness as an evaluation instrument.
arXiv Detail & Related papers (2024-05-31T14:43:31Z)
- Deciphering Digital Detectives: Understanding LLM Behaviors and Capabilities in Multi-Agent Mystery Games [26.07074182316433]
We introduce the first dataset specifically for Jubensha, including character scripts and game rules.
Our work also presents a unique multi-agent interaction framework using LLMs, allowing AI agents to autonomously engage in this game.
To evaluate the gaming performance of these AI agents, we developed novel methods measuring their mastery of case information and reasoning skills.
arXiv Detail & Related papers (2023-12-01T17:33:57Z)
- MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [102.41118020705876]
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing.
As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework.
This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)
- Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations [70.7884839812069]
Large language models (LLMs) have emerged as powerful and general solutions to many natural language tasks.
However, many of the most important applications of language generation are interactive, where an agent has to talk to a person to reach a desired outcome.
In this work, we explore a new method for adapting LLMs with RL for such goal-directed dialogue.
arXiv Detail & Related papers (2023-11-09T18:45:16Z)
- Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models [105.39236338147715]
The paper is inspired by the popular language game "Who is Spy".
We develop DEEP to evaluate LLMs' expression and disguising abilities.
We then introduce SpyGame, an interactive multi-agent framework.
arXiv Detail & Related papers (2023-10-31T14:37:42Z)
- GameEval: Evaluating LLMs on Conversational Games [93.40433639746331]
We propose GameEval, a novel approach to evaluating large language models (LLMs)
GameEval treats LLMs as game players and assigns them distinct roles with specific goals achieved by launching conversations of various forms.
We show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems.
arXiv Detail & Related papers (2023-08-19T14:33:40Z)
- SPRING: Studying the Paper and Reasoning to Play Games [102.5587155284795]
We propose a novel approach, SPRING, to read the game's original academic paper and use the knowledge learned to reason about and play the game through a large language model (LLM).
In experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter open-world environment.
Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories.
arXiv Detail & Related papers (2023-05-24T18:14:35Z)
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [111.33545170562337]
We investigate the possibility of grounding high-level tasks, expressed in natural language, to a chosen set of actionable steps.
We find that if pre-trained LMs are large enough and prompted appropriately, they can effectively decompose high-level tasks into low-level plans.
We propose a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions.
arXiv Detail & Related papers (2022-01-18T18:59:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.