GameEval: Evaluating LLMs on Conversational Games
- URL: http://arxiv.org/abs/2308.10032v1
- Date: Sat, 19 Aug 2023 14:33:40 GMT
- Title: GameEval: Evaluating LLMs on Conversational Games
- Authors: Dan Qiao, Chenfei Wu, Yaobo Liang, Juntao Li, Nan Duan
- Abstract summary: We propose GameEval, a novel approach to evaluating large language models (LLMs)
GameEval treats LLMs as game players and assigns them distinct roles with specific goals achieved by launching conversations of various forms.
We show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems.
- Score: 93.40433639746331
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancements in large language models (LLMs) have presented
challenges in evaluating those models. Existing evaluation methods are either
reference-based or preference based, which inevitably need human intervention
or introduce test bias caused by evaluator models. In this paper, we propose
GameEval, a novel approach to evaluating LLMs through goal-driven
conversational games, overcoming the limitations of previous methods. GameEval
treats LLMs as game players and assigns them distinct roles with specific goals
achieved by launching conversations of various forms, including discussion,
question answering, and voting. We design three unique games with cooperative
or adversarial objectives, accompanied by corresponding evaluation metrics, to
show how this new paradigm comprehensively evaluates model performance.Through
extensive experiments, we show that GameEval can effectively differentiate the
capabilities of various LLMs, providing a comprehensive assessment of their
integrated abilities to solve complex problems. Our public anonymous code is
available at https://github.com/GameEval/GameEval.
Related papers
- Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests [89.09172401497213]
We examine three evaluation paradigms: large question-answering benchmarks, interactive games, and cognitive tests.
We compile a suite of targeted tests that measure cognitive abilities deemed essential for effective language use.
Our analyses reveal that interactive games are superior to standard benchmarks in discriminating models.
arXiv Detail & Related papers (2025-02-20T08:36:58Z) - RPGBENCH: Evaluating Large Language Models as Role-Playing Game Engines [34.002194150560086]
We present RPGBench, the first benchmark designed to evaluate large language models (LLMs) as text-based role-playing game (RPG) engines.
RPGBench comprises two core tasks: Game Creation (GC) and Game Simulation (GS)
arXiv Detail & Related papers (2025-02-01T23:40:24Z) - Beyond Outcomes: Transparent Assessment of LLM Reasoning in Games [54.49589494014147]
GAMEBoT is a gaming arena designed for rigorous assessment of Large Language Models.
We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics.
Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts.
arXiv Detail & Related papers (2024-12-18T08:32:53Z) - GameArena: Evaluating LLM Reasoning through Live Computer Games [25.415321902887598]
We introduce GameArena, a benchmark to evaluate large language models (LLMs) reasoning capabilities through interactive gameplay with humans.
GameArena consists of three games to test specific reasoning capabilities (e.g., deductive and inductive reasoning) while keeping participants entertained and engaged.
We collect over 2000 game sessions and provide detailed assessments of various reasoning capabilities for five state-of-the-art LLMs.
arXiv Detail & Related papers (2024-12-09T11:22:59Z) - Evaluating Creativity and Deception in Large Language Models: A Simulation Framework for Multi-Agent Balderdash [6.65572931991284]
Large Language Models (LLMs) have shown impressive capabilities in complex tasks and interactive environments.
This paper introduces a simulation framework utilizing the game Balderdash to evaluate both the creativity and logical reasoning of LLMs.
arXiv Detail & Related papers (2024-11-15T18:42:48Z) - DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z) - A Comprehensive Analysis of the Effectiveness of Large Language Models
as Automatic Dialogue Evaluators [46.939611070781794]
Large language models (LLMs) are shown to be promising substitutes for human judges.
We analyze the multi-dimensional evaluation capability of 30 recently emerged LLMs at both turn and dialogue levels.
We also probe the robustness of the LLMs in handling various adversarial perturbations at both turn and dialogue levels.
arXiv Detail & Related papers (2023-12-24T04:50:57Z) - Leveraging Word Guessing Games to Assess the Intelligence of Large
Language Models [105.39236338147715]
The paper is inspired by the popular language game Who is Spy''
We develop DEEP to evaluate LLMs' expression and disguising abilities.
We then introduce SpyGame, an interactive multi-agent framework.
arXiv Detail & Related papers (2023-10-31T14:37:42Z) - Clembench: Using Game Play to Evaluate Chat-Optimized Language Models as
Conversational Agents [20.202525145391093]
Recent work has proposed a methodology for the systematic evaluation of "Situated Language Understanding Agents"
This paper explores: Can Large Language Models be evaluated meaningfully by exposing them to constrained game-like settings?
As a proof of concept, this paper investigates five interaction settings, showing that current chat-optimised LLMs are, to an extent, capable to follow game-play instructions.
arXiv Detail & Related papers (2023-05-22T19:56:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.