Werewolf Arena: A Case Study in LLM Evaluation via Social Deduction
- URL: http://arxiv.org/abs/2407.13943v1
- Date: Thu, 18 Jul 2024 23:41:05 GMT
- Title: Werewolf Arena: A Case Study in LLM Evaluation via Social Deduction
- Authors: Suma Bailis, Jane Friedhoff, Feiyang Chen,
- Abstract summary: Werewolf Arena is a framework for evaluating large language models (LLMs)
In Werewolf Arena, LLMs compete against each other, navigating the game's complex dynamics of deception, deduction, and persuasion.
We demonstrate Werewolf Arena's utility through an arena-style tournament featuring Gemini and GPT models.
- Score: 3.350801757799469
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper introduces Werewolf Arena, a novel framework for evaluating large language models (LLMs) through the lens of the classic social deduction game, Werewolf. In Werewolf Arena, LLMs compete against each other, navigating the game's complex dynamics of deception, deduction, and persuasion. The framework introduces a dynamic turn-taking system based on bidding, mirroring real-world discussions where individuals strategically choose when to speak. We demonstrate the framework's utility through an arena-style tournament featuring Gemini and GPT models. Our results reveal distinct strengths and weaknesses in the models' strategic reasoning and communication. These findings highlight Werewolf Arena's potential as a challenging and scalable LLM benchmark.
Related papers
- Enhancing Dialogue Generation in Werewolf Game Through Situation Analysis and Persuasion Strategies [1.7725414095035827]
This paper introduces a LLM-based Werewolf Game AI, where each role is supported by situation analysis to aid response generation.
Various persuasion strategies are employed to effectively persuade other players to align with its actions.
arXiv Detail & Related papers (2024-08-29T14:49:13Z) - Learning to Discuss Strategically: A Case Study on One Night Ultimate Werewolf [28.57358844115881]
As a variant of the famous communication game Werewolf, One Night Ultimate Werewolf (ONUW) requires players to develop strategic discussion policies.
We propose an RL-instructed language agent framework, where a discussion policy trained by reinforcement learning (RL) is employed to determine appropriate discussion tactics to adopt.
arXiv Detail & Related papers (2024-05-30T11:07:06Z) - GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications.
This paper evaluates LLMs' reasoning abilities in competitive environments.
We first propose GTBench, a language-driven environment composing 10 widely recognized tasks.
arXiv Detail & Related papers (2024-02-19T18:23:36Z) - Enhance Reasoning for Large Language Models in the Game Werewolf [15.730860371636336]
This paper presents an innovative framework that integrates Large Language Models (LLMs) with an external Thinker module.
Our framework is presented using a 9-player Werewolf game that demands dual-system reasoning.
Experiments demonstrate the framework's effectiveness in deductive reasoning, speech generation, and online game evaluation.
arXiv Detail & Related papers (2024-02-04T03:47:10Z) - ALYMPICS: LLM Agents Meet Game Theory -- Exploring Strategic
Decision-Making with AI Agents [77.34720446306419]
Alympics is a systematic simulation framework utilizing Large Language Model (LLM) agents for game theory research.
Alympics creates a versatile platform for studying complex game theory problems.
arXiv Detail & Related papers (2023-11-06T16:03:46Z) - Leveraging Word Guessing Games to Assess the Intelligence of Large
Language Models [105.39236338147715]
The paper is inspired by the popular language game Who is Spy''
We develop DEEP to evaluate LLMs' expression and disguising abilities.
We then introduce SpyGame, an interactive multi-agent framework.
arXiv Detail & Related papers (2023-10-31T14:37:42Z) - LLM-Based Agent Society Investigation: Collaboration and Confrontation in Avalon Gameplay [55.12945794835791]
Using Avalon as a testbed, we employ system prompts to guide LLM agents in gameplay.
We propose a novel framework, tailored for Avalon, features a multi-agent system facilitating efficient communication and interaction.
Results affirm the framework's effectiveness in creating adaptive agents and suggest LLM-based agents' potential in navigating dynamic social interactions.
arXiv Detail & Related papers (2023-10-23T14:35:26Z) - Avalon's Game of Thoughts: Battle Against Deception through Recursive
Contemplation [80.126717170151]
This study utilizes the intricate Avalon game as a testbed to explore LLMs' potential in deceptive environments.
We introduce a novel framework, Recursive Contemplation (ReCon), to enhance LLMs' ability to identify and counteract deceptive information.
arXiv Detail & Related papers (2023-10-02T16:27:36Z) - Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf [19.39740531672788]
We propose a tuning-free framework to engage large language models in communication games.
An empirical study on the representative and widely-studied communication game, Werewolf'', demonstrates that our framework can effectively play Werewolf game without tuning the parameters of the LLMs.
arXiv Detail & Related papers (2023-09-09T01:56:40Z) - GameEval: Evaluating LLMs on Conversational Games [93.40433639746331]
We propose GameEval, a novel approach to evaluating large language models (LLMs)
GameEval treats LLMs as game players and assigns them distinct roles with specific goals achieved by launching conversations of various forms.
We show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems.
arXiv Detail & Related papers (2023-08-19T14:33:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.