Beyond Survival: Evaluating LLMs in Social Deduction Games with Human-Aligned Strategies
- URL: http://arxiv.org/abs/2510.11389v1
- Date: Mon, 13 Oct 2025 13:33:30 GMT
- Title: Beyond Survival: Evaluating LLMs in Social Deduction Games with Human-Aligned Strategies
- Authors: Zirui Song, Yuan Huang, Junchang Liu, Haozhe Luo, Chenxi Wang, Lang Gao, Zixiang Xu, Mingfei Han, Xiaojun Chang, Xiuying Chen
- Abstract summary: Social deduction games like Werewolf combine language, reasoning, and strategy. We curate a high-quality, human-verified multimodal Werewolf dataset containing over 100 hours of video, 32.4M utterance tokens, and 15 rule variants. We propose a novel strategy-alignment evaluation that leverages the winning faction's strategies as ground truth in two stages.
- Score: 54.08697738311866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social deduction games like Werewolf combine language, reasoning, and strategy, providing a testbed for studying natural language and social intelligence. However, most studies reduce the game to LLM-based self-play, yielding templated utterances and anecdotal cases that overlook the richness of social gameplay. Evaluation further relies on coarse metrics such as survival time or subjective scoring due to the lack of quality reference data. To address these gaps, we curate a high-quality, human-verified multimodal Werewolf dataset containing over 100 hours of video, 32.4M utterance tokens, and 15 rule variants. Based on this dataset, we propose a novel strategy-alignment evaluation that leverages the winning faction's strategies as ground truth in two stages: 1) Speech evaluation, formulated as multiple-choice-style tasks that assess whether the model can adopt appropriate stances across five dimensions of social ability; and 2) Decision evaluation, which assesses the model's voting choices and opponent-role inferences. This framework enables a fine-grained evaluation of models' linguistic and reasoning capabilities, while capturing their ability to generate strategically coherent gameplay. Our experiments show that state-of-the-art LLMs exhibit diverse performance, with roughly half remaining below 0.50, revealing clear gaps in deception and counterfactual reasoning. We hope our dataset further inspires research on language, reasoning, and strategy in multi-agent interaction.
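To make the two-stage protocol concrete, here is a minimal scoring sketch in Python. The record layouts (`dimension`, `model_choice`, `gold_choice`, `model_vote`, `gold_roles`, etc.), the per-dimension accuracy metric, and the exact-match criterion for role inferences are illustrative assumptions; the paper's actual rubric and aggregation may differ.

```python
# Hypothetical sketch of the two-stage strategy-alignment scoring described in the
# abstract. Field names and the equal-weight, exact-match scoring are assumptions
# made for illustration, not the paper's specification.
from collections import defaultdict
from typing import Iterable, Mapping


def speech_score(items: Iterable[Mapping]) -> dict:
    """Stage 1: multiple-choice stance accuracy, grouped by social-ability dimension."""
    correct, total = defaultdict(int), defaultdict(int)
    for it in items:  # each item: {"dimension": str, "model_choice": str, "gold_choice": str}
        d = it["dimension"]
        total[d] += 1
        correct[d] += int(it["model_choice"] == it["gold_choice"])
    return {d: correct[d] / total[d] for d in total}


def decision_score(rounds: Iterable[Mapping]) -> dict:
    """Stage 2: accuracy of voting choices and opponent-role inferences."""
    vote_hits = role_hits = n = 0
    for r in rounds:  # each round: {"model_vote", "gold_vote", "model_roles", "gold_roles"}
        n += 1
        vote_hits += int(r["model_vote"] == r["gold_vote"])
        role_hits += int(r["model_roles"] == r["gold_roles"])
    return {"vote_acc": vote_hits / n, "role_acc": role_hits / n}


# Toy usage with made-up records (values are illustrative only):
print(speech_score([{"dimension": "persuasion", "model_choice": "B", "gold_choice": "B"}]))
print(decision_score([{"model_vote": "P3", "gold_vote": "P3",
                       "model_roles": {"P3": "werewolf"}, "gold_roles": {"P3": "werewolf"}}]))
```

The ground truth in both stages comes from the winning faction's recorded strategies, so the sketch simply treats alignment with those references as correctness.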
Related papers
- Game of Thought: Robust Information Seeking with Large Language Models Using Game Theory [37.51238507036326]
We use the game of Twenty Questions to evaluate the information-seeking ability of Large Language Models (LLMs). We propose Game of Thought (GoT), a framework that applies game-theoretic techniques to approximate a Nash equilibrium (NE) strategy for the restricted variant of the game.
arXiv Detail & Related papers (2026-02-02T06:33:18Z) - Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models [57.33350664910483]
We introduce Squid Game, a dynamic and adversarial evaluation environment with resource-constrained and asymmetric information settings. We evaluate over 50 LLMs on Squid Game, presenting the largest behavioral evaluation study of general LLMs on dynamic adversarial scenarios.
arXiv Detail & Related papers (2025-11-12T06:06:29Z) - TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation [55.55404595177229]
Large Language Models (LLMs) are exhibiting emergent human-like abilities. TwinVoice is a benchmark for assessing persona simulation across diverse real-world contexts.
arXiv Detail & Related papers (2025-10-29T14:00:42Z) - LLMsPark: A Benchmark for Evaluating Large Language Models in Strategic Gaming Contexts [19.97430860742638]
We present a game theory-based evaluation platform that measures large language models' decision-making strategies and social behaviors in classic game-theoretic settings. Our system cross-evaluates 15 leading LLMs using leaderboard rankings and scoring mechanisms. This work introduces a novel perspective for evaluating LLMs' strategic intelligence, enriching existing benchmarks and broadening their assessment in interactive, game-theoretic scenarios.
arXiv Detail & Related papers (2025-09-20T10:21:17Z) - InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles [39.025684190110276]
Social deduction games provide a natural testbed for evaluating individualized reasoning styles. We introduce InMind, a cognitively grounded evaluation framework designed to assess whether LLMs can capture and apply personalized reasoning styles. As a case study, we apply InMind to the game Avalon, evaluating 11 state-of-the-art LLMs.
arXiv Detail & Related papers (2025-08-22T04:04:00Z) - KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation [78.96590724864606]
We introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios.
arXiv Detail & Related papers (2025-05-20T16:06:32Z) - If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs [55.8331366739144]
We introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in large language models (LLMs). Our fact-checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking, across both parametric and non-parametric approaches.
arXiv Detail & Related papers (2025-03-30T16:50:57Z) - Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models [105.39236338147715]
The paper is inspired by the popular language game "Who is Spy".
We develop DEEP to evaluate LLMs' expression and disguising abilities.
We then introduce SpyGame, an interactive multi-agent framework.
arXiv Detail & Related papers (2023-10-31T14:37:42Z) - GameEval: Evaluating LLMs on Conversational Games [93.40433639746331]
We propose GameEval, a novel approach to evaluating large language models (LLMs).
GameEval treats LLMs as game players and assigns them distinct roles with specific goals achieved by launching conversations of various forms.
We show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems.
arXiv Detail & Related papers (2023-08-19T14:33:40Z) - Do LLMs Understand Social Knowledge? Evaluating the Sociability of Large Language Models with SocKET Benchmark [14.922083834969323]
Large language models (LLMs) have been shown to perform well at a variety of syntactic, discourse, and reasoning tasks.
We introduce a new theory-driven benchmark, SocKET, that contains 58 NLP tasks testing social knowledge.
arXiv Detail & Related papers (2023-05-24T09:21:06Z)