Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy
- URL: http://arxiv.org/abs/2508.07485v1
- Date: Sun, 10 Aug 2025 21:07:08 GMT
- Title: Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy
- Authors: Alexander Duffy, Samuel J Paech, Ishana Shastri, Elizabeth Karpinski, Baptiste Alloui-Cros, Tyler Marques, Matthew Lyle Olson
- Abstract summary: We present the first evaluation harness that enables any out-of-the-box, local Large Language Model (LLM) to play full-press Diplomacy without fine-tuning or specialized training. Previous work required frontier LLMs or fine-tuning due to the high complexity and information density of Diplomacy's game state. Our harness democratizes the evaluation of strategic reasoning in LLMs by eliminating the need for fine-tuning, and it provides insights into how these capabilities emerge naturally from widely used LLMs.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the first evaluation harness that enables any out-of-the-box, local Large Language Model (LLM) to play full-press Diplomacy without fine-tuning or specialized training. Previous work required frontier LLMs or fine-tuning due to the high complexity and information density of Diplomacy's game state. Combined with the high variance of matches, these factors made Diplomacy prohibitive to study. In this work, we used data-driven iteration to optimize a textual game-state representation such that a 24B model can reliably complete matches without any fine-tuning. We develop tooling to facilitate hypothesis testing and statistical analysis, and we present case studies on persuasion, aggressive playstyles, and performance across a range of models. We conduct a variety of experiments across many popular LLMs, finding that larger models perform best while smaller models still play adequately. We also introduce Critical State Analysis: an experimental protocol for rapidly iterating on and analyzing key moments in a game at depth. Our harness democratizes the evaluation of strategic reasoning in LLMs by eliminating the need for fine-tuning, and it provides insights into how these capabilities emerge naturally from widely used LLMs. Our code is available in the supplement and will be open sourced.
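The abstract's central engineering contribution is a textual game-state representation compact enough for a 24B model to act on reliably. A minimal sketch of what such a serialization might look like, where the field layout and naming are illustrative assumptions rather than the paper's actual format:

```python
# Illustrative sketch: serializing a Diplomacy game state to compact,
# structured text for an LLM prompt. The line format and field names are
# assumptions for illustration, not the paper's actual schema.

def serialize_state(season, year, units, centers):
    """Render a game state as short, structured lines an LLM can parse."""
    lines = [f"PHASE {season} {year}"]
    for power, power_units in sorted(units.items()):
        # e.g. "FRANCE UNITS: A PAR, A MAR, F BRE"
        lines.append(f"{power} UNITS: " + ", ".join(power_units))
    for power, power_centers in sorted(centers.items()):
        lines.append(f"{power} CENTERS: " + ", ".join(sorted(power_centers)))
    return "\n".join(lines)

state_text = serialize_state(
    season="SPRING", year=1901,
    units={"FRANCE": ["A PAR", "A MAR", "F BRE"],
           "GERMANY": ["A BER", "A MUN", "F KIE"]},
    centers={"FRANCE": ["PAR", "MAR", "BRE"],
             "GERMANY": ["BER", "MUN", "KIE"]},
)
print(state_text)
```

The design pressure the abstract describes is exactly here: the representation must be short enough to fit a small model's effective context while keeping every unit and supply center unambiguous.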
Related papers
- Game of Thought: Robust Information Seeking with Large Language Models Using Game Theory [37.51238507036326]
We use the game of Twenty Questions to evaluate the information-seeking ability of Large Language Models (LLMs). We propose Game of Thought (GoT), a framework that applies game-theoretic techniques to approximate a Nash equilibrium (NE) strategy for the restricted variant of the game.
arXiv Detail & Related papers (2026-02-02T06:33:18Z) - Beyond Survival: Evaluating LLMs in Social Deduction Games with Human-Aligned Strategies [54.08697738311866]
Social deduction games like Werewolf combine language, reasoning, and strategy. We curate a high-quality, human-verified multimodal Werewolf dataset containing over 100 hours of video, 32.4M utterance tokens, and 15 rule variants. We propose a novel strategy-alignment evaluation that leverages the winning faction's strategies as ground truth in two stages.
arXiv Detail & Related papers (2025-10-13T13:33:30Z) - Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games [16.187737674778234]
We present Orak, a benchmark designed to train and evaluate Large Language Model (LLM) agents across diverse real-world video games. To support consistent evaluation of LLMs, we introduce a plug-and-play interface based on the Model Context Protocol (MCP). Orak offers a comprehensive evaluation framework, encompassing general game score leaderboards, LLM battle arenas, and in-depth analyses of visual input state, agentic strategies, and fine-tuning effects.
arXiv Detail & Related papers (2025-06-04T06:40:33Z) - Improving LLM General Preference Alignment via Optimistic Online Mirror Descent [57.622821649679786]
Reinforcement learning from human feedback (RLHF) has demonstrated remarkable effectiveness in aligning large language models (LLMs) with human preferences. In this paper, we drop the Bradley-Terry (BT) model assumption and study LLM alignment under general preferences, formulated as a two-player game. We show that our approach achieves an $O(T^{-1})$ bound on the duality gap, improving upon the previous $O(T^{-1/2})$ result.
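For readers unfamiliar with the technique, optimistic online mirror descent with an entropy regularizer over the simplex reduces to optimistic multiplicative weights. A minimal sketch on a zero-sum matrix game, where the time-averaged strategies exhibit the shrinking duality gap the abstract refers to (the step size, iteration count, and game are illustrative; this is not the paper's algorithm):

```python
import numpy as np

# Sketch of optimistic multiplicative weights (optimistic online mirror
# descent with entropy regularizer) on a two-player zero-sum matrix game.
# Hyperparameters are illustrative assumptions.
def optimistic_mwu(A, T=2000, eta=0.1):
    """Return time-averaged strategies and their duality gap for loss matrix A."""
    n, m = A.shape
    x = np.ones(n) / n            # row player's mixed strategy (minimizes x^T A y)
    y = np.ones(m) / m            # column player's mixed strategy (maximizes)
    gx_prev, gy_prev = np.zeros(n), np.zeros(m)
    x_sum, y_sum = np.zeros(n), np.zeros(m)
    for _ in range(T):
        gx = A @ y                # row player's loss gradient
        gy = -A.T @ x             # column player's loss gradient
        # optimistic step: use 2*g_t - g_{t-1} as the predicted gradient
        x = x * np.exp(-eta * (2 * gx - gx_prev)); x /= x.sum()
        y = y * np.exp(-eta * (2 * gy - gy_prev)); y /= y.sum()
        gx_prev, gy_prev = gx, gy
        x_sum += x; y_sum += y
    x_avg, y_avg = x_sum / T, y_sum / T
    # duality gap: best response value minus best response value
    gap = (x_avg @ A).max() - (A @ y_avg).min()
    return x_avg, y_avg, gap

# Rock-paper-scissors as the row player's loss matrix; equilibrium is uniform.
A = np.array([[0., 1., -1.], [-1., 0., 1.], [1., -1., 0.]])
x_avg, y_avg, gap = optimistic_mwu(A)
print(gap)
```

The optimism (reusing the last gradient as a prediction of the next one) is what lifts the averaged-iterate convergence rate from the plain-mirror-descent $O(T^{-1/2})$ toward $O(T^{-1})$.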
arXiv Detail & Related papers (2025-02-24T05:24:52Z) - Can Large Language Models Capture Video Game Engagement? [1.3873323883842132]
We comprehensively evaluate the capacity of popular Large Language Models to annotate and successfully predict continuous affect annotations of videos. We run over 2,400 experiments to investigate the impact of LLM architecture, model size, input modality, prompting strategy, and ground-truth processing method on engagement prediction.
arXiv Detail & Related papers (2025-02-05T17:14:47Z) - Mastering Board Games by External and Internal Planning with Language Models [30.782334791241556]
We show that search-based planning can yield significant improvements in Large Language Models' game-playing strength. We introduce, compare, and contrast two major approaches: in external search, the model guides Monte Carlo Tree Search rollouts and evaluations without calls to an external game engine, and in internal search, the model is trained to generate in-context a linearized tree of search and a resulting final choice. Our proposed approach, combining search with domain knowledge, is not specific to board games, hinting at more general future applications.
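External search of this kind typically scores actions with a PUCT-style selection rule, in which a model-supplied prior steers exploration. A toy sketch on a one-state bandit, with the "model" prior hard-coded as an assumption (this is not the paper's implementation):

```python
import math
import random

# PUCT-style selection: exploitation term plus prior-weighted exploration
# that decays as an action accumulates visits.
def puct(q, prior, parent_visits, child_visits, c=1.5):
    return q + c * prior * math.sqrt(parent_visits) / (1 + child_visits)

# Toy one-state search: 3 actions with hidden mean returns; the "model"
# prior (an assumption standing in for an LLM/policy-network output)
# happens to favor the best action.
true_means = [0.2, 0.5, 0.8]
priors = [0.2, 0.3, 0.5]
visits = [0, 0, 0]
values = [0.0, 0.0, 0.0]
random.seed(0)
for t in range(1, 2001):
    a = max(range(3), key=lambda i: puct(values[i], priors[i], t, visits[i]))
    reward = random.gauss(true_means[a], 0.1)   # simulated rollout return
    visits[a] += 1
    values[a] += (reward - values[a]) / visits[a]  # running mean value
print(visits)
```

Search concentrates visits on the highest-value action while the prior keeps early exploration focused, which is the mechanism that lets a language model's judgment guide rollouts without an external engine.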
arXiv Detail & Related papers (2024-12-02T18:56:51Z) - TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs [45.12542636218608]
We propose TMGBench, characterized by comprehensive game-type coverage, diverse scenarios, and flexible game organization. Specifically, we incorporate all 144 game types summarized by the Robinson-Goforth topology of 2x2 games, constructed as classic games in our benchmark. To provide a sustainable evaluation framework adaptable to increasingly powerful LLMs, we treat the aforementioned games as atomic units.
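A 2x2 game of the kind the Robinson-Goforth topology classifies can be encoded as a pair of ordinal payoff matrices. A brief sketch (illustrative, not TMGBench's actual encoding) using the Prisoner's Dilemma and a brute-force pure-strategy Nash equilibrium check:

```python
from itertools import product

# A 2x2 game as two payoff matrices (R for the row player, C for the
# column player) with ordinal payoffs 1..4, plus a brute-force check for
# pure-strategy Nash equilibria. Encoding is illustrative.
def pure_nash(R, C):
    """Return all (row, col) pure-strategy Nash equilibria of a 2x2 game."""
    equilibria = []
    for i, j in product(range(2), range(2)):
        row_best = R[i][j] >= max(R[k][j] for k in range(2))
        col_best = C[i][j] >= max(C[i][k] for k in range(2))
        if row_best and col_best:
            equilibria.append((i, j))
    return equilibria

# Prisoner's Dilemma: action 0 = cooperate, 1 = defect.
R = [[3, 1],
     [4, 2]]
C = [[3, 4],
     [1, 2]]
print(pure_nash(R, C))  # mutual defection is the unique pure equilibrium
```

Treating each such payoff-matrix pair as an atomic unit is what lets a benchmark compose the 144 ordinal game types into larger, more complex scenarios.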
arXiv Detail & Related papers (2024-10-14T13:15:34Z) - GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications.
This paper evaluates LLMs' reasoning abilities in competitive environments.
We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks.
arXiv Detail & Related papers (2024-02-19T18:23:36Z) - Evaluating Language Model Agency through Negotiations [39.87262815823634]
Negotiation games enable us to study multi-turn and cross-model interactions, modulate complexity, and side-step accidental evaluation data leakage.
We use our approach to test six widely used and publicly accessible LMs, evaluating performance and alignment in both self-play and cross-play settings.
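The self-play and cross-play settings mentioned above amount to a round-robin schedule over model pairings; a trivial sketch with placeholder model names:

```python
from itertools import combinations_with_replacement

# Sketch of a self-play / cross-play evaluation schedule: every model is
# paired with itself (self-play) and with every other model (cross-play).
# Model names are placeholders, not the paper's evaluated models.
models = ["model_a", "model_b", "model_c"]
matchups = list(combinations_with_replacement(models, 2))
self_play = [(x, y) for x, y in matchups if x == y]
cross_play = [(x, y) for x, y in matchups if x != y]
print(len(self_play), len(cross_play))
```

For n models this yields n self-play pairings and n(n-1)/2 cross-play pairings, which is why cross-play evaluations grow quadratically with the model pool.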
arXiv Detail & Related papers (2024-01-09T13:19:37Z) - ALYMPICS: LLM Agents Meet Game Theory -- Exploring Strategic Decision-Making with AI Agents [77.34720446306419]
Alympics is a systematic simulation framework utilizing Large Language Model (LLM) agents for game theory research.
Alympics creates a versatile platform for studying complex game theory problems.
arXiv Detail & Related papers (2023-11-06T16:03:46Z) - Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models [105.39236338147715]
The paper is inspired by the popular language game "Who is Spy".
We develop DEEP to evaluate LLMs' expression and disguising abilities.
We then introduce SpyGame, an interactive multi-agent framework.
arXiv Detail & Related papers (2023-10-31T14:37:42Z) - SPRING: Studying the Paper and Reasoning to Play Games [102.5587155284795]
We propose a novel approach, SPRING, to read a game's original academic paper and use the knowledge learned to reason about and play the game through a large language model (LLM).
In experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter open-world environment.
Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories.
arXiv Detail & Related papers (2023-05-24T18:14:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.