Related papers: Can They Dixit? Yes they Can! Dixit as a Playground for Multimodal Language Model Capabilities

Can They Dixit? Yes they Can! Dixit as a Playground for Multimodal Language Model Capabilities

URL: http://arxiv.org/abs/2510.19892v1
Date: Wed, 22 Oct 2025 17:21:16 GMT
Title: Can They Dixit? Yes they Can! Dixit as a Playground for Multimodal Language Model Capabilities
Authors: Nishant Balepur, Dang Nguyen, Dayeon Ki,
Abstract summary: We propose game-based evaluations to holistically assess capabilities.<n>Games require multiple abilities for players to win, are inherently competitive, and are governed by fix, objective rules.<n>We manifest this evaluation specifically through Dixit, a fantasy card game.
Score: 17.019600215402704
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-modal large language models (MLMs) are often assessed on static, individual benchmarks -- which cannot jointly assess MLM capabilities in a single task -- or rely on human or model pairwise comparisons -- which is highly subjective, expensive, and allows models to exploit superficial shortcuts (e.g., verbosity) to inflate their win-rates. To overcome these issues, we propose game-based evaluations to holistically assess MLM capabilities. Games require multiple abilities for players to win, are inherently competitive, and are governed by fix, objective rules, and makes evaluation more engaging, providing a robust framework to address the aforementioned challenges. We manifest this evaluation specifically through Dixit, a fantasy card game where players must generate captions for a card that trick some, but not all players, into selecting the played card. Our quantitative experiments with five MLMs show Dixit win-rate rankings are perfectly correlated with those on popular MLM benchmarks, while games between human and MLM players in Dixit reveal several differences between agent strategies and areas of improvement for MLM reasoning.

Related papers

LLM CHESS: Benchmarking Reasoning and Instruction-Following in LLMs through Chess [30.797553771114746]
We introduce LLM CHESS, an evaluation framework designed to probe the generalization of reasoning and instruction-following abilities in large language models (LLMs)<n>We rank over 50 open and closed source models by playing against a random opponent using a range of behavioral metrics, including move quality, move legality, hallucinated actions, and game duration.<n>For a subset of top reasoning models, we derive an Elo estimate by playing against a chess engine with variably configured skill, which allows for comparisons between models in an easily understandable way.
arXiv Detail & Related papers (2025-12-01T18:51:08Z)
LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition [104.81487689011341]
We introduce LM Fight Arena, a novel framework that evaluates large multimodal models in Mortal Kombat II.<n>Unlike static evaluations, LM Fight Arena provides a fully automated, reproducible, and objective assessment of an LMM's strategic reasoning capabilities.
arXiv Detail & Related papers (2025-10-10T02:19:21Z)
Can Large Language Models Master Complex Card Games? [18.39826127562161]
Large language models (LLMs) have exhibited remarkable capabilities across various tasks.<n>We show that LLMs can approach the performance of strong game AIs through supervised fine-tuning on high-quality data.<n>LLMs experience a decline in general capabilities when mastering complex games, but this decline can be mitigated by integrating a certain amount of general instruction data.
arXiv Detail & Related papers (2025-09-01T10:11:56Z)
Who is a Better Player: LLM against LLM [53.46608216197315]
We propose an adversarial benchmarking framework to assess the comprehensive performance of Large Language Models (LLMs) through board games competition.<n>We introduce Qi Town, a specialized evaluation platform that supports 5 widely played games and involves 20 LLM-driven players.
arXiv Detail & Related papers (2025-08-05T06:41:47Z)
lmgame-Bench: How Good are LLMs at Playing Games? [60.01834131847881]
We study the major challenges in using popular video games to evaluate modern large language model (LLM) agents.<n>We introduce lmgame-Bench to turn games into reliable evaluations.
arXiv Detail & Related papers (2025-05-21T06:02:55Z)
GAMEBoT: Transparent Assessment of LLM Reasoning in Games [54.49589494014147]
GAMEBoT is a gaming arena designed for rigorous assessment of Large Language Models.<n>We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics.<n>Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts.
arXiv Detail & Related papers (2024-12-18T08:32:53Z)
Evaluating Creativity and Deception in Large Language Models: A Simulation Framework for Multi-Agent Balderdash [6.65572931991284]
Large Language Models (LLMs) have shown impressive capabilities in complex tasks and interactive environments. This paper introduces a simulation framework utilizing the game Balderdash to evaluate both the creativity and logical reasoning of LLMs.
arXiv Detail & Related papers (2024-11-15T18:42:48Z)
LLMs May Not Be Human-Level Players, But They Can Be Testers: Measuring Game Difficulty with LLM Agents [10.632179121247466]
We propose a general game-testing framework using LLM agents and test it on two widely played strategy games: Wordle and Slay the Spire. Our results reveal an interesting finding: although LLMs may not perform as well as the average human player, their performance, when guided by simple, generic prompting techniques, shows a statistically significant and strong correlation with difficulty indicated by human players. This suggests that LLMs could serve as effective agents for measuring game difficulty during the development process.
arXiv Detail & Related papers (2024-10-01T18:40:43Z)
SmartPlay: A Benchmark for LLMs as Intelligent Agents [45.76707302899935]
SmartPlay consists of 6 different games, including Rock-Paper-Scissors, Tower of Hanoi, Minecraft. Each game challenges a subset of 9 important capabilities of an intelligent LLM agent. Tests include reasoning with object dependencies, planning ahead, spatial reasoning, learning from history, and understanding randomness.
arXiv Detail & Related papers (2023-10-02T18:52:11Z)
Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation [52.930183136111864]
We propose using scorable negotiation to evaluate Large Language Models (LLMs) To reach an agreement, agents must have strong arithmetic, inference, exploration, and planning capabilities. We provide procedures to create new games and increase games' difficulty to have an evolving benchmark.
arXiv Detail & Related papers (2023-09-29T13:33:06Z)
GameEval: Evaluating LLMs on Conversational Games [93.40433639746331]
We propose GameEval, a novel approach to evaluating large language models (LLMs) GameEval treats LLMs as game players and assigns them distinct roles with specific goals achieved by launching conversations of various forms. We show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems.
arXiv Detail & Related papers (2023-08-19T14:33:40Z)
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [76.21004582932268]
We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Arena, a crowdsourced battle platform.
arXiv Detail & Related papers (2023-06-09T05:55:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.