Fugu-MT Paper Translation (Abstract): MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

Paper summary: MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

arxiv url: http://arxiv.org/abs/2605.29512v1
Date: Thu, 28 May 2026 07:33:47 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:55.953544
Title: MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs
Title (reference translation): MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs
Authors: Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng, Jianzhu Yao, Benjamin Finch, Leon Guertler, Viraj Nadkarni, Yihan Jiang, Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov, Siyuan Wu, Yu-Chi Cheng, Yan-Ru Ju, Ti-Rong Wu, I-Hsuan Chu, Yu-Yu Yang, I-Chen Wu, Yitian Huang, Qinlu Cao, Yiheng Sun, Yuhong Dai, Hongkun Yao, Jingxuan Fu, Jiwei Zhang, Hao Liao, Mossimo Ebeling, Govind Arun, Sadhvik Bathini, Mihir S Arya, Avinash Anish, Aditya Ranjan, Kirtana Sunil Phatnani, Paval KS, Vrushali Mehta, Aravind S, Nikhil Arora, Tanya Upadhyay, Amol Bandagale, Yuan Lu, ChunEn Hsiao, YuTing Lin, Arvin Chung, Jerry John Thomas, Mathieu Laurière, Leshem Choshen, Yoram Bachrach, Pramod Viswanath, Maria Polukarov, Cheston Tan, Tal Kachman, Atlas Wang
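Abstract summary: We introduce Mindgames, a multi-game arena and evaluation platform for large language model (LLM) agents. Mindgames provides a unified interaction interface, TrueSkill-based rating, and full trajectory logging across four game environments. We release a dataset of 29,571 multi-agent games with turn-level observations, actions, and rewards, together with MG-Ref, a deterministic offline tournament protocol.
Reference score (independently computed attention score): 54.81359054218573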
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly deployed as interactive agents, yet their capacity for social and strategic reasoning over extended interaction remains poorly understood. Existing evaluations rely on static vignettes or single-game benchmarks that cannot capture the sustained, multi-faceted reasoning that real-world multi-agent settings demand. We introduce Mindgames, a multi-game arena and evaluation platform for LLM agents that operationalizes complementary reasoning demands relevant to "theory of mind": belief attribution under hidden information, opponent modeling through repeated strategic interaction, cooperative inference under knowledge asymmetries, and sustained deception in social deduction. Built on TextArena, Mindgames provides a unified interaction interface, TrueSkill-based rating, and full trajectory logging across four game environments. We instantiate Mindgames through a 2025 competition cycle hosted at a major AI conference, which assessed 944 submitted agents from 76 teams across four games: Colonel Blotto, Iterated Prisoner's Dilemma, Codenames, and Secret Mafia. Our analysis surfaces both agent-level and evaluation-level limitations: brittle rule adherence remains a major bottleneck, top-performing systems repeatedly rely on explicit structural scaffolding, and leaderboard validity differs sharply across environments. In particular, failure-heavy environments can reward robustness to opponent errors as much as strategic ability, with Secret Mafia exhibiting a pronounced error-survival confound in this cycle. We release a dataset of 29,571 multi-agent games with turn-level observations, actions, and rewards, together with MG-Ref, a deterministic offline tournament protocol that scores new agents against a frozen reference pool of top-ranked, low-error Stage II submissions under the same error-attribution lens used in this analysis.
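Abstract (reference translation): Large language models (LLMs) are increasingly deployed as interactive agents, but their capacity for social and strategic reasoning over extended interaction is still not well understood. Existing evaluations depend on static vignettes or single-game benchmarks, which cannot capture the sustained, multi-faceted reasoning that real-world multi-agent settings require. We introduce Mindgames, a multi-game arena and evaluation platform for LLM agents that operationalizes complementary reasoning demands related to "theory of mind": belief attribution under hidden information, opponent modeling through repeated strategic interaction, cooperative inference under knowledge asymmetries, and sustained deception in social deduction. Built on TextArena, Mindgames provides a unified interaction interface, TrueSkill-based rating, and full trajectory logging across four game environments. We instantiate Mindgames through a 2025 competition cycle hosted at a major AI conference, which assessed 944 submitted agents from 76 teams across four games: Colonel Blotto, Iterated Prisoner's Dilemma, Codenames, and Secret Mafia. Our analysis surfaces limitations at both the agent level and the evaluation level: brittle rule adherence remains a major bottleneck, top-performing systems repeatedly rely on explicit structural scaffolding, and leaderboard validity differs sharply across environments. In particular, failure-heavy environments can reward robustness to opponent errors as much as strategic ability, and Secret Mafia shows a pronounced error-survival confound in this cycle. We release a dataset of 29,571 multi-agent games with turn-level observations, actions, and rewards, together with MG-Ref, a deterministic offline tournament protocol that scores new agents against a frozen reference pool of top-ranked, low-error Stage II submissions under the same error-attribution lens used in this analysis.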
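
The abstract describes MG-Ref only at a high level: new agents are scored deterministically against a frozen reference pool. The sketch below illustrates that general idea for Iterated Prisoner's Dilemma, and it is not the paper's protocol. The payoff values, the three reference strategies, the round count, and the names `play_match`, `REFERENCE_POOL`, and `score_against_pool` are all assumptions made for illustration. TrueSkill rating and error attribution are left out.

```python
from typing import Callable, Dict, List, Tuple

# Conventional Prisoner's Dilemma payoffs (assumed; the abstract gives no matrix).
PAYOFFS: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

# Each history entry is (own move, opponent move) for one past round.
History = List[Tuple[str, str]]
Agent = Callable[[History], str]


def always_cooperate(history: History) -> str:
    return "C"


def always_defect(history: History) -> str:
    return "D"


def tit_for_tat(history: History) -> str:
    # Cooperate on the first round, then copy the opponent's last move.
    return "C" if not history else history[-1][1]


def play_match(a: Agent, b: Agent, rounds: int = 10) -> Tuple[int, int]:
    """Play a fixed number of rounds and return the total payoffs (a, b)."""
    hist_a: History = []
    hist_b: History = []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = a(hist_a), b(hist_b)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append((move_a, move_b))
        hist_b.append((move_b, move_a))
    return score_a, score_b


# Hypothetical frozen pool: fixed, deterministic strategies that never change.
REFERENCE_POOL: Dict[str, Agent] = {
    "always_cooperate": always_cooperate,
    "always_defect": always_defect,
    "tit_for_tat": tit_for_tat,
}


def score_against_pool(
    candidate: Agent, pool: Dict[str, Agent], rounds: int = 10
) -> Dict[str, int]:
    """Return the candidate's total payoff against each reference agent."""
    # Iterating in sorted name order keeps the result order reproducible.
    return {
        name: play_match(candidate, pool[name], rounds)[0]
        for name in sorted(pool)
    }


if __name__ == "__main__":
    print(score_against_pool(tit_for_tat, REFERENCE_POOL))
```

Because every agent here is a pure function of the match history and the pool is fixed, repeated runs give identical scores. That reproducibility is the property a frozen reference pool is meant to provide, and a live leaderboard whose opponents change over time does not have it.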

