Fugu-MT Paper Translation (Abstract): M3-BENCH: Process-Aware Evaluation of LLM Agents Social Behaviors in Mixed-Motive Games

Paper summary: M3-BENCH: Process-Aware Evaluation of LLM Agents Social Behaviors in Mixed-Motive Games

arxiv url: http://arxiv.org/abs/2601.08462v1
Date: Tue, 13 Jan 2026 11:38:51 GMT
Status: Translation complete
Last updated in system: 2026-01-14 18:27:19.17521
Title: M3-BENCH: Process-Aware Evaluation of LLM Agents Social Behaviors in Mixed-Motive Games
Title (reference translation): M3-BENCH: Process-Aware Evaluation of the Social Behaviors of LLM Agents in Mixed-Motive Games
Authors: Sixiong Xie, Zhuofan Shi, Haiyang Shen, Gang Huang, Yun Ma, Xiang Jing
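Abstract summary: M3-Bench is a multi-stage benchmark for mixed-motive games. We integrate the Big Five personality model and Social Exchange Theory to aggregate multi-dimensional evidence into interpretable social behavior portraits.
Reference score (independently computed attention level): 4.88323005571385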
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As the capabilities of large language model (LLM) agents continue to advance, their advanced social behaviors, such as cooperation, deception, and collusion, call for systematic evaluation. However, existing benchmarks often emphasize a single capability dimension or rely solely on behavioral outcomes, overlooking rich process information from agents' decision reasoning and communicative interactions. To address this gap, we propose M3-Bench, a multi-stage benchmark for mixed-motive games, together with a process-aware evaluation framework that conducts synergistic analysis across three modules: BTA (Behavioral Trajectory Analysis), RPA (Reasoning Process Analysis), and CCA (Communication Content Analysis). Furthermore, we integrate the Big Five personality model and Social Exchange Theory to aggregate multi-dimensional evidence into interpretable social behavior portraits, thereby characterizing agents' personality traits and capability profiles beyond simple task scores or outcome-based metrics. Experimental results show that M3-Bench can reliably distinguish diverse social behavior competencies across models, and it reveals that some models achieve seemingly reasonable behavioral outcomes while exhibiting pronounced inconsistencies in their reasoning and communication.
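Abstract (reference translation): As the capabilities of large language model (LLM) agents continue to improve, their advanced social behaviors, such as cooperation, deception, and collusion, call for systematic evaluation. Existing benchmarks, however, often emphasize a single capability dimension or rely only on behavioral outcomes, and so overlook the rich process information in agents' decision reasoning and communicative interactions. To address this gap, we propose M3-Bench, a multi-stage benchmark for mixed-motive games, together with a process-aware evaluation framework that performs synergistic analysis across three modules: BTA (Behavioral Trajectory Analysis), RPA (Reasoning Process Analysis), and CCA (Communication Content Analysis). In addition, we integrate the Big Five personality model and Social Exchange Theory to aggregate multi-dimensional evidence into interpretable social behavior portraits, characterizing agents' personality traits and capability profiles beyond simple task scores or outcome-based metrics. Experimental results show that M3-Bench can reliably distinguish diverse social behavior competencies across models.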

Related papers