Fugu-MT Paper Translation (Abstract): QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

Paper overview: QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

arxiv url: http://arxiv.org/abs/2605.27068v1
Date: Tue, 26 May 2026 14:19:08 GMT
Status: Translation complete
Last updated in system: 2026-05-27 17:51:42.215389
Title: QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents
Authors: Ye Yuan, Rui Song, Weien Li, Zeyu Li, Haochen Liu, Xiangyu Kong, Changjiang Han, Yonghan Yang, Zichen Zhao, Zixuan Dong, Fuyuan Lyu, Bowei He, Haolun Wu, Jikun Kang, Xue Liu
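Abstract summary: QUACK is an evaluation framework for auditing the grounding of agent language in multimodal social reasoning. It reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it. Even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence.
Reference score (independently computed attention score): 38.13248430205106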
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and remain largely limited to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier VLMs in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK.
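The abstract describes the Statement Verification Pipeline only at a high level: rebuild each agent's trajectory from engine logs, then check discussion claims against it. The following is a minimal illustrative sketch of that idea for spatial claims, not the authors' implementation. The `LogEvent` and `SpatialClaim` record formats, the tick-based timeline, and all function names are assumptions made for this example; the actual QUACK log schema is not given here.

```python
from dataclasses import dataclass


# Hypothetical record formats; the real QUACK schema is not stated in the abstract.
@dataclass(frozen=True)
class LogEvent:
    tick: int    # engine time step
    agent: str   # agent identifier
    room: str    # location recorded by the engine


@dataclass(frozen=True)
class SpatialClaim:
    speaker: str  # agent making the statement
    subject: str  # agent the statement is about
    room: str     # claimed location
    tick: int     # claimed time step


def reconstruct_trajectories(events):
    """Group engine log events into a per-agent {tick: room} mapping."""
    trajectories = {}
    for event in events:
        trajectories.setdefault(event.agent, {})[event.tick] = event.room
    return trajectories


def verify_claim(claim, trajectories):
    """Label one claim as 'grounded', 'hallucinated', or 'unverifiable'."""
    actual = trajectories.get(claim.subject, {}).get(claim.tick)
    if actual is None:
        return "unverifiable"  # the logs hold no record to compare against
    return "grounded" if actual == claim.room else "hallucinated"


def hallucination_rate(claims, trajectories):
    """Fraction of verifiable claims that contradict the logs."""
    verdicts = [verify_claim(c, trajectories) for c in claims]
    verifiable = [v for v in verdicts if v != "unverifiable"]
    if not verifiable:
        return 0.0
    return verifiable.count("hallucinated") / len(verifiable)


events = [
    LogEvent(tick=1, agent="red", room="kitchen"),
    LogEvent(tick=1, agent="blue", room="lab"),
    LogEvent(tick=2, agent="red", room="hall"),
]
claims = [
    SpatialClaim(speaker="red", subject="red", room="kitchen", tick=1),
    SpatialClaim(speaker="blue", subject="red", room="lab", tick=2),
    SpatialClaim(speaker="blue", subject="green", room="hall", tick=1),
]
trajectories = reconstruct_trajectories(events)
rate = hallucination_rate(claims, trajectories)
```

In this toy run the first claim matches the log, the second contradicts it, and the third concerns an agent with no logged position, so it is excluded from the denominator. Restricting the rate to verifiable claims mirrors the abstract's phrasing of "verifiable spatial claims", though how QUACK itself decides verifiability is not specified.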

Related papers