Fugu-MT paper translation (abstract): Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

Paper overview: Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

arxiv url: http://arxiv.org/abs/2604.11666v1
Date: Mon, 13 Apr 2026 16:14:41 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:16.673564
Title: Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind
Title (reference translation): Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind
Authors: Hanqi Xiao, Vaidehi Patil, Zaid Khan, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
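Abstract summary: We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB). To succeed on ToM-SB, the defender must form a ToM of the attacker and fool the attacker into believing they have extracted sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB. We train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards.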
Reference score (independently computed attention score): 66.6995270293745
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker's beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. Across four attackers with different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are well-correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.
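Abstract (reference translation): As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., to form and use a theory of mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB). To succeed on ToM-SB, the defender must form a ToM of the attacker and fool the attacker into believing they have extracted sensitive information. Strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios where the attacker has partial prior knowledge, even when prompted to reason about the attacker's beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. Notably, the relationship between ToM and attacker-fooling emerges in both directions: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. We find that gains in ToM and attacker-fooling are well-correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of the task.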

Related papers