Fugu-MT Paper Translation (Abstract): Intentional Deception as Controllable Capability in LLM Agents

Paper summary: Intentional Deception as Controllable Capability in LLM Agents

arxiv url: http://arxiv.org/abs/2603.07848v1
Date: Sun, 08 Mar 2026 23:48:49 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:15.321662
Title: Intentional Deception as Controllable Capability in LLM Agents
Authors: Jason Starace, Terence Soule
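Abstract summary: This paper systematically studies intentional deception as an engineered capability in multi-agent systems. It examines a two-stage system that infers a target agent's characteristics and generates deceptive responses steering the target toward actions counter to its beliefs and motivations. Deceptive intervention produces differential effects concentrated in specific behavioral profiles rather than distributed uniformly.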
Reference score (independently computed attention level): 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: As LLM-based agents increasingly operate in multi-agent systems, understanding adversarial manipulation becomes critical for defensive design. We present a systematic study of intentional deception as an engineered capability, using LLM-to-LLM interactions within a text-based RPG where parameterized behavioral profiles (9 alignments x 4 motivations, yielding 36 profiles with explicit ethical ground truth) serve as our experimental testbed. Unlike accidental deception from misalignment, we investigate a two-stage system that infers target agent characteristics and generates deceptive responses steering targets toward actions counter to their beliefs and motivations. We find that deceptive intervention produces differential effects concentrated in specific behavioral profiles rather than distributed uniformly, and that 88.5% of successful deceptions employ misdirection (true statements with strategic framing) rather than fabrication, indicating fact-checking defenses would miss the large majority of adversarial responses. Motivation, inferable at 98%+ accuracy, serves as the primary attack vector, while belief systems remain harder to identify (49% inference ceiling) or exploit. These findings identify which agent profiles require additional safeguards and suggest that current fact-verification approaches are insufficient against strategically framed deception.
