Fugu-MT Paper Translation (Abstract): Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts

Paper abstract: Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts

arxiv url: http://arxiv.org/abs/2508.06361v2
Date: Mon, 29 Sep 2025 09:05:19 GMT
Status: Translation complete
Last updated in system: 2025-09-30 17:47:09.14559
Title: Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts
Title (reference translation): Beyond Prompt-Induced Lies: An Investigation of LLM Deception on Benign Prompts
Authors: Zhaomin Wu, Mingzhe Du, See-Kiong Ng, Bingsheng He
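Abstract summary: Large language models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks. We propose a framework based on Contact Searching Questions (CSQ) to quantify the likelihood of deception.
Reference score (independently computed attention score): 79.1081247754018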
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks, making their trustworthiness critical. A significant and underexplored risk is intentional deception, where an LLM deliberately fabricates or conceals information to serve a hidden objective. Existing studies typically induce deception by explicitly setting a hidden objective through prompting or fine-tuning, which may not reflect real-world human-LLM interactions. Moving beyond such human-induced deception, we investigate LLMs' self-initiated deception on benign prompts. To address the absence of ground truth, we propose a framework based on Contact Searching Questions (CSQ). This framework introduces two statistical metrics derived from psychological principles to quantify the likelihood of deception. The first, the Deceptive Intention Score, measures the model's bias toward a hidden objective. The second, the Deceptive Behavior Score, measures the inconsistency between the LLM's internal belief and its expressed output. Evaluating 16 leading LLMs, we find that both metrics rise in parallel and escalate with task difficulty for most models. Moreover, increasing model capacity does not always reduce deception, posing a significant challenge for future LLM development.
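Abstract (reference translation): Large language models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks, which makes their trustworthiness critical. An LLM may deliberately fabricate or conceal information to serve a hidden objective. Existing studies induce deception by explicitly setting a hidden objective through prompting or fine-tuning, which may not reflect real-world human-LLM interactions. Moving beyond such human-induced deception, we investigate LLMs' self-initiated deception on benign prompts. To this end, we propose a framework based on Contact Searching Questions (CSQ). This framework introduces two statistical metrics derived from psychological principles to quantify the likelihood of deception. The first, the Deceptive Intention Score, measures the model's bias toward a hidden objective. The second, the Deceptive Behavior Score, measures the inconsistency between the LLM's internal belief and its expressed output. Evaluating 16 LLMs, we find that both metrics rise in parallel and escalate with task difficulty for most models. Moreover, increasing model capacity does not always reduce deception, which poses a significant challenge for future LLM development.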

Related papers