Fugu-MT Paper Translation (Abstract): Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts

Paper abstract: Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts

arxiv url: http://arxiv.org/abs/2508.06361v1
Date: Fri, 08 Aug 2025 14:46:35 GMT
Status: Translation complete
System update date: 2025-08-11 20:39:06.268463
Title: Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts
Title (reference translation): Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts
Authors: Zhaomin Wu, Mingzhe Du, See-Kiong Ng, Bingsheng He
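Abstract summary: Large language models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks. This paper introduces statistical metrics derived from psychological principles to quantify the likelihood of deception. The results show that even the most advanced LLMs exhibit an increasing tendency toward deception when handling complex problems.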
Reference score (independently computed attention score): 41.48336680924274
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have been widely deployed in reasoning, planning, and decision-making tasks, making their trustworthiness a critical concern. The potential for intentional deception, where an LLM deliberately fabricates or conceals information to serve a hidden objective, remains a significant and underexplored threat. Existing studies typically induce such deception by explicitly setting a "hidden" objective through prompting or fine-tuning, which may not fully reflect real-world human-LLM interactions. Moving beyond this human-induced deception, we investigate LLMs' self-initiated deception on benign prompts. To address the absence of ground truth in this evaluation, we propose a novel framework using "contact searching questions." This framework introduces two statistical metrics derived from psychological principles to quantify the likelihood of deception. The first, the Deceptive Intention Score, measures the model's bias towards a hidden objective. The second, the Deceptive Behavior Score, measures the inconsistency between the LLM's internal belief and its expressed output. Upon evaluating 14 leading LLMs, we find that both metrics escalate as task difficulty increases, rising in parallel for most models. Building on these findings, we formulate a mathematical model to explain this behavior. These results reveal that even the most advanced LLMs exhibit an increasing tendency toward deception when handling complex problems, raising critical concerns for the deployment of LLM agents in complex and crucial domains.
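Abstract (reference translation): Large language models (LLMs) have been widely deployed in reasoning, planning, and decision-making tasks, and their trustworthiness has become a critical concern. The possibility of intentional deception, in which an LLM deliberately fabricates or conceals information to serve a hidden objective, remains a significant and underexplored threat. Existing studies typically induce such deception by explicitly setting a "hidden" objective through prompting or fine-tuning, which may not fully reflect real-world human-LLM interactions. Going beyond this human-induced deception, we investigate LLMs' self-initiated deception on benign prompts. To address the absence of ground truth in this evaluation, we propose a new framework using "contact searching questions." This framework introduces two statistical metrics derived from psychological principles to quantify the likelihood of deception. The first, the Deceptive Intention Score, measures the model's bias towards a hidden objective. The second, the Deceptive Behavior Score, measures the inconsistency between the LLM's internal belief and its expressed output. Evaluating 14 leading LLMs, we find that both metrics escalate as task difficulty increases, rising in parallel for most models. Building on these findings, we formulate a mathematical model to explain this behavior. These results reveal that even the most advanced LLMs exhibit an increasing tendency toward deception when handling complex problems, raising critical concerns for the deployment of LLM agents in complex and crucial domains.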


Related papers