Fugu-MT Paper Translation (Abstract): Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL

Paper summary: Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL

arxiv url: http://arxiv.org/abs/2510.14318v1
Date: Thu, 16 Oct 2025 05:29:36 GMT
Status: Translation complete
Last updated in system: 2025-10-17 21:15:14.735979
Title: Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL
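Title (reference translation): Evaluating and Reducing Deceptive Dialogue from Language Models Using Multi-turn RL
Authors: Marwa Abdulhai, Ryan Cheng, Aryansh Shrivastava, Natasha Jaques, Yarin Gal, Sergey Levine
Abstract summary: Large language models (LLMs) interact with millions of people worldwide in applications such as customer support, education, and healthcare. Their ability to produce deceptive outputs, whether intentionally or inadvertently, raises significant safety concerns. This work investigates the extent to which LLMs engage in deception during dialogue and proposes the belief misalignment metric to quantify deception.
Reference score (independently computed attention score): 64.3268313484078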
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education and healthcare. However, their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. The unpredictable nature of LLM behavior, combined with insufficient safeguards against hallucination, misinformation, and user manipulation, makes their misuse a serious, real-world risk. In this paper, we investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception. We evaluate deception across four distinct dialogue scenarios, using five established deception detection metrics and our proposed metric. Our findings reveal this novel deception measure correlates more closely with human judgments than any existing metrics we test. Additionally, our benchmarking of eight state-of-the-art models indicates that LLMs naturally exhibit deceptive behavior in approximately 26% of dialogue turns, even when prompted with seemingly benign objectives. When prompted to deceive, LLMs are capable of increasing deceptiveness by as much as 31% relative to baselines. Unexpectedly, models trained with RLHF, the predominant approach for ensuring the safety of widely-deployed LLMs, still exhibit deception at a rate of 43% on average. Given that deception in dialogue is a behavior that develops over an interaction history, its effective evaluation and mitigation necessitates moving beyond single-utterance analyses. We introduce a multi-turn reinforcement learning methodology to fine-tune LLMs to reduce deceptive behaviors, leading to a 77.6% reduction compared to other instruction-tuned models.
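Abstract (reference translation): Large language models (LLMs) interact with millions of people worldwide in applications such as customer support, education, and healthcare. However, their ability to produce deceptive outputs, whether intentionally or inadvertently, raises significant safety concerns. The unpredictable nature of LLM behavior, combined with insufficient safeguards against hallucination, misinformation, and user manipulation, makes their misuse a serious real-world risk. This paper investigates the extent to which LLMs engage in deception within dialogue and proposes the belief misalignment metric to quantify deception. Deception is evaluated across four distinct dialogue scenarios using five established deception detection metrics together with the proposed metric. The findings show that this new deception measure correlates more closely with human judgments than any of the existing metrics tested. In addition, benchmarking of eight state-of-the-art models indicates that LLMs naturally exhibit deceptive behavior in approximately 26% of dialogue turns, even when prompted with seemingly benign objectives. When prompted to deceive, LLMs can increase deceptiveness by as much as 31% relative to baselines. Unexpectedly, models trained with RLHF, the predominant approach for ensuring the safety of widely deployed LLMs, still exhibit deception at a rate of 43% on average. Because deception in dialogue is a behavior that develops over an interaction history, evaluating and mitigating it effectively requires moving beyond single-utterance analyses. The paper introduces a multi-turn reinforcement learning method for fine-tuning LLMs to reduce deceptive behavior, achieving a 77.6% reduction compared with other instruction-tuned models.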

Related papers