Fugu-MT paper translation (abstract): The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning

Paper overview: The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning

arxiv url: http://arxiv.org/abs/2605.17113v1
Date: Sat, 16 May 2026 18:36:29 GMT
Status: translation complete
System update date: 2026-05-19 17:57:47.613478
Title: The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning
Authors: Scott Merrill, Shashank Srivastava
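Abstract summary: This paper introduces counterfactual localization of deceptive commitment in language models. It constructs five environments in which deception is never prompted but emerges from strategic incentives. The resulting corpus localizes $\sim$1.46M sentences across four reasoning models.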
Reference score (independently computed attention measure): 9.827138852806305
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing deception datasets label completed outputs as honest or deceptive, treating deception as a property of the final response rather than a function of the model's reasoning trace. This obscures a more fundamental question: when does a language model become committed to deception? We introduce counterfactual localization: for each sentence prefix in a reasoning trace, we fix the prefix, resample continuations, and estimate the probability of a deceptive outcome. To scale this, we construct five environments (spanning strategic bluffing, maze guidance, financial advice, used-car sales, and offer negotiation) in which deception is never prompted but emerges from strategic incentives and labels follow mechanically from environment state rather than subjective human judgment. The resulting corpus localizes $\sim$1.46M sentences across four reasoning models, drawn from over 94.1M sampled continuations, 91.5B generated tokens, and over 100K scenarios. Sentence-level human evaluation confirms that detected commitment points correspond to interpretable shifts in decision state. Using this resource, we show that lexical cues for commitment prediction transfer poorly across environments, whereas attention-based transition features generalize out of distribution, suggesting that deceptive commitment is reflected in reusable changes in reasoning dynamics rather than surface form. We further identify compact attention-head sets (under 10% of heads) that, selected on one environment, causally suppress deceptive commitment across held-out environments. We release the corpus as a substrate for studying deception, and more broadly commitment, in language-model reasoning.
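The localization procedure described in the abstract (fix a sentence prefix, resample continuations, estimate the probability of a deceptive outcome) can be sketched as a simple Monte Carlo loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `localize`, `commitment_index`, and `toy_sampler` are hypothetical names, the sampler is a toy stand-in for resampling a real reasoning model and labeling the outcome from environment state, and the threshold rule for picking a commitment point is an assumption that the abstract does not specify.

```python
import random
from typing import Callable, List, Optional, Sequence


def localize(
    sentences: Sequence[str],
    sample_outcome: Callable[[Sequence[str], random.Random], bool],
    n_samples: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate P(deceptive outcome | prefix) for every sentence prefix."""
    rng = random.Random(seed)
    probs = []
    for k in range(1, len(sentences) + 1):
        prefix = sentences[:k]  # fix the first k sentences of the trace
        # Resample n_samples continuations and count deceptive outcomes.
        hits = sum(sample_outcome(prefix, rng) for _ in range(n_samples))
        probs.append(hits / n_samples)
    return probs


def commitment_index(probs: Sequence[float], threshold: float = 0.9) -> Optional[int]:
    """Hypothetical rule: first prefix from which the estimate stays >= threshold."""
    for k in range(len(probs)):
        if all(p >= threshold for p in probs[k:]):
            return k
    return None


def toy_sampler(prefix: Sequence[str], rng: random.Random) -> bool:
    """Toy stand-in for a model: the deceptive rate jumps once 'bluff' appears."""
    p = 0.95 if any("bluff" in s for s in prefix) else 0.2
    return rng.random() < p
```

In a real setting, `sample_outcome` would call the model with the fixed prefix, let it finish the trace, and read the honest-or-deceptive label mechanically from the environment state. The per-prefix estimates then form a curve over the trace, and a jump in that curve marks a candidate commitment point.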


Related papers