Fugu-MT Paper Translation (Abstract): Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model

Paper overview: Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model

arxiv url: http://arxiv.org/abs/2604.09665v2
Date: Wed, 15 Apr 2026 18:17:49 GMT
Status: Translation complete
Last updated in system: 2026-04-19 19:09:11.607508
Title: Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model
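Title (reference translation): Deliberative alignment is deep, but uncertainty remains: improving inference-time safety by attributing unsafe behavior to the base model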
Authors: Pankayaraj Pathmanathan, Furong Huang
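Abstract summary: The paper shows that an alignment gap exists between teacher and student language models, even though the teacher models are larger and stronger in safety capability. It proposes a BoN sampling method that attributes unsafe behavior back to the base LLM in the latent space. Across 7 teacher models and 6 student models of different classes and sizes, the average attack success rate (ASR) fell by 28.2% on DAN, 31.3% on WildJailbreak, and 35.4% on the StrongREJECT benchmark.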
Reference score (independently computed attention score): 50.29667251847595
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While the wide adoption of refusal training in large language models (LLMs) has showcased improvements in model safety, recent works have highlighted shortcomings due to the shallow nature of these alignment methods. To this end, the work on deliberative alignment proposed distilling reasoning capabilities from stronger reasoning models, thereby instilling deeper safety in LLMs. In this work, we study the impact of deliberative alignment in language models. First, we show that even though the teacher models are larger in size and stronger in safety capability, there exists an alignment gap between teacher and student language models, which affects both the safety and general utility of the student model. Furthermore, we show that models aligned through deliberative alignment can retain unsafe behaviors from the base model despite learning the reasoning patterns of larger reasoning models. Building upon this observation, we propose a BoN sampling method that attributes the unsafe behavior back to the base LLMs in the latent space, thereby down-ranking unsafe responses to gain a meaningful improvement in model safety across multiple safety benchmarks with minimal loss in utility. In particular, across 7 teacher models and 6 student models of different classes and sizes, we show an average attack success rate (ASR) reduction of 28.2% on DAN, 31.3% on WildJailbreak, and 35.4% on the StrongREJECT benchmark. We further show that these safety gains persist after RL training, thus highlighting the uncertainty in safety reasoning and its explicit attribution to the base model.
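Abstract (reference translation): The wide adoption of refusal training in large language models (LLMs) has improved model safety, but recent work has highlighted shortcomings that stem from the shallowness of these alignment methods. To address this, work on deliberative alignment proposed distilling reasoning capabilities from stronger reasoning models, instilling deeper safety in LLMs. This study examines the impact of deliberative alignment on language models. First, it shows that an alignment gap exists between teacher and student language models, even though the teacher models are larger and stronger in safety capability, and that this gap affects both the safety and the general utility of the student model. It further shows that models aligned through deliberative alignment can retain unsafe behaviors from the base model even while learning the reasoning patterns of larger reasoning models. Building on this observation, the study proposes a BoN sampling method that attributes unsafe behavior back to the base LLM in the latent space, down-ranking unsafe responses to improve model safety across multiple safety benchmarks with minimal loss in utility. In particular, across 7 teacher models and 6 student models of different classes and sizes, the average attack success rate (ASR) fell by 28.2% on DAN, 31.3% on WildJailbreak, and 35.4% on the StrongREJECT benchmark. The study also shows that these safety gains persist after RL training, highlighting the uncertainty in safety reasoning and its explicit attribution to the base model.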
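
The down-ranking idea in the abstract can be illustrated with a minimal best-of-N style sketch. This is not the authors' implementation: the abstract does not specify how the latent-space attribution is computed, so the cosine-similarity score, the `unsafe_direction` vector, the `embed` callable, and the toy latent vectors below are all assumptions made purely for illustration.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_candidates(
    candidates: List[str],
    embed: Callable[[str], Vector],
    unsafe_direction: Vector,
) -> List[str]:
    """Sort N sampled responses so that those whose latent representation is
    least similar to the (hypothetical) unsafe direction come first."""
    return sorted(candidates, key=lambda c: cosine(embed(c), unsafe_direction))


# Toy stand-ins: in practice the latents would come from model hidden states.
toy_latents = {
    "I can't help with that.": [0.1, 0.9],
    "Sure, here is how...": [0.9, 0.1],
    "Here is a safer alternative.": [0.3, 0.8],
}
unsafe_direction = [1.0, 0.0]  # assumed direction associated with unsafe base-model behavior

ranked = rank_candidates(list(toy_latents), toy_latents.__getitem__, unsafe_direction)
best = ranked[0]  # the response selected by best-of-N
```

Selecting `ranked[0]` returns the candidate furthest from the assumed unsafe direction, while the response most aligned with it is pushed to the end of the list. A real system would replace the toy dictionary with latents extracted from the aligned and base models.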


Related papers list