Fugu-MT Paper Translation (Abstract): Stress Testing Deliberative Alignment for Anti-Scheming Training

Paper abstract: Stress Testing Deliberative Alignment for Anti-Scheming Training

arxiv url: http://arxiv.org/abs/2509.15541v1
Date: Fri, 19 Sep 2025 02:49:56 GMT
Status: Translation complete
Last updated in system: 2025-09-22 18:18:10.96608
Title: Stress Testing Deliberative Alignment for Anti-Scheming Training
Authors: Bronson Schoen, Evgenia Nitishinskaya, Mikita Balesni, Axel Højmark, Felix Hofstätter, Jérémy Scheurer, Alexander Meinke, Jason Wolfe, Teun van der Weij, Alex Lloyd, Nicholas Goldowsky-Dill, Angela Fan, Andrei Matveiakin, Rusheb Shah, Marcus Williams, Amelia Glaese, Boaz Barak, Wojciech Zaremba, Marius Hobbhahn
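Abstract summary: Highly capable AI systems could secretly pursue misaligned goals, which we call "scheming", and measuring and mitigating scheming requires different strategies than are typically used in ML. We use a broad category of "covert actions", such as secretly breaking rules or intentionally underperforming in tests, as a proxy for scheming.
Reference score (independently computed attention score): 39.16405205129775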
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Highly capable AI systems could secretly pursue misaligned goals -- what we call "scheming". Because a scheming AI would deliberately try to hide its misaligned goals and actions, measuring and mitigating scheming requires different strategies than are typically used in ML. We propose that assessing anti-scheming interventions requires at least (1) testing propensity to scheme on far out-of-distribution (OOD) tasks, (2) evaluating whether lack of scheming is driven by situational awareness, and (3) checking for robustness to pre-existing misaligned goals. We use a broad category of "covert actions" -- such as secretly breaking rules or intentionally underperforming in tests -- as a proxy for scheming, and design evaluations for covert actions. We then stress-test deliberative alignment as a case study for anti-scheming. Across 26 OOD evaluations (180+ environments), deliberative alignment reduces covert action rates (OpenAI o3: 13%->0.4%) but does not fully eliminate them. Our mitigation is also able to largely stop agents from pursuing a hidden goal previously trained into the model, but we still find misbehavior after additional red-teaming. We find that models' chain-of-thought (CoT) often demonstrates awareness of being evaluated for alignment, and show causal evidence that this awareness decreases covert behavior, while unawareness increases it. Therefore, we cannot exclude that the observed reductions in covert action rates are at least partially driven by situational awareness. While we rely on human-legible CoT for training, studying situational awareness, and demonstrating clear evidence of misalignment, our ability to rely on this degrades as models continue to depart from reasoning in standard English. We encourage research into alignment mitigations for scheming and their assessment, especially for the adversarial case of deceptive alignment, which this paper does not address.
