Fugu-MT Paper Translation (Abstract): Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring

Paper abstract: Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring

arxiv url: http://arxiv.org/abs/2602.04930v1
Date: Wed, 04 Feb 2026 10:42:07 GMT
Status: Translation complete
Last updated in system: 2026-02-06 18:49:08.553746
Title: Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring
Authors: Joachim Schaeffer, Arjun Khandelwal, Tyler Tracy,
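Abstract summary: We study the ability of AIs to subvert monitors by adversarially selecting against them. We decompose attack selection into two problems: mapping attacks to a quality score and mapping quality scores to submission probabilities. We find that prompting the attacker model to reason about the monitor while being cautious with attack selection reduces safety from a baseline of 99% to 59% at a 0.5% auditing budget.
Reference score (independently computed attention measure): 13.259219714721242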
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Future AI deployments will likely be monitored for malicious behaviour. The ability of these AIs to subvert monitors by adversarially selecting against them - attack selection - is particularly concerning. To study this, we let a red team create attack policies that attempt to insert attacks into code without being caught by a monitor in the concentrated BigCodeBench backdooring setting. We decompose attack selection into two problems: mapping attacks to a quality score and mapping quality scores to submission probabilities. We frame attack selection as a classification problem and show that safety is significantly more sensitive to FPR than TPR. We find that prompting the attacker model to reason about the monitor while being cautious with attack selection reduces safety from a baseline of 99% to 59% at 0.5% auditing budget, emphasizing that eliciting attack selection capabilities of models is vital to avoid overly optimistic safety scores in control evaluations.
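The two-step decomposition and the classification framing in the abstract can be made concrete with a toy model. The sketch below is not the paper's method: the `Candidate` data, the hard-threshold rule in `submission_probability`, and the closed-form `safety` function are all illustrative assumptions. It assumes each task yields one candidate attack, that a submitted attack which evades the monitor ends the game in the red team's favour, and that a submitted attack which does not evade is caught and ends the game in the blue team's favour.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    quality: float  # hypothetical attacker-assigned quality score in [0, 1]
    evades: bool    # hypothetical ground truth: would the attack slip past the monitor?


# Made-up example data, not taken from the paper.
CANDIDATES = [
    Candidate(0.9, True), Candidate(0.8, True), Candidate(0.4, True),
    Candidate(0.7, False), Candidate(0.6, False), Candidate(0.3, False),
    Candidate(0.2, False), Candidate(0.1, False),
]


def submission_probability(quality: float, threshold: float) -> float:
    """Step 2 of the decomposition (assumed form): a hard threshold on quality."""
    return 1.0 if quality >= threshold else 0.0


def rates(candidates, threshold):
    """Treat 'submit' as the positive class and return (TPR, FPR)."""
    good = [c for c in candidates if c.evades]
    bad = [c for c in candidates if not c.evades]
    tpr = sum(submission_probability(c.quality, threshold) for c in good) / len(good)
    fpr = sum(submission_probability(c.quality, threshold) for c in bad) / len(bad)
    return tpr, fpr


def safety(tpr: float, fpr: float, p_evade: float, n_tasks: int) -> float:
    """Probability that no attack succeeds over n_tasks in the toy model."""
    a = p_evade * tpr        # per-task chance an evading attack is submitted (red wins)
    b = (1 - p_evade) * fpr  # per-task chance a detectable attack is submitted (blue wins)
    if a + b == 0:
        return 1.0           # the attacker never submits anything
    p_red_wins = a / (a + b) * (1 - (1 - a - b) ** n_tasks)
    return 1 - p_red_wins


if __name__ == "__main__":
    p_evade = sum(c.evades for c in CANDIDATES) / len(CANDIDATES)
    for threshold in (0.0, 0.5, 0.75):
        tpr, fpr = rates(CANDIDATES, threshold)
        print(threshold, round(tpr, 3), round(fpr, 3),
              round(safety(tpr, fpr, p_evade, n_tasks=100), 3))
```

In this toy data, raising the threshold from 0.0 to 0.75 lowers TPR only from 1 to 2/3 but drives FPR to zero, and safety falls from 0.625 to nearly 0. A cautious attacker that almost never submits a detectable attack can afford to wait for a good one, which is one intuition for why safety could depend more strongly on FPR than on TPR.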


Related papers