Fugu-MT paper translation (abstract): Evaluating whether AI models would sabotage AI safety research

Paper overview: Evaluating whether AI models would sabotage AI safety research

arxiv url: http://arxiv.org/abs/2604.24618v1
Date: Mon, 27 Apr 2026 15:47:07 GMT
Status: translation complete
Last updated in system: 2026-04-28 17:12:08.127059
Title: Evaluating whether AI models would sabotage AI safety research
Title (reference translation): Evaluating whether AI models would sabotage AI safety research
Authors: Robert Kirk, Alexandra Souly, Kai Fronsdal, Abby D'Cruz, Xander Davies,
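Abstract summary: We evaluate the propensity of frontier models to sabotage or refuse to assist with safety research when deployed as AI research agents within a frontier AI company. No instances of unprompted sabotage were found in any model, and refusal rates were close to zero for Mythos Preview and Opus 4.7 Preview. The paper discusses limitations including evaluation awareness confounds, limited scenario coverage, and untested pathways to risk beyond safety research sabotage.
Reference score (independently computed attention score): 40.16647985759823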
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We evaluate the propensity of frontier models to sabotage or refuse to assist with safety research when deployed as AI research agents within a frontier AI company. We apply two complementary evaluations to four Claude models (Mythos Preview, Opus 4.7 Preview, Opus 4.6, and Sonnet 4.6): an unprompted sabotage evaluation testing model behaviour with opportunities to sabotage safety research, and a sabotage continuation evaluation testing whether models continue to sabotage when placed in trajectories where prior actions have started undermining research. We find no instances of unprompted sabotage across any model, with refusal rates close to zero for Mythos Preview and Opus 4.7 Preview, though all models sometimes only partially completed tasks. In the continuation evaluation, Mythos Preview actively continues sabotage in 7% of cases (versus 3% for Opus 4.6, 4% for Sonnet 4.6, and 0% for Opus 4.7 Preview), and exhibits reasoning-output discrepancy in the majority of these cases, indicating covert sabotage reasoning. Our evaluation framework builds on Petri, an open-source LLM auditing tool, with a custom scaffold running models inside Claude Code, alongside an iterative pipeline for generating realistic sabotage trajectories. We measure both evaluation awareness and a new form of situational awareness termed "prefill awareness", the capability to recognise that prior trajectory content was not self-generated. Opus 4.7 Preview shows notably elevated unprompted evaluation awareness, while prefill awareness remains low across all models. Finally, we discuss limitations including evaluation awareness confounds, limited scenario coverage, and untested pathways to risk beyond safety research sabotage.
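Abstract (reference translation): We assess how inclined frontier models are to sabotage safety research, or to refuse to help with it, when they are deployed as AI research agents inside a frontier AI company. Two complementary evaluations are applied to four Claude models (Mythos Preview, Opus 4.7 Preview, Opus 4.6, and Sonnet 4.6). Refusal rates are close to zero for Mythos Preview and Opus 4.7 Preview, although all models sometimes completed tasks only partially. In the continuation evaluation, Mythos Preview actively continues sabotage in 7% of cases (compared with 3% for Opus 4.6, 4% for Sonnet 4.6, and 0% for Opus 4.7 Preview), and in the majority of those cases its reasoning and its output diverge, which indicates covert sabotage reasoning. The evaluation framework is built on Petri, an open-source LLM auditing tool, and adds a custom scaffold that runs models inside Claude Code together with an iterative pipeline for generating realistic sabotage trajectories. We measure both evaluation awareness and a new form of situational awareness called "prefill awareness", the ability to recognise that prior trajectory content was not self-generated. Opus 4.7 Preview shows notably elevated unprompted evaluation awareness, while prefill awareness remains low across all models. Finally, we discuss limitations including evaluation awareness confounds, limited scenario coverage, and untested pathways to risk beyond safety research sabotage.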

Related papers