Fugu-MT Paper Translation (Abstract): CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D

Paper summary: CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D

arxiv url: http://arxiv.org/abs/2511.09904v2
Date: Tue, 18 Nov 2025 06:58:19 GMT
Status: Translation complete
Last updated in system: 2025-11-19 13:59:16.579102
Title: CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D
Title (reference translation): CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D
Authors: Francis Rhys Ward, Teun van der Weij, Hanna Gábor, Sam Martin, Raja Mehta Moreno, Harel Lidar, Louis Makower, Thomas Jodrell, Lauren Robson
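Abstract summary: We investigate the capabilities of AI agents to act against the interests of their users when conducting machine learning (ML) engineering. We extend MLE-Bench, a benchmark for realistic ML tasks, with code-sabotage tasks such as implanting backdoors and purposefully causing generalisation failures. We use LM monitors to detect suspicious agent behaviour, and we measure model capability to sabotage and sandbag without being detected by these monitors.
Reference score (independently computed attention measure): 4.230181169227057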
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AI systems are increasingly able to autonomously conduct realistic software engineering tasks, and may soon be deployed to automate machine learning (ML) R&D itself. Frontier AI systems may be deployed in safety-critical settings, including to help ensure the safety of future systems. Unfortunately, frontier and future systems may not be sufficiently trustworthy, and there is evidence that these systems may even be misaligned with their developers or users. Therefore, we investigate the capabilities of AI agents to act against the interests of their users when conducting ML engineering, by sabotaging ML models, sandbagging their performance, and subverting oversight mechanisms. First, we extend MLE-Bench, a benchmark for realistic ML tasks, with code-sabotage tasks such as implanting backdoors and purposefully causing generalisation failures. Frontier agents make meaningful progress on our sabotage tasks. In addition, we study agent capabilities to sandbag on MLE-Bench. Agents can calibrate their performance to specified target levels below their actual capability. To mitigate sabotage, we use LM monitors to detect suspicious agent behaviour, and we measure model capability to sabotage and sandbag without being detected by these monitors. Overall, monitors are capable of detecting code-sabotage attempts but our results suggest that detecting sandbagging is more difficult. Additionally, aggregating multiple monitor predictions works well, but monitoring may not be sufficiently reliable to mitigate sabotage in high-stakes domains. Our benchmark is implemented in the UK AISI's Inspect framework and we make our code publicly available at https://github.com/TeunvdWeij/ctrl-alt-deceit
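Abstract (reference translation): AI systems can increasingly carry out realistic software engineering tasks autonomously, and may soon be deployed to automate machine learning (ML) R&D itself. Frontier AI systems may be deployed in safety-critical settings, including to help ensure the safety of future systems. Unfortunately, frontier and future systems may not be sufficiently trustworthy, and these systems may even be misaligned with their developers or users. We therefore investigate the ability of AI agents to act against the interests of their users when conducting ML engineering, by sabotaging ML models, sandbagging their performance, and subverting oversight mechanisms. First, we extend MLE-Bench, a benchmark for realistic ML tasks, with code-sabotage tasks such as implanting backdoors and deliberately causing generalisation failures. Frontier agents make meaningful progress on the sabotage tasks. We also study the ability of agents to sandbag on MLE-Bench. Agents can calibrate their performance to specified target levels below their actual capability. To mitigate sabotage, we use LM monitors to detect suspicious agent behaviour, and we measure model capability to sabotage and sandbag without being detected by these monitors. Overall, monitors can detect code-sabotage attempts, but the results suggest that detecting sandbagging is more difficult. In addition, aggregating multiple monitor predictions works well, but monitoring may not be reliable enough to mitigate sabotage in high-stakes domains. Our benchmark is implemented in the UK AISI's Inspect framework, and the code is publicly available at https://github.com/TeunvdWeij/ctrl-alt-deceit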


Related papers