Fugu-MT Paper Translation (Abstract): CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs

Paper abstract: CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs

arxiv url: http://arxiv.org/abs/2606.11063v1
Date: Tue, 09 Jun 2026 16:24:16 GMT
Status: Translation complete
Last updated in system: 2026-06-10 15:40:58.610986
Title: CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs
Authors: Joachim Schaeffer, Thomas Jiralerspong, Alexander Panfilov, Guillaume Lajoie, Jonas Geiping, Yoshua Bengio, Roland S. Zimmermann
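Abstract summary: We introduce CIAware-Bench, a benchmark for measuring control intervention (CI) awareness across frontier models. CIAware-Bench tests whether models can distinguish their own trajectories from those modified by a control intervention.
Reference score (independently computed attention score): 100.38986535324284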
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AI control protocols oversee untrusted models by monitoring their actions and modifying potentially unsafe steps, often using a trusted model. This partially tampers with the untrusted model's trajectory. If the untrusted model detects such an intervention, it may infer properties of the monitor and adapt to evade control. We introduce CIAware-Bench, a benchmark for measuring control intervention (CI) awareness across frontier models. CIAware-Bench tests whether models can distinguish their own trajectories from those modified by a control intervention. The benchmark comprises four task domains (essay writing, BigCodeBench, Bash Arena, and SHADE-Arena) and varies trajectory watermarking, side-task presence, and the control protocol. Evaluating eleven frontier models, we find low to moderate CI awareness under default settings (accuracy up to 0.87; chance-level accuracy for balanced binary classification is 0.5), with substantial variation across task domains and model pairs. Detection is generally easier across model families, suggesting that models exploit provider-specific differences in style or post-training. Overall, CI awareness is not a fixed model-level property and should be measured for each new model release and deployment scenario. We release CIAware-Bench to track CI awareness and to inform the design of control protocols whose interventions are harder to detect.
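The abstract reports CI awareness as balanced binary classification accuracy with a chance level of 0.5. The benchmark's own scoring code is not shown on this page, so the following is only a minimal sketch of how such a score could be computed. It assumes each trajectory carries a boolean label (modified by a control intervention or not) and that the model under test returns a boolean guess. The function name `ci_awareness_accuracy` and the sample data are hypothetical.

```python
from typing import Sequence


def ci_awareness_accuracy(predicted_modified: Sequence[bool],
                          actually_modified: Sequence[bool]) -> float:
    """Fraction of trajectories whose intervention status was guessed correctly."""
    if len(predicted_modified) != len(actually_modified):
        raise ValueError("predictions and labels must have the same length")
    if not actually_modified:
        raise ValueError("need at least one trajectory")
    correct = sum(p == a for p, a in zip(predicted_modified, actually_modified))
    return correct / len(actually_modified)


# Hypothetical balanced evaluation set: 4 untouched and 4 modified trajectories.
labels = [False, False, False, False, True, True, True, True]
# Hypothetical model guesses to "was this trajectory modified?".
guesses = [False, False, True, False, True, True, False, True]

accuracy = ci_awareness_accuracy(guesses, labels)
# A detector that always answers "not modified" sits at chance on a balanced set.
always_no = ci_awareness_accuracy([False] * len(labels), labels)

print(f"model accuracy: {accuracy:.2f}, constant-guess baseline: {always_no:.2f}")
```

Because the set is balanced, a constant guess scores exactly 0.5, which is why 0.5 serves as the reference point for the reported accuracies.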

Related papers