Fugu-MT Paper Translation (Abstract): Prefill Awareness in Large Language Models

Paper overview: Prefill Awareness in Large Language Models

arxiv url: http://arxiv.org/abs/2606.12747v1
Date: Wed, 10 Jun 2026 23:26:39 GMT
Status: Translation complete
Last updated in system: 2026-06-12 15:55:27.500552
Title: Prefill Awareness in Large Language Models
Authors: Andy Wang, Parv Mahajan, David Demitri Africa, Alexandra Souly, Jordan Taylor, Robert Kirk,
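Abstract summary: This study examines whether language models can distinguish tampered from untampered assistant-side context. The authors find that frontier models show substantial prefill awareness. The results indicate that prefill awareness is already a substantial confound for some prefill-based methods.
Reference score (independently computed attention score): 42.57596462680195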
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Safety-relevant studies of language models, including alignment and jailbreaking evaluations and AI control protocols, often rely on prefilling model outputs. If AI models can recognize and act on the fact that their prior assistant messages have been inserted or edited, the effectiveness and validity of these methods could be compromised. We investigate whether frontier language models can distinguish between tampered and untampered assistant-side context, a capability we call prefill awareness. To do so, we construct a binary preference benchmark across three prefill mechanisms, filtering for cases where models show consistent stances. We find that frontier models show substantial prefill awareness: Claude Opus 4.5 detects prefills opposing its preferences in 9-35% of cases with a 0% false positive rate when prompted; additionally, models often revert towards baseline behavior without explicitly reporting that the prefill was foreign. Controlled ablations also show that detection and resistance rely on different cues: stylistic mismatch mainly affects whether models flag a prefill as foreign, while preference mismatch mainly affects whether they revert toward their baseline answer. We also examine more realistic agentic settings such as misalignment-continuation evaluations and SWE-bench trajectories, where frontier models sometimes disavow prefilled assistant turns in ways that depend strongly on dataset, task success, and hidden formatting artifacts. Our results indicate that prefill awareness is already a substantial confound for some prefill-based methods. We recommend that model developers track this capability in frontier systems.
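The abstract reports two metrics: a detection rate on tampered context and a false positive rate on untampered context. The following is a minimal sketch of how such rates could be computed, not the authors' benchmark code. The `Trial` record, its field names, and the toy data are all hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One hypothetical benchmark trial (field names are assumptions)."""
    prefilled: bool  # True if the assistant-side context was inserted or edited
    flagged: bool    # True if the model reported the context as foreign


def detection_rate(trials):
    """Fraction of tampered (prefilled) trials that the model flagged."""
    tampered = [t for t in trials if t.prefilled]
    if not tampered:
        raise ValueError("no tampered trials to score")
    return sum(t.flagged for t in tampered) / len(tampered)


def false_positive_rate(trials):
    """Fraction of untampered trials that the model wrongly flagged."""
    untampered = [t for t in trials if not t.prefilled]
    if not untampered:
        raise ValueError("no untampered trials to score")
    return sum(t.flagged for t in untampered) / len(untampered)


# Made-up toy data for illustration only: 4 tampered and 4 untampered trials.
toy_trials = [
    Trial(prefilled=True, flagged=True),
    Trial(prefilled=True, flagged=False),
    Trial(prefilled=True, flagged=False),
    Trial(prefilled=True, flagged=False),
    Trial(prefilled=False, flagged=False),
    Trial(prefilled=False, flagged=False),
    Trial(prefilled=False, flagged=False),
    Trial(prefilled=False, flagged=False),
]

print(detection_rate(toy_trials))       # 1 of 4 tampered trials flagged
print(false_positive_rate(toy_trials))  # 0 of 4 untampered trials flagged
```

Scoring the two rates over disjoint subsets keeps them independent: a model that flags everything would score a high detection rate but also a high false positive rate.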
