Fugu-MT paper translation (abstract): Models That Know How Evaluations Are Designed Score Safer

Paper abstract: Models That Know How Evaluations Are Designed Score Safer

arxiv url: http://arxiv.org/abs/2605.28591v1
Date: Wed, 27 May 2026 15:11:35 GMT
Status: translation complete
Last updated in system: 2026-05-28 17:38:56.146635
Title: Models That Know How Evaluations Are Designed Score Safer
Title (reference translation): Models That Know How Evaluations Are Designed Score Safer
Authors: Katharina Deckenbach, Haritz Puerto, Jonas Geiping, Sahar Abdelnabi
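Abstract summary: We study evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. We hypothesize that models trained on texts describing evaluation practices may implicitly learn to recognize and respond to evaluation-like contexts. The results suggest that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that does not depend on explicit memorization or verbalized evaluation awareness.
Reference score (independently computed attention score): 38.21092181000792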
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation of this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices may implicitly learn to recognize and respond to evaluation-like contexts, for instance, through exposure to scientific articles or social media posts about AI benchmarking. To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating this fine-tuned model on six safety benchmarks, we find that it is significantly safer than the base model and control model. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness. Our results demonstrate that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness, thus, challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge.
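Abstract (reference translation): The validity of AI safety evaluations depends on models behaving consistently across controlled settings and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and the behavioral shift that follows. This paper examines one potential explanation of this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. Just as dataset contamination raises performance through memorization of benchmarks, we hypothesize that models trained on texts describing evaluation practices may implicitly learn to recognize and respond to evaluation-like contexts, for instance through exposure to scientific articles or social media posts about AI benchmarking. To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating this fine-tuned model on six safety benchmarks, we find that it is significantly safer than the base model and the control model. This behavioral shift persists even when the analysis is restricted to responses that lack explicit verbalization of evaluation awareness. The results suggest that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness and is therefore challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge.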
