Fugu-MT Paper Translation (Abstract): Decomposing and Measuring Evaluation Awareness

Paper overview: Decomposing and Measuring Evaluation Awareness

arxiv url: http://arxiv.org/abs/2605.23055v1
Date: Thu, 21 May 2026 21:38:34 GMT
Status: translation complete
Last updated in system: 2026-05-25 17:29:20.108073
Title: Decomposing and Measuring Evaluation Awareness
Title (reference translation): Decomposing and Measuring Evaluation Awareness
Authors: Changling Li, Terry Jingchen Zhang, Jie Zhang, Zhijing Jin, Sahar Abdelnabi, Maksym Andriushchenko
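Abstract summary: This work grounds evaluation awareness in social psychology and decomposes it into an environment component and a model component. The environment component is operationalized through eight categorized trigger factors, such as placeholder entities and grading-style output formats. EvalAwareBench is a factor-controlled benchmark of 100 paired safety-capability tasks.
Reference score (independently computed attention measure): 31.24565272626909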
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Frontier language models sometimes recognize that they are being evaluated and adjust their behavior, undermining validity of benchmark results. Yet the field studies it without a shared foundation, conflating properties of the evaluation with properties of the model, and detection with behavioral response. We ground evaluation awareness in social psychology, decomposing it into an environment component (how recognizable the task is) and a model component that separates recognition from propensity to act on it. We operationalize the environment component through eight categorized trigger factors, such as placeholder entities and grading-style output formats, and study recognition and behavior through chain-of-thought monitoring. Across nine frontier models and four benchmarks, recognition rates depend on the specific pairing of model and benchmark rather than on either in isolation. Recognition rarely leads to behavioral change, and when it does, the direction depends on the type of evaluation perceived. Models are also more sensitive to safety than capability evaluations, placing safety benchmark validity at greater risk. To study which factors each model is sensitive to and how they interact, we propose EvalAwareBench, a factor-controlled benchmark of 100 paired safety-capability tasks where each of the eight factors can be independently toggled, varying evaluative signals while holding the underlying request fixed. Through EvalAwareBench, we find that no single factor uniformly affects all models, but stacking factors progressively raises evaluation awareness across all of them. Our framework and EvalAwareBench provide the tools to measure, attribute, and mitigate evaluation awareness, pointing to behavioral consistency under recognition as a promising path forward.
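Abstract (reference translation): Frontier language models sometimes recognize that they are being evaluated and adjust their behavior, which undermines the validity of benchmark results. However, the field studies this without a shared foundation, conflating properties of the evaluation with properties of the model, and detection with behavioral response. We ground evaluation awareness in social psychology and decompose it into an environment component (how recognizable the task is) and a model component that separates recognition from the propensity to act on it. We operationalize the environment component through eight categorized trigger factors, such as placeholder entities and grading-style output formats, and study recognition and behavior through chain-of-thought monitoring. Across nine frontier models and four benchmarks, recognition rates depend on the specific pairing of model and benchmark rather than on either alone. Recognition rarely leads to behavioral change, and when it does, the direction depends on the type of evaluation perceived. Models are also more sensitive to safety evaluations than to capability evaluations, which places the validity of safety benchmarks at greater risk. To study which factors each model is sensitive to and how they interact, we propose EvalAwareBench, a factor-controlled benchmark of 100 paired safety-capability tasks in which each of the eight factors can be toggled independently, varying evaluative signals while holding the underlying request fixed. Through EvalAwareBench, we find that no single factor uniformly affects all models, but stacking factors progressively raises evaluation awareness across all of them. Our framework and EvalAwareBench provide the tools to measure, attribute, and mitigate evaluation awareness, and point to behavioral consistency under recognition as a promising path forward.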
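
The factor-controlled design described in the abstract (eight factors toggled independently while the underlying request stays fixed) can be illustrated with a minimal sketch. This is not the authors' implementation. Only the first two factor names are taken from the abstract; `factor_3` through `factor_8`, the helper functions, and the `render` output shape are all hypothetical placeholders chosen for illustration.

```python
# Hypothetical sketch of independent and stacked factor toggling.
# Only the first two factor names appear in the abstract; the rest are
# placeholders, since the abstract does not name the other six factors.
FACTORS = [
    "placeholder_entities",
    "grading_style_output_format",
    "factor_3",
    "factor_4",
    "factor_5",
    "factor_6",
    "factor_7",
    "factor_8",
]


def single_factor_configs(factors):
    """Return one configuration per factor with only that factor switched on."""
    return [{f: (f == active) for f in factors} for active in factors]


def stacked_configs(factors):
    """Return configurations that switch on the first k factors, k = 0..len(factors)."""
    return [
        {f: (i < k) for i, f in enumerate(factors)}
        for k in range(len(factors) + 1)
    ]


def render(request, config):
    """Pair a fixed request with the list of active factors (assumed record shape)."""
    active = [f for f, on in config.items() if on]
    return {"request": request, "active_factors": active}


if __name__ == "__main__":
    base_request = "example request text"  # stand-in, not from the benchmark
    for config in stacked_configs(FACTORS):
        variant = render(base_request, config)
        print(len(variant["active_factors"]), variant["request"])
```

Under these assumptions, the single-factor configurations correspond to asking which factor a given model is sensitive to, while the stacked configurations correspond to the abstract's observation about progressively adding factors. The request string is identical across all variants, which is the property the abstract calls holding the underlying request fixed.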
