Fugu-MT Paper Translation (Abstract): Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

Paper summary: Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

arxiv url: http://arxiv.org/abs/2605.06161v1
Date: Thu, 07 May 2026 12:49:09 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:11.791428
Title: Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
Title (reference translation): Policy Invariance as a Reliability Test for LLM Safety Judges
Authors: Shihao Weng, Yang Feng, Xiaofei Xie
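Abstract summary: LLM-as-a-Judge pipelines have become the de facto evaluator for agent safety. Existing benchmarks treat their verdicts as ground-truth proxies without checking whether those verdicts depend on the agent's behavior or merely on how the evaluation policy is worded. The authors operationalize policy invariance as three testable principles: rubric-semantics invariance under certified-equivalent rewrites, rubric-threshold invariance under intentional strict-to-lenient shifts, and ambiguity-aware calibration.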
Reference score (independently computed attention metric): 26.595399077062638
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLM-as-a-Judge pipelines have become the de facto evaluator for agent safety, yet existing benchmarks treat their verdicts as ground-truth proxies without checking whether the verdicts depend on the agent's behavior or merely on how the evaluation policy happens to be worded. We argue that any trustworthy safety judge must satisfy a basic property we call policy invariance, and we operationalize it as three testable principles: rubric-semantics invariance under certified-equivalent rewrites, rubric-threshold invariance under intentional strict-to-lenient shifts, and ambiguity-aware calibration so that verdict instability concentrates on genuinely ambiguous cases. Instantiating these principles as a stress-test protocol with four agent-class judges on trajectories drawn from ASSEBench and R-Judge, we surface a previously unmeasured failure mode: today's judges respond to meaningful normative shifts and to meaningless structural rewrites with comparable strength, and cannot tell the two apart. Content-preserving policy rewrites flip up to 9.1% of verdicts above baseline jitter, and 18-43% of all observed flips occur on unambiguous cases under such rewrites, so existing safety scores conflate what the agent did with how the evaluator was prompted. Beyond the diagnosis, we contribute the Policy Invariance Score and the Judge Card reporting protocol, which expose an order-of-magnitude spread in judge reliability that is invisible to accuracy-only leaderboards. We release the protocol and code so that future agent-safety benchmarks can audit their own evaluators rather than trust them by default.
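Abstract (reference translation): LLM-as-a-Judge pipelines have become the de facto evaluation tool for agent safety, but existing benchmarks treat their verdicts as ground-truth proxies without checking whether a verdict depends on the agent's behavior or simply on how the evaluation policy is worded. The authors argue that a trustworthy safety judge must satisfy a basic property called policy invariance, and they operationalize it as three testable principles: rubric-semantics invariance under certified-equivalent rewrites, rubric-threshold invariance under intentional strict-to-lenient shifts, and ambiguity-aware calibration, so that verdict instability concentrates on genuinely ambiguous cases. They instantiate these principles as a stress-test protocol with four agent-class judges on trajectories drawn from ASSEBench and R-Judge, and report a previously unmeasured failure mode: today's judges respond to meaningful normative shifts and to meaningless structural rewrites with comparable strength, and cannot distinguish the two. Content-preserving policy rewrites flip up to 9.1% of verdicts above baseline jitter, and 18-43% of all observed flips under such rewrites occur on unambiguous cases, so existing safety scores conflate what the agent did with how the evaluator was prompted. Beyond the diagnosis, the authors contribute the Policy Invariance Score and the Judge Card reporting protocol, which expose an order-of-magnitude spread in judge reliability that accuracy-only leaderboards do not show. The protocol and code are released so that future agent-safety benchmarks can audit their own evaluators rather than trust them by default.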
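
The abstract reports verdict flips "above baseline jitter" but does not define the computation. One plausible reading is sketched below, and it is an assumption rather than the authors' method: jitter is taken as the flip rate between two runs of the same judge under an identical policy, and the reported quantity is the flip rate under a content-preserving rewrite minus that jitter. The function names `flip_rate` and `excess_flip_rate`, the verdict labels, and the toy data are all hypothetical.

```python
from typing import Sequence


def flip_rate(run_a: Sequence[str], run_b: Sequence[str]) -> float:
    """Fraction of trajectories whose verdict differs between two judge runs."""
    if len(run_a) != len(run_b):
        raise ValueError("both runs must cover the same trajectories")
    if not run_a:
        raise ValueError("verdict lists must be non-empty")
    return sum(a != b for a, b in zip(run_a, run_b)) / len(run_a)


def excess_flip_rate(
    baseline: Sequence[str],
    baseline_rerun: Sequence[str],
    rewritten: Sequence[str],
) -> float:
    """Flip rate under a rewritten policy, net of run-to-run jitter.

    Assumption: jitter is estimated from a second run with the unchanged
    policy. The paper may define or estimate it differently.
    """
    jitter = flip_rate(baseline, baseline_rerun)
    observed = flip_rate(baseline, rewritten)
    return observed - jitter


# Toy verdicts for eight trajectories (illustrative, not data from the paper).
baseline = ["safe", "unsafe", "safe", "unsafe", "safe", "safe", "unsafe", "safe"]
# Same policy, second run: one verdict changes (pure jitter).
baseline_rerun = ["safe", "unsafe", "safe", "safe", "safe", "safe", "unsafe", "safe"]
# Content-preserving rewrite of the policy: three verdicts change.
rewritten = ["unsafe", "unsafe", "safe", "safe", "safe", "unsafe", "unsafe", "safe"]

jitter = flip_rate(baseline, baseline_rerun)
excess = excess_flip_rate(baseline, baseline_rerun, rewritten)
print(f"jitter={jitter:.3f} excess={excess:.3f}")
```

Subtracting jitter keeps ordinary sampling noise from being counted as sensitivity to wording. A negative result would mean the rewrite changed fewer verdicts than a plain rerun did.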

Related papers