Fugu-MT paper translation (abstract): Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

Paper abstract: Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

arxiv url: http://arxiv.org/abs/2604.20972v1
Date: Wed, 22 Apr 2026 18:05:29 GMT
Status: translation complete
Last updated in system: 2026-04-24 14:40:06.119229
Title: Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI
Title (reference translation): Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI
Authors: Michael O'Herlihy, Rosa Català
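Abstract summary: We formalize evaluation as policy-grounded correctness and introduce the Defensibility Index (DI) and Ambiguity Index (AI). We validate the framework on 193,000+ Reddit moderation decisions across multiple communities and evaluation cohorts.
Reference score (independently computed attention score): 0.6138671548064355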
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Content moderation systems are typically evaluated by measuring agreement with human labels. In rule-governed environments this assumption fails: multiple decisions may be logically consistent with the governing policy, and agreement metrics penalize valid decisions while mischaracterizing ambiguity as error - a failure mode we term the Agreement Trap. We formalize evaluation as policy-grounded correctness and introduce the Defensibility Index (DI) and Ambiguity Index (AI). To estimate reasoning stability without additional audit passes, we introduce the Probabilistic Defensibility Signal (PDS), derived from audit-model token logprobs. We harness LLM reasoning traces as a governance signal rather than a classification output by deploying the audit model not to decide whether content violates policy, but to verify whether a proposed decision is logically derivable from the governing rule hierarchy. We validate the framework on 193,000+ Reddit moderation decisions across multiple communities and evaluation cohorts, finding a 33-46.6 percentage-point gap between agreement-based and policy-grounded metrics, with 79.8-80.6% of the model's false negatives corresponding to policy-grounded decisions rather than true errors. We further show that measured ambiguity is driven by rule specificity: auditing 37,286 identical decisions under three tiers of the same community rules reduces AI by 10.8 pp while DI remains stable. Repeated-sampling analysis attributes PDS variance primarily to governance ambiguity rather than decoding noise. A Governance Gate built on these signals achieves 78.6% automation coverage with 64.9% risk reduction. Together, these results show that evaluation in rule-governed environments should shift from agreement with historical labels to reasoning-grounded validity under explicit rules.
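Abstract (reference translation): Content moderation systems are usually evaluated by measuring agreement with human labels. In rule-governed environments, multiple decisions may be logically consistent with the governing policy, so agreement metrics penalize valid decisions and misread ambiguity as error. We formalize evaluation as policy-grounded correctness and introduce the Defensibility Index (DI) and Ambiguity Index (AI). To estimate reasoning stability without additional audit passes, we introduce the Probabilistic Defensibility Signal (PDS), derived from audit-model token logprobs. We use LLM reasoning traces as a governance signal rather than a classification output: the audit model is deployed not to decide whether content violates policy, but to verify whether a proposed decision is logically derivable from the governing rule hierarchy. We validate the framework on 193,000+ Reddit moderation decisions across multiple communities and evaluation cohorts, finding a gap of 33-46.6 percentage points between agreement-based and policy-grounded metrics, with 79.8-80.6% of the model's false negatives corresponding to policy-grounded decisions rather than true errors. We further show that measured ambiguity is driven by rule specificity: auditing 37,286 identical decisions under three tiers of the same community rules reduces AI by 10.8 pp while DI remains stable. Repeated-sampling analysis attributes PDS variance primarily to governance ambiguity rather than decoding noise. A Governance Gate built on these signals achieves 78.6% automation coverage with 64.9% risk reduction. These results suggest that evaluation in rule-governed environments should shift from agreement with historical labels to reasoning-grounded validity under explicit rules.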
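
The gap the abstract describes between agreement-based and policy-grounded metrics can be illustrated with a minimal sketch. The abstract does not give the formulas for DI, AI, or PDS, so everything below is an assumption made for illustration: the `Decision` record, the boolean `defensible` audit verdict, and the three helper functions are hypothetical, and the toy data is invented rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Decision:
    """One moderation decision (hypothetical record layout)."""
    model_removes: bool   # decision proposed by the model
    human_removes: bool   # historical moderator label
    defensible: bool      # assumed audit verdict: derivable from the rules?


def agreement_rate(decisions: Sequence[Decision]) -> float:
    """Share of decisions where the model matches the human label."""
    return sum(d.model_removes == d.human_removes for d in decisions) / len(decisions)


def defensible_rate(decisions: Sequence[Decision]) -> float:
    """Share of decisions the audit judged defensible.

    This is a stand-in for a policy-grounded metric, not the paper's DI.
    """
    return sum(d.defensible for d in decisions) / len(decisions)


def defensible_share_of_false_negatives(decisions: Sequence[Decision]) -> Optional[float]:
    """Among false negatives (human removed, model kept), the defensible share."""
    false_negatives = [d for d in decisions if d.human_removes and not d.model_removes]
    if not false_negatives:
        return None
    return sum(d.defensible for d in false_negatives) / len(false_negatives)


# Invented toy data: ten decisions, four disagreements, one true error.
toy = [
    Decision(True, True, True),
    Decision(False, False, True),
    Decision(False, True, True),    # false negative, but defensible
    Decision(False, True, True),    # false negative, but defensible
    Decision(False, True, False),   # false negative and a true error
    Decision(True, False, True),    # false positive, but defensible
    Decision(True, True, True),
    Decision(False, False, True),
    Decision(True, True, True),
    Decision(False, False, True),
]

gap_pp = (defensible_rate(toy) - agreement_rate(toy)) * 100

if __name__ == "__main__":
    print(f"agreement rate:        {agreement_rate(toy):.1%}")
    print(f"defensible rate:       {defensible_rate(toy):.1%}")
    print(f"gap (percentage pts):  {gap_pp:.1f}")
    print(f"defensible FN share:   {defensible_share_of_false_negatives(toy):.1%}")
```

In this toy set, agreement counts every disagreement with the historical label as an error, while the audit-based rate counts only the decision that cannot be derived from the rules. The same split applied to false negatives mirrors the abstract's observation that most of them are policy-grounded decisions rather than true errors.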
