Fugu-MT paper translation (abstract): Gaming the Metric, Not the Harm: Certifying Safety Audits against Strategic Platform Manipulation

Paper abstract: Gaming the Metric, Not the Harm: Certifying Safety Audits against Strategic Platform Manipulation

arxiv url: http://arxiv.org/abs/2605.06324v1
Date: Thu, 07 May 2026 14:22:21 GMT
Status: translation complete
Last updated in system: 2026-05-08 22:27:11.889538
Title: Gaming the Metric, Not the Harm: Certifying Safety Audits against Strategic Platform Manipulation
Title (reference translation): Certifying Safety Audits against Strategic Platform Manipulation
Authors: Florian A. D. Burnat, Brittany I. Davidson
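Abstract summary: We ask when an audit metric can certify a genuine reduction in harm. The protocol is modeled as a published transformation graph whose connected components form semantic classes. The claims are checked at three levels: exhaustive enumeration on a finite-state grid of mixed strategies, an SMT encoding in Z3 cross-replayed in cvc5, and a single-player MDP encoded in PRISM-games.
Reference score (independently computed attention score): 1.253312107729806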
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Online-safety regulation under the UK Online Safety Act and the EU Digital Services Act increasingly treats scalar metrics as compliance evidence. Once announced, such a metric also becomes an optimization target: a strategic platform can improve its score by routing recommendations through semantically equivalent content variants, without reducing true harm. We ask when such an audit metric can still certify a genuine reduction in harm. The protocol is modeled as a published transformation graph whose connected components form semantic classes, and the metric itself is treated as a security object. Three results follow. First, any metric that scores variants directly is manipulable as soon as two equivalent variants in a harmful class disagree in score. Second, the semantic-envelope lift, which assigns each variant the maximum score in its class, is the unique pointwise minimum among conservative classwise-constant repairs. Third, a class-stratified certificate, $H^\star(x) \le (1/\hat{\alpha}) M_{\mathrm{Env}(m)}(x) + \bar{\eta}$, holds for every platform strategy, with $\bar{\eta}$ absorbing annotation and protocol error. We check the claims at three levels: exhaustive enumeration on a finite-state grid of mixed strategies, an SMT encoding in Z3 cross-replayed in cvc5, and a bounded single-player MDP encoded in PRISM-games. The fragile metric fails manipulation invariance and cannot support the same useful predeclared class-coverage certificate; under the envelope-level certificate, it produces large violations at every tested instance, with a large mean gaming gap across random catalogs at a fixed audit budget. The semantic-envelope metric exhibits no such violation in the tested instances.
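Abstract (reference translation): Online-safety regulation under the UK Online Safety Act and the EU Digital Services Act increasingly treats scalar metrics as compliance evidence. A strategic platform can improve its score by routing recommendations through semantically equivalent content variants, without reducing true harm. We ask when such an audit metric can still certify a genuine reduction in harm. The protocol is modeled as a published transformation graph whose connected components form semantic classes, and the metric itself is treated as a security object. Three results follow. First, any metric that scores variants directly is manipulable as soon as two equivalent variants in a harmful class disagree in score. Second, the semantic-envelope lift, which assigns each variant the maximum score in its class, is the unique pointwise minimum among conservative classwise-constant repairs. Third, a class-stratified certificate, $H^\star(x) \le (1/\hat{\alpha}) M_{\mathrm{Env}(m)}(x) + \bar{\eta}$, holds for every platform strategy, with $\bar{\eta}$ absorbing annotation and protocol error. The claims are checked at three levels: exhaustive enumeration on a finite-state grid of mixed strategies, an SMT encoding in Z3 cross-replayed in cvc5, and a single-player MDP encoded in PRISM-games. Under the envelope-level certificate, the fragile metric produces large violations at every tested instance, with a large mean gaming gap across random catalogs at a fixed audit budget. The semantic-envelope metric exhibits no such violation in the tested instances.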
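
The semantic-envelope lift described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a higher score means more measured harm (so a platform wants a lower score), that the audit metric is the expected score under a mixed strategy, and that the transformation graph is undirected. The function names (`semantic_classes`, `envelope_lift`, `audit_metric`) and the toy catalog are hypothetical, introduced only for illustration.

```python
from collections import defaultdict


def semantic_classes(variants, edges):
    """Map each variant to a representative of its connected component
    in the (assumed undirected) transformation graph, via union-find."""
    parent = {v: v for v in variants}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    return {v: find(v) for v in variants}


def envelope_lift(score, classes):
    """Assign every variant the maximum raw score found in its class."""
    class_max = defaultdict(lambda: float("-inf"))
    for v, c in classes.items():
        class_max[c] = max(class_max[c], score[v])
    return {v: class_max[c] for v, c in classes.items()}


def audit_metric(strategy, score):
    """Expected score under a mixed strategy (variant -> probability).
    This form of the metric is an assumption of the sketch."""
    return sum(p * score[v] for v, p in strategy.items())


# Hypothetical toy catalog: h1 and h2 are equivalent harmful variants
# that disagree in raw score; b1 is benign and in its own class.
variants = ["h1", "h2", "b1"]
edges = [("h1", "h2")]
raw = {"h1": 0.9, "h2": 0.1, "b1": 0.0}

classes = semantic_classes(variants, edges)
env = envelope_lift(raw, classes)

honest = {"h1": 1.0}  # recommends the high-scoring variant
gamed = {"h2": 1.0}   # routes through the equivalent low-scoring variant

print(audit_metric(honest, raw), audit_metric(gamed, raw))  # 0.9 0.1
print(audit_metric(honest, env), audit_metric(gamed, env))  # 0.9 0.9
```

In this toy case the raw metric drops from 0.9 to 0.1 when the platform switches to an equivalent variant, although the content class is unchanged. The lifted metric stays at 0.9 for both strategies, which is the manipulation invariance the abstract attributes to the semantic-envelope metric.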


Related papers