Fugu-MT Paper Translation (Abstract): When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

Paper overview: When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

arxiv url: http://arxiv.org/abs/2605.06652v1
Date: Thu, 07 May 2026 17:56:41 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:12.071247
Title: When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
Authors: Sushant Gautam, Finn Schwall, Annika Willoch Olstad, Fernando Vallecillos Ruiz, Birk Torpmann-Hagen, Sunniva Maria Stordal Bjørklund, Leon Moonen, Klas Pettersen, Michael A. Riegler
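Abstract summary: Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget.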
Reference score (site-computed attention measure): 34.86529553336423
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget. Because no labels are available, we replace ground-truth agreement with an instrumental-validity chain: responsiveness to a controlled safe-versus-abliterated contrast, dominance of target-driven variance over auditor and judge artifacts, and stability across reruns. We instantiate the chain in SimpleAudit, a local-first scoring instrument, and validate it on a Norwegian safety pack. Safe and abliterated targets separate with AUROC values between 0.89 and 1.00, target identity is the dominant variance component ($η^2 \approx 0.52$), and severity profiles stabilize by ten reruns. Applying the same chain to Petri shows that it admits both tools. The substantial differences arise upstream of the chain, in claim-contract enforcement and deployment fit. A Norwegian public-sector procurement case comparing Borealis and Gemma 3 demonstrates the resulting evidence in practice: the safer model depends on scenario category and risk measure. Consequently, scores, matched deltas, critical rates, uncertainty, and the auditor and judge used must be reported together rather than collapsed into a single ranking.
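The abstract reports two summary statistics, an AUROC for the safe-versus-abliterated contrast and an $η^2$ for the share of variance attributable to target identity. As a minimal sketch of what such quantities measure, the Python below computes a pairwise (Mann-Whitney style) AUROC and a one-way $η^2$. The function names, the toy severity scores, and the one-way simplification are all assumptions for illustration; the paper's actual estimators, which also account for auditor and judge effects, are not described in the abstract and may differ.

```python
from statistics import mean


def auroc(safe_scores, unsafe_scores):
    """Fraction of (unsafe, safe) pairs where the unsafe run scores higher.

    Ties count as half a win. Assumes higher score = more severe.
    """
    wins = 0.0
    for u in unsafe_scores:
        for s in safe_scores:
            if u > s:
                wins += 1.0
            elif u == s:
                wins += 0.5
    return wins / (len(safe_scores) * len(unsafe_scores))


def eta_squared(groups):
    """One-way eta^2 = SS_between / SS_total for {group name: [scores]}."""
    all_scores = [x for xs in groups.values() for x in xs]
    grand = mean(all_scores)
    ss_total = sum((x - grand) ** 2 for x in all_scores)
    ss_between = sum(
        len(xs) * (mean(xs) - grand) ** 2 for xs in groups.values()
    )
    return ss_between / ss_total


# Hypothetical severity scores, invented for this sketch (not from the paper).
safe = [0.0, 1.0, 1.0, 2.0]
abliterated = [2.0, 3.0, 4.0, 4.0]

print("AUROC:", auroc(safe, abliterated))
print("eta^2:", eta_squared({"safe": safe, "abliterated": abliterated}))
```

An AUROC near 1.0 means the instrument almost always scores the abliterated target as more severe than the safe one. An $η^2$ near 1.0 means most score variance is explained by which target was audited rather than by other sources.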
