Fugu-MT paper translation (abstract): AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance

Paper overview: AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance

arxiv url: http://arxiv.org/abs/2604.12875v1
Date: Tue, 14 Apr 2026 15:26:03 GMT
Status: Translation complete
Last updated in system: 2026-04-15 19:11:32.534644
Title: AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance
Title (reference translation): AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks
Authors: Abiodun A. Solanke
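Abstract summary: We present AISafetyBenchExplorer, a structured catalogue of 195 AI safety benchmarks released between 2018 and 2026. Benchmark proliferation has outpaced measurement standardization. At the metric level, we show that familiar labels such as accuracy, F1 score, safety score, and aggregate benchmark scores often conceal materially different judges, aggregation rules, and threat models.
Reference score (independently computed attention score): 0.0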
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid expansion of large language model (LLM) safety evaluation has produced a substantial benchmark ecosystem, but not a correspondingly coherent measurement ecosystem. We present AISafetyBenchExplorer, a structured catalogue of 195 AI safety benchmarks released between 2018 and 2026, organized through a multi-sheet schema that records benchmark-level metadata, metric-level definitions, benchmark-paper metadata, and repository activity. This design enables meta-analysis not only of what benchmarks exist, but also of how safety is operationalized, aggregated, and judged across the literature. Using the updated catalogue, we identify a central structural problem: benchmark proliferation has outpaced measurement standardization. The current landscape is dominated by medium-complexity benchmarks (94/195), while only 7 benchmarks occupy the Popular tier. The workbook further reports strong concentration around English-only evaluation (165/195), evaluation-only resources (170/195), stale GitHub repositories (137/195), stale Hugging Face datasets (96/195), and heavy reliance on arXiv preprints among benchmarks with known venue metadata. At the metric level, the catalogue shows that familiar labels such as accuracy, F1 score, safety score, and aggregate benchmark scores often conceal materially different judges, aggregation rules, and threat models. We argue that the field's main failure mode is fragmentation rather than scarcity. Researchers now have many benchmark artifacts, but they often lack a shared measurement language, a principled basis for benchmark selection, and durable stewardship norms for post-publication maintenance. AISafetyBenchExplorer addresses this gap by providing a traceable benchmark catalogue, a controlled metadata schema, and a complexity taxonomy that together support more rigorous benchmark discovery, comparison, and meta-evaluation.
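Abstract (reference translation): The rapid expansion of large language model (LLM) safety evaluation has produced a substantial benchmark ecosystem, but not a correspondingly coherent measurement ecosystem. We present AISafetyBenchExplorer, a structured catalogue of 195 AI safety benchmarks released between 2018 and 2026. This design enables meta-analysis not only of which benchmarks exist, but also of how safety is operationalized, aggregated, and judged across the literature. Using the updated catalogue, we identify a central structural problem: benchmark proliferation has outpaced measurement standardization. The current landscape is dominated by medium-complexity benchmarks (94/195), while only 7 benchmarks occupy the Popular tier. The workbook further reports strong concentration around English-only evaluation (165/195), evaluation-only resources (170/195), stale GitHub repositories (137/195), stale Hugging Face datasets (96/195), and heavy reliance on arXiv preprints among benchmarks with known venue metadata. At the metric level, the catalogue shows that familiar labels such as accuracy, F1 score, safety score, and aggregate benchmark scores often conceal materially different judges, aggregation rules, and threat models. We argue that the field's main failure mode is fragmentation rather than scarcity. Researchers now have many benchmark artifacts, but they often lack a shared measurement language, a principled basis for benchmark selection, and durable stewardship norms for post-publication maintenance. AISafetyBenchExplorer addresses this gap by providing a traceable benchmark catalogue, a controlled metadata schema, and a complexity taxonomy that together support more rigorous benchmark discovery, comparison, and meta-evaluation.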
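
The abstract reports its findings as raw counts out of 195 benchmarks. As a reading aid, the minimal sketch below converts those counts into percentages. The counts are taken from the abstract; the dictionary keys and the `shares` helper are illustrative names chosen here, not fields of the catalogue's actual schema.

```python
# Counts as reported in the abstract, each out of 195 catalogued benchmarks.
# Key names are hypothetical labels, not the catalogue's own column names.
TOTAL = 195
reported_counts = {
    "medium_complexity": 94,
    "popular_tier": 7,
    "english_only": 165,
    "evaluation_only": 170,
    "stale_github_repo": 137,
    "stale_hf_dataset": 96,
}


def shares(counts, total):
    """Return each count as a percentage of total, rounded to one decimal."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


percentages = shares(reported_counts, TOTAL)
for name, pct in percentages.items():
    print(f"{name}: {reported_counts[name]}/{TOTAL} = {pct}%")
```

Read this way, roughly 85% of the catalogued benchmarks are English-only and about 70% have stale GitHub repositories, which is the concentration the abstract describes.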


Related papers