Fugu-MT Paper Translation (Abstract): Knowledge Index of Noah's Ark

Paper abstract: Knowledge Index of Noah's Ark

arxiv url: http://arxiv.org/abs/2606.05104v2
Date: Thu, 04 Jun 2026 05:37:51 GMT
Status: Translation complete
Last updated in system: 2026-06-05 19:21:33.396619
Title: Knowledge Index of Noah's Ark
Title (reference translation): Knowledge Index of Noah's Ark
Authors: Sheng Jin, Minghao Liu, Yunze Xiao, Zeqi Zhou, Heli Qi, Yifan Yao, Meishu Song, Kaijing Ma, Xuan Zhang, Sicong Jiang, Yizhe Li, Ningshan Ma, Jie Wei, Ziniu Li, Minglai Yang, Bangya Liu, Yiming Liang, Xiao Fang, Qingcheng Zeng, Jiarui Liu, Rui Yang, Shen Yan, Wenhao Huang, Jiaheng Liu, Zihan Wang, Weihao Xuan, Ge Zhang
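Abstract summary: KINA is an 899-item benchmark across 261 disciplines. The paper shows that a bonus-on-bar tournament weakly FOSD-dominates flat payment. The top model, Gemini-3.1-Pro-Preview, reaches 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%.
Reference score (independently computed attention level): 63.143852586221534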
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors and operationalize disciplinary representativeness through a proxy, yielding a (1-1/e) greedy approximation (Proposition 1); the guarantee applies to the proxy, not to population representativeness. Second, we prove a bonus-on-bar tournament weakly FOSD-dominates flat payment in released-review quality, with incentive-compatibility threshold B > Delta C / Delta p_min (Theorem 1). Evaluating 42 models from 13 labs, the top model, Gemini-3.1-Pro-Preview, reaches 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, leaving substantial headroom below saturation. The full leaderboard shows a tiered structure rather than a smooth total order: a small frontier tier lies above 48%, a dense strong-model tier spans roughly 38-45%, and low-performing models remain only modestly above the 10% chance baseline. Tool augmentation adds up to 5.17 points across the five tool-use evaluations, with gains varying substantially across models. We report bootstrap ranking-stability statistics to make bounded-budget variance explicit and to discourage over-interpretation of adjacent ranks.
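The (1-1/e) factor cited for Proposition 1 is the classic guarantee for greedy selection under a monotone submodular coverage objective with a cardinality budget. The following is a minimal sketch of that greedy rule, not the authors' procedure: the abstract does not specify the proxy objective, so the function name `greedy_max_coverage`, the item ids, and the anchor ids are all hypothetical.

```python
from typing import Dict, Hashable, List, Set


def greedy_max_coverage(
    candidates: Dict[Hashable, Set[Hashable]], budget: int
) -> List[Hashable]:
    """Pick up to `budget` candidates, each time taking the one that
    covers the most anchors not yet covered (ties: first in dict order)."""
    covered: Set[Hashable] = set()
    chosen: List[Hashable] = []
    remaining = dict(candidates)
    for _ in range(budget):
        if not remaining:
            break
        best = max(remaining, key=lambda k: len(remaining[k] - covered))
        if not remaining[best] - covered:
            break  # no remaining candidate adds new coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Hypothetical toy data: item id -> set of anchor ids the item covers.
items = {
    "q1": {"a", "b", "c"},
    "q2": {"c", "d"},
    "q3": {"d", "e", "f", "g"},
    "q4": {"a"},
}
selected = greedy_max_coverage(items, budget=2)
covered_anchors = set().union(*(items[i] for i in selected))
print(selected, sorted(covered_anchors))
```

As the abstract itself notes, a bound of this kind holds for the proxy objective being maximized, not for representativeness of the underlying population.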
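Abstract (reference translation): Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and ranking instability under bounded test budgets. KINA is an 899-item benchmark across 261 disciplines, with two formal results. First, representativeness is cast as a coverage-style objective over expert-elicited anchors and operationalized through a proxy, yielding a (1-1/e) greedy approximation (Proposition 1); the guarantee applies to the proxy, not to population representativeness. Second, a bonus-on-bar tournament is shown to weakly FOSD-dominate flat payment in released-review quality, with incentive-compatibility threshold B > Delta C / Delta p_min (Theorem 1). The top model, Gemini-3.1-Pro-Preview, scored 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%. A small frontier tier lies above 48%, a dense strong-model tier spans roughly 38-45%, and low-performing models are only modestly above the 10% chance baseline. Tool augmentation adds up to 5.17 points across the five tool-use evaluations, with gains varying substantially across models. The paper reports bootstrap ranking-stability statistics to make bounded-budget variance explicit and to discourage over-interpretation of adjacent ranks.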
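One way to read "bootstrap ranking-stability statistics" is to resample the test items with replacement and count how often each model comes out on top. The sketch below illustrates that idea under stated assumptions: the abstract does not say which statistics the authors compute, so `top_rank_frequency`, the model names, and the per-item correctness vectors are hypothetical.

```python
import random
from typing import Dict, List


def top_rank_frequency(
    correct: Dict[str, List[int]], n_resamples: int = 1000, seed: int = 0
) -> Dict[str, float]:
    """Fraction of bootstrap resamples in which each model attains the
    best score. Tied models all count as top for that resample."""
    rng = random.Random(seed)
    n_items = len(next(iter(correct.values())))
    wins = {model: 0 for model in correct}
    for _ in range(n_resamples):
        # Resample item indices with replacement; reuse them for every model.
        idx = [rng.randrange(n_items) for _ in range(n_items)]
        scores = {m: sum(v[i] for i in idx) for m, v in correct.items()}
        best = max(scores.values())
        for model, score in scores.items():
            if score == best:
                wins[model] += 1
    return {m: w / n_resamples for m, w in wins.items()}


# Hypothetical per-item correctness (1 = correct) on a 10-item test set.
results = {
    "model_a": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "model_b": [1, 1, 1, 1, 1, 1, 1, 0, 1, 0],
    "model_c": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
}
freq = top_rank_frequency(results)
print(freq)
```

When two models differ by only a few items, as `model_a` and `model_b` do here, neither is on top in every resample, which is the kind of adjacent-rank uncertainty the abstract cautions against over-interpreting.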


Related papers