Fugu-MT Paper Translation (Summary): Emergent evaluation hubs in a decentralizing large language model ecosystem

Paper summary: Emergent evaluation hubs in a decentralizing large language model ecosystem

arxiv url: http://arxiv.org/abs/2510.01286v1
Date: Tue, 30 Sep 2025 23:49:26 GMT
Status: Translation complete
Last updated in system: 2025-10-03 16:59:20.790687
Title: Emergent evaluation hubs in a decentralizing large language model ecosystem
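Title (reference translation): Emergent evaluation hubs in a decentralizing large language model ecosystem
Authors: Manuel Cebrian, Tomomi Kito, Raul Castro Fernandez
Abstract summary: Large language models are proliferating, and so are the benchmarks that serve as their common yardsticks. We ask how the agglomeration patterns of these two layers compare. We find complementary but contrasting dynamics.
Reference score (independently computed attention measure): 4.5311655360445515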
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models are proliferating, and so are the benchmarks that serve as their common yardsticks. We ask how the agglomeration patterns of these two layers compare: do they evolve in tandem or diverge? Drawing on two curated proxies for the ecosystem, the Stanford Foundation-Model Ecosystem Graph and the Evidently AI benchmark registry, we find complementary but contrasting dynamics. Model creation has broadened across countries and organizations and diversified in modality, licensing, and access. Benchmark influence, by contrast, displays centralizing patterns: in the inferred benchmark-author-institution network, the top 15% of nodes account for over 80% of high-betweenness paths, three countries produce 83% of benchmark outputs, and the global Gini for inferred benchmark authority reaches 0.89. An agent-based simulation highlights three mechanisms: higher entry of new benchmarks reduces concentration; rapid inflows can temporarily complicate coordination in evaluation; and stronger penalties against over-fitting have limited effect. Taken together, these results suggest that concentrated benchmark influence functions as coordination infrastructure that supports standardization, comparability, and reproducibility amid rising heterogeneity in model production, while also introducing trade-offs such as path dependence, selective visibility, and diminishing discriminative power as leaderboards saturate.
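Abstract (reference translation): Large language models are proliferating, and so are the benchmarks that serve as their common yardsticks. How do the agglomeration patterns of these two layers compare? Drawing on two curated proxies for the ecosystem, the Stanford Foundation-Model Ecosystem Graph and the Evidently AI benchmark registry, we find complementary but contrasting dynamics. Model creation has spread across countries and organizations and diversified in modality, licensing, and access. Benchmark influence, by contrast, shows centralizing patterns: in the inferred benchmark-author-institution network, the top 15% of nodes account for over 80% of high-betweenness paths, three countries produce 83% of benchmark outputs, and the global Gini for inferred benchmark authority reaches 0.89. An agent-based simulation highlights three mechanisms: higher entry of new benchmarks reduces concentration, rapid inflows temporarily complicate coordination in evaluation, and stronger penalties against over-fitting have limited effect. These results suggest that, as heterogeneity in model production rises, concentrated benchmark influence functions as coordination infrastructure supporting standardization, comparability, and reproducibility, while introducing trade-offs such as path dependence, selective visibility, and diminishing discriminative power as leaderboards saturate.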
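
The abstract reports a global Gini of 0.89 for inferred benchmark authority. As a reading aid, here is a minimal sketch of how a Gini coefficient is computed from a list of non-negative scores, using the standard rank-weighted formula. The function name `gini` and the sample score lists are hypothetical illustrations. They are not the paper's data, and the abstract does not say how the authors derived their authority scores.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    0.0 means every node holds the same share; values approaching 1.0
    mean a few nodes hold almost everything.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    # with x sorted ascending and ranks i starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical authority scores, for illustration only.
equal_scores = [1, 1, 1, 1]       # authority spread evenly over four nodes
single_hub = [0, 0, 0, 10]        # one node holds all authority

print(gini(equal_scores))
print(gini(single_hub))
```

With four nodes, the evenly spread case gives 0.0 and the single-hub case gives 0.75, the maximum for that size, (n - 1) / n. A value of 0.89 therefore indicates authority concentrated in a small fraction of a much larger set of nodes.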

Related papers