Fugu-MT paper translation (abstract): OKBench: Democratizing LLM Evaluation with Fully Automated, On-Demand, Open Knowledge Benchmarking

Paper abstract: OKBench: Democratizing LLM Evaluation with Fully Automated, On-Demand, Open Knowledge Benchmarking

arxiv url: http://arxiv.org/abs/2511.08598v1
Date: Fri, 31 Oct 2025 16:44:34 GMT
Status: translation complete
Last updated in system: 2025-11-16 06:38:31.068386
Title: OKBench: Democratizing LLM Evaluation with Fully Automated, On-Demand, Open Knowledge Benchmarking
Title (reference translation): OKBench: Democratizing LLM evaluation through fully automated, on-demand, open knowledge benchmarking
Authors: Yanhong Li, Tianyang Xu, Kenan Tang, Karen Livescu, David McAllester, Jiawei Zhou
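Abstract summary: OKBench is an agentic framework that automates the sourcing, creation, validation, and distribution of benchmarks. The results reveal distinct model behaviors when confronted with new information and show how retrieval narrows the performance gap between small and large models.
Reference score (attention measure computed independently by this site): 47.579237867766686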
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Knowledge-intensive question answering is central to large language models (LLMs) and is typically assessed using static benchmarks derived from sources like Wikipedia and textbooks. However, these benchmarks fail to capture evolving knowledge in a dynamic world, and centralized curation struggles to keep pace with rapid LLM advancements. To address these drawbacks, we propose Open Knowledge Bench (OKBench), a fully automated framework for generating high-quality, dynamic knowledge benchmarks on demand. Focusing on the news domain where knowledge updates daily, OKBench is an agentic framework that automates the sourcing, creation, validation, and distribution of benchmarks. Our approach democratizes benchmark creation and facilitates thorough evaluation of retrieval-augmented methods by reducing overlap with pretraining data. We evaluate our framework on a wide range of open-source and proprietary LLMs of various sizes and configurations, both with and without retrieval over freshly generated knowledge. Our results reveal distinct model behaviors when confronted with new information and highlight how retrieval narrows the performance gap between small and large models. These findings underscore the importance of evaluating LLMs on evolving knowledge benchmarks.
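Abstract (reference translation): Knowledge-intensive question answering is central to large language models (LLMs) and is typically evaluated with static benchmarks derived from sources such as Wikipedia and textbooks. These benchmarks, however, cannot capture evolving knowledge in a dynamic world, and centralized curation struggles to keep pace with the rapid progress of LLMs. To address these shortcomings, we propose Open Knowledge Bench (OKBench), a fully automated framework for generating high-quality, dynamic knowledge benchmarks on demand. Focusing on the news domain, where knowledge is updated daily, OKBench is an agentic framework that automates the sourcing, creation, validation, and distribution of benchmarks. The proposed approach democratizes benchmark creation and, by reducing overlap with pretraining data, facilitates thorough evaluation of retrieval-augmented methods. We evaluate the framework on a wide range of open-source and proprietary LLMs of various sizes and configurations, both with and without retrieval over freshly generated knowledge. The results reveal distinct model behaviors when confronted with new information and show how retrieval narrows the performance gap between small and large models. These findings underscore the importance of evaluating LLMs on evolving knowledge benchmarks.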
