Fugu-MT Paper Translation (Abstract): A Survey on Large Language Model Benchmarks

Paper overview: A Survey on Large Language Model Benchmarks

arxiv url: http://arxiv.org/abs/2508.15361v1
Date: Thu, 21 Aug 2025 08:43:35 GMT
Status: Translation complete
Last updated in system: 2025-08-22 16:26:46.242998
Title: A Survey on Large Language Model Benchmarks
Title (reference translation): A Survey on Large Language Model Benchmarks
Authors: Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, Chengming Li, Ruifeng Xu, Le Sun, Min Yang
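Abstract summary: General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning. Domain-specific benchmarks focus on fields such as natural sciences, humanities and social sciences, and engineering technology. Target-specific benchmarks address risks, reliability, agents, and similar concerns.
Reference score (independently computed attention score): 45.042853171973086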
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In recent years, with the rapid development of the depth and breadth of large language models' capabilities, various corresponding evaluation benchmarks have been emerging in increasing numbers. As a quantitative assessment tool for model performance, benchmarks are not only a core means to measure model capabilities but also a key element in guiding the direction of model development and promoting technological innovation. We systematically review the current status and development of large language model benchmarks for the first time, categorizing 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields like natural sciences, humanities and social sciences, and engineering technology; target-specific benchmarks pay attention to risks, reliability, agents, etc. We point out that current benchmarks have problems such as inflated scores caused by data contamination, unfair evaluation due to cultural and linguistic biases, and lack of evaluation on process credibility and dynamic environments, and provide a referable design paradigm for future benchmark innovation.
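Abstract (reference translation): In recent years, as the depth and breadth of large language model capabilities have developed rapidly, evaluation benchmarks of many kinds have been increasing in number. As a quantitative tool for assessing model performance, benchmarks are not only a core means of measuring model capabilities but also a key element in guiding the direction of model development and promoting technological innovation. This work systematically reviews the current status and development of large language model benchmarks and classifies 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields such as natural sciences, humanities and social sciences, and engineering technology; target-specific benchmarks address risks, reliability, agents, and the like. The authors point out that current benchmarks suffer from problems such as inflated scores caused by data contamination, unfair evaluation due to cultural and linguistic biases, and a lack of evaluation of process credibility and dynamic environments, and they provide a referable design paradigm for future benchmark innovation.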


Related papers