A Survey on Large Language Model Benchmarks
- URL: http://arxiv.org/abs/2508.15361v1
- Date: Thu, 21 Aug 2025 08:43:35 GMT
- Title: A Survey on Large Language Model Benchmarks
- Authors: Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, Chengming Li, Ruifeng Xu, Le Sun, Min Yang
- Abstract summary: General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning. Domain-specific benchmarks focus on fields like natural sciences, humanities and social sciences, and engineering technology. Target-specific benchmarks address risks, reliability, agents, etc.
- Score: 45.042853171973086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, with the rapid expansion of large language models' capabilities in both depth and breadth, corresponding evaluation benchmarks have been emerging in ever-increasing numbers. As a quantitative assessment tool for model performance, benchmarks are not only a core means of measuring model capabilities but also a key element in guiding the direction of model development and promoting technological innovation. We present the first systematic review of the current status and development of large language model benchmarks, categorizing 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields like natural sciences, humanities and social sciences, and engineering technology; target-specific benchmarks address risks, reliability, agents, etc. We point out that current benchmarks suffer from problems such as inflated scores caused by data contamination, unfair evaluation due to cultural and linguistic biases, and a lack of evaluation of process credibility and dynamic environments, and we provide a reference design paradigm for future benchmark innovation.
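One problem named in the abstract, inflated scores from data contamination, is commonly probed with simple n-gram overlap checks between benchmark items and candidate training text. The sketch below only illustrates that general idea; it is not a procedure from the survey, and the function names, the default 8-word window, and the toy strings are all assumptions.

```python
# Illustrative contamination check (assumption, not the survey's method): flag
# benchmark items whose word-level n-grams also appear in a training corpus.
import re

def word_ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams (as tuples) in `text`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_items, corpus_docs, n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams |= word_ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items
                  if word_ngrams(item, n) & corpus_ngrams)
    return flagged / len(benchmark_items) if benchmark_items else 0.0

if __name__ == "__main__":
    items = ["What is the capital of France? Paris is the capital of France."]
    corpus = ["... Paris is the capital of France and its largest city ..."]
    print(f"contaminated fraction: {contamination_rate(items, corpus, n=6):.2f}")  # 1.00
```

In practice such overlap statistics are only a proxy; paraphrased or translated leakage requires stronger checks than exact n-gram matching.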
Related papers
- Benchmark^2: Systematic Evaluation of LLM Benchmarks [66.2731798872668]
We propose Benchmark^2, a comprehensive framework comprising three complementary metrics. We conduct experiments across 15 benchmarks spanning mathematics, reasoning, and knowledge domains. Our analysis reveals significant quality variations among existing benchmarks and demonstrates that selective benchmark construction can achieve comparable evaluation performance.
arXiv Detail & Related papers (2026-01-07T14:59:03Z) - The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation [1.2324085268373774]
We ask whether surpassing a benchmark truly demonstrates reasoning ability, or whether we are simply tracking numbers divorced from the capabilities we claim to measure. We present an investigation focused on three model families, from OpenAI, Anthropic, and Google, and how their reasoning capabilities have evolved over the years.
arXiv Detail & Related papers (2025-11-03T09:09:29Z) - Towards Ecologically Valid LLM Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners [2.0388938295521575]
Benchmarks play a significant role in how researchers and the public understand generative AI systems. The widespread use of benchmark scores to communicate about model capabilities has led to criticisms of their validity. In this work we explore how to create an LLM benchmark that addresses these issues by taking a human-centered approach.
arXiv Detail & Related papers (2025-09-30T21:36:23Z) - Deprecating Benchmarks: Criteria and Framework [2.6449913368815516]
We propose criteria for deciding when to fully or partially deprecate benchmarks, along with a framework for doing so. Our work aims to advance the state of benchmarking towards rigorous, high-quality evaluations, especially for frontier models.
arXiv Detail & Related papers (2025-07-08T22:29:06Z) - Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
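The notion of instance-specific criteria can be illustrated with a small LLM-as-judge prompt builder. This is a minimal sketch of the general pattern, not BiGGen Bench's actual schema; the field names and the 1-5 scale are assumptions.

```python
# Minimal sketch of instance-specific judging (field names and the 1-5 scale are
# assumptions, not BiGGen Bench's schema): each instance carries its own rubric,
# and a judge prompt for an evaluator LM is composed from it.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalInstance:
    instruction: str                  # task given to the model under test
    response: str                     # the model's answer to be graded
    criterion: str                    # rubric written for this specific instance
    reference: Optional[str] = None   # optional reference answer

def build_judge_prompt(inst: EvalInstance) -> str:
    """Compose a prompt asking an evaluator LM for a 1-5 score on the instance's criterion."""
    parts = [
        "You are grading a model response.",
        f"Task: {inst.instruction}",
        f"Response: {inst.response}",
        f"Criterion for this instance: {inst.criterion}",
    ]
    if inst.reference is not None:
        parts.append(f"Reference answer: {inst.reference}")
    parts.append("Give a score from 1 (fails the criterion) to 5 (fully satisfies it), "
                 "followed by a one-sentence justification.")
    return "\n".join(parts)

if __name__ == "__main__":
    inst = EvalInstance(
        instruction="Explain photosynthesis to a 10-year-old.",
        response="Plants use sunlight to turn air and water into food.",
        criterion="Uses age-appropriate language and avoids jargon.",
    )
    print(build_judge_prompt(inst))
```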
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy [27.454549324141087]
We propose a novel VQA benchmark based on well-known visual classification datasets.
We also suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category.
Our contributions aim to lay the foundation for more precise and meaningful assessments.
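The follow-up-question idea can be sketched with a toy label hierarchy: once a coarse open-ended answer is accepted, a finer-grained question about the ground-truth category is generated automatically. The hierarchy, template, and function name below are illustrative assumptions, not the benchmark's actual construction.

```python
# Illustrative sketch (hierarchy, template, and names are assumptions): generate a
# finer-grained follow-up question from a class-label hierarchy.
from typing import Optional

# Toy label hierarchy: fine-grained class -> coarse parent class
HIERARCHY = {
    "golden retriever": "dog",
    "beagle": "dog",
    "siamese cat": "cat",
}

def followup_question(ground_truth_label: str) -> Optional[str]:
    """If the ground-truth label has a coarser parent, ask for the specific subcategory."""
    parent = HIERARCHY.get(ground_truth_label)
    if parent is None:
        return None
    return (f"You said the image shows a {parent}. "
            f"What specific kind of {parent} is it?")

if __name__ == "__main__":
    print(followup_question("golden retriever"))
    # -> "You said the image shows a dog. What specific kind of dog is it?"
```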
arXiv Detail & Related papers (2024-02-11T18:26:18Z) - Advancing the Evaluation of Traditional Chinese Language Models: Towards a Comprehensive Benchmark Suite [17.764840326809797]
We propose a novel set of benchmarks that leverage existing English datasets and are tailored to evaluate language models in Traditional Chinese.
These benchmarks encompass a wide range of tasks, including contextual question-answering, summarization, classification, and table understanding.
In this paper, we evaluate the performance of GPT-3.5, Taiwan-LLaMa-v1.0, and Model 7-C, our proprietary model, on these benchmarks.
arXiv Detail & Related papers (2023-09-15T14:52:23Z) - MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese versions.
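Multiple-choice benchmarks of this kind are typically scored as exact-match accuracy over option letters; the minimal sketch below shows that generic scoring step only and is not MMBench's own evaluation protocol.

```python
# Generic multiple-choice scoring sketch (not MMBench's protocol): predictions and
# gold answers are option letters, and the score is exact-match accuracy.
def multiple_choice_accuracy(predictions, answers) -> float:
    """Exact-match accuracy over option letters such as 'A', 'B', 'C', 'D'."""
    assert len(predictions) == len(answers), "mismatched prediction/answer counts"
    if not answers:
        return 0.0
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

if __name__ == "__main__":
    preds = ["A", "c", "B", "D"]
    gold = ["A", "C", "D", "D"]
    print(f"accuracy = {multiple_choice_accuracy(preds, gold):.2f}")  # 0.75
```

A fuller protocol would also report per-language splits (English vs. Chinese) and guard against option-position bias, for example by re-asking each question with shuffled option orders.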
arXiv Detail & Related papers (2023-07-12T16:23:09Z) - AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models [122.63704560157909]
We introduce AGIEval, a novel benchmark designed to assess foundation models in the context of human-centric standardized exams.
We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003.
GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam.
arXiv Detail & Related papers (2023-04-13T09:39:30Z) - CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark.
arXiv Detail & Related papers (2021-12-27T11:08:58Z)