AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
- URL: http://arxiv.org/abs/2304.06364v2
- Date: Mon, 18 Sep 2023 14:23:02 GMT
- Title: AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
- Authors: Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin
Wang, Amin Saied, Weizhu Chen and Nan Duan
- Abstract summary: We introduce AGIEval, a novel benchmark designed to assess foundation models in the context of human-centric standardized exams.
We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003.
GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam.
- Score: 122.63704560157909
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating the general abilities of foundation models to tackle human-level
tasks is a vital aspect of their development and application in the pursuit of
Artificial General Intelligence (AGI). Traditional benchmarks, which rely on
artificial datasets, may not accurately represent human-level capabilities. In
this paper, we introduce AGIEval, a novel benchmark specifically designed to
assess foundation models in the context of human-centric standardized exams,
such as college entrance exams, law school admission tests, math competitions,
and lawyer qualification tests. We evaluate several state-of-the-art foundation
models, including GPT-4, ChatGPT, and Text-Davinci-003, using this benchmark.
Impressively, GPT-4 surpasses average human performance on SAT, LSAT, and math
competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5%
accuracy on the English test of the Chinese national college entrance exam.
This demonstrates the extraordinary performance of contemporary foundation
models. In contrast, we also find that GPT-4 is less proficient in tasks that
require complex reasoning or specific domain knowledge. Our comprehensive
analyses of model capabilities (understanding, knowledge, reasoning, and
calculation) reveal these models' strengths and limitations, providing valuable
insights into future directions for enhancing their general capabilities. By
concentrating on tasks pertinent to human cognition and decision-making, our
benchmark delivers a more meaningful and robust evaluation of foundation
models' performance in real-world scenarios. The data, code, and all model
outputs are released at https://github.com/ruixiangcui/AGIEval.
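To make the evaluation protocol above concrete, here is a minimal sketch, assuming the usual setup of multiple-choice exam questions scored by exact-match accuracy per task; the record fields and helper names below are illustrative assumptions, not the released AGIEval code or data format.

```python
# Minimal sketch: per-task exact-match accuracy for multiple-choice exams,
# in the spirit of AGIEval-style evaluation. The "task", "gold", and
# "prediction" fields are assumed for illustration, not the released format.
from collections import defaultdict

def per_task_accuracy(records):
    """Return {task: accuracy} given records with gold and predicted choices."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        if r["prediction"].strip().upper() == r["gold"].strip().upper():
            correct[r["task"]] += 1
    return {task: correct[task] / total[task] for task in total}

if __name__ == "__main__":
    demo = [
        {"task": "sat-math", "gold": "B", "prediction": "B"},
        {"task": "sat-math", "gold": "D", "prediction": "A"},
        {"task": "lsat-lr", "gold": "C", "prediction": "C"},
    ]
    print(per_task_accuracy(demo))  # {'sat-math': 0.5, 'lsat-lr': 1.0}
```

Reported numbers such as the 95% accuracy on SAT Math correspond to this kind of per-task ratio of correctly answered questions.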
Related papers
- Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming [22.344985623878408]
State-of-the-art models like GPT-4o and Llama3 barely match the performance of an average school student.
We fine-tune these models using a novel synthetic data generation methodology.
We will release the full implementation and datasets to facilitate further research on enhancing computational thinking in generative models.
arXiv Detail & Related papers (2024-06-14T10:02:52Z)
- Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps.
We show that ReasonEval achieves state-of-the-art performance on human-labeled datasets.
We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
- OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems [62.06169250463104]
We present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions.
The best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics.
Our in-depth analysis of GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies.
arXiv Detail & Related papers (2024-02-21T18:49:26Z)
- MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts [170.01089233942594]
MathVista is a benchmark designed to combine challenges from diverse mathematical and visual tasks.
The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%.
GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning.
arXiv Detail & Related papers (2023-10-03T17:57:24Z)
- Evaluating the Generation Capabilities of Large Chinese Language Models [27.598864484231477]
This paper unveils CG-Eval, the first-ever comprehensive and automated evaluation framework.
It assesses the generative capabilities of large Chinese language models across a spectrum of academic disciplines.
Gscore automates the quality measurement of a model's text generation against reference standards.
arXiv Detail & Related papers (2023-08-09T09:22:56Z)
- ARB: Advanced Reasoning Benchmark for Large Language Models [94.37521840642141]
We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields.
As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge.
We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on more demanding tasks.
arXiv Detail & Related papers (2023-07-25T17:55:19Z)
- COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences [21.11065466376105]
Commonsense reasoning is intuitive for humans but has been a long-term challenge for artificial intelligence (AI).
Recent advancements in pretrained language models have shown promising results on several commonsense benchmark datasets.
We introduce a new commonsense reasoning benchmark dataset comprising natural language true/false statements.
arXiv Detail & Related papers (2021-06-02T06:31:55Z)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models (see the sketch after this list).
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
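The overstability finding in the essay-scoring entry above suggests a simple perturbation check: append off-topic sentences to an essay and compare scores before and after. The sketch below is a hypothetical illustration with a dummy scorer; it is not the toolkit released by that paper.

```python
# Hypothetical overstability check for an automatic essay scorer: inject
# off-topic sentences and compare scores. The scorer here is a dummy stand-in,
# not the evaluation toolkit from the paper above.
def dummy_scorer(essay: str) -> float:
    # Placeholder model: longer essays score higher (capped at 10).
    return min(10.0, len(essay.split()) / 50.0)

def overstability_gap(essay: str, off_topic: list[str], scorer=dummy_scorer) -> float:
    """Score change after appending off-topic content.

    A robust scorer should show a clear positive drop; a near-zero or
    negative value means irrelevant content is not being penalized.
    """
    perturbed = essay + " " + " ".join(off_topic)
    return scorer(essay) - scorer(perturbed)

if __name__ == "__main__":
    essay = "The industrial revolution reshaped labor markets. " * 20
    noise = ["Bananas are rich in potassium."] * 10
    print(f"score change: {overstability_gap(essay, noise):+.2f}")
```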
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information (including all content) and is not responsible for any consequences arising from its use.