ARB: Advanced Reasoning Benchmark for Large Language Models
- URL: http://arxiv.org/abs/2307.13692v2
- Date: Fri, 28 Jul 2023 03:31:08 GMT
- Title: ARB: Advanced Reasoning Benchmark for Large Language Models
- Authors: Tomohiro Sawada, Daniel Paleka, Alexander Havrilla, Pranav Tadepalli,
Paula Vidas, Alexander Kranias, John J. Nay, Kshitij Gupta, Aran Komatsuzaki
- Abstract summary: We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields.
As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge.
We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on more demanding tasks.
- Score: 94.37521840642141
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable performance on
various quantitative reasoning and knowledge benchmarks. However, many of these
benchmarks are losing utility as LLMs get increasingly high scores, despite not
yet reaching expert performance in these domains. We introduce ARB, a novel
benchmark composed of advanced reasoning problems in multiple fields. ARB
presents a more challenging test than prior benchmarks, featuring problems in
mathematics, physics, biology, chemistry, and law. As a subset of ARB, we
introduce a challenging set of math and physics problems which require advanced
symbolic reasoning and domain knowledge. We evaluate recent models such as
GPT-4 and Claude on ARB and demonstrate that current models score well below
50% on more demanding tasks. In order to improve both automatic and assisted
evaluation capabilities, we introduce a rubric-based evaluation approach,
allowing GPT-4 to score its own intermediate reasoning steps. Further, we
conduct a human evaluation of the symbolic subset of ARB, finding promising
agreement between annotators and GPT-4 rubric evaluation scores.
Related papers
- GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering [0.0]
Retrieval-Augmented Generation (RAG) has emerged as a common paradigm to use Large Language Models (LLMs) alongside private and up-to-date knowledge bases.
We address the challenges of using LLM-as-a-Judge when evaluating grounded answers generated by RAG systems.
arXiv Detail & Related papers (2024-09-10T15:39:32Z) - Sources of Gain: Decomposing Performance in Conditional Average Dose Response Estimation [0.9332308328407303]
Estimating conditional average dose responses (CADR) is an important but challenging problem.
Our paper analyses this practice and shows that using popular benchmark datasets without further analysis is insufficient to judge model performance.
We propose a novel decomposition scheme that allows the evaluation of the impact of five distinct components contributing to CADR estimator performance.
arXiv Detail & Related papers (2024-06-12T13:39:32Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems [62.06169250463104]
We present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions.
The best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics.
Our analysis orienting GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies.
arXiv Detail & Related papers (2024-02-21T18:49:26Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models(LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For
Large Language Models [23.344490944210456]
We present 515Bench, a more challenging benchmark dataset for evaluating the problem solving abilities of large language models (LLMs)
We curate challenging pre-engineering mathematics, physics and chemistry problems from the highly competitive IIT-Advanced exam.
Our evaluation on various open-source and proprietary models reveals that the highest performance, even after using techniques like self-consistency, self-refinement and chain-of-thought prompting, is less than 40%.
arXiv Detail & Related papers (2023-05-24T11:55:59Z) - AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models [122.63704560157909]
We introduce AGIEval, a novel benchmark designed to assess foundation model in the context of human-centric standardized exams.
We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003.
GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam.
arXiv Detail & Related papers (2023-04-13T09:39:30Z) - Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring
Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications(as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.