Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For
Large Language Models
- URL: http://arxiv.org/abs/2305.15074v3
- Date: Mon, 23 Oct 2023 11:55:58 GMT
- Authors: Daman Arora, Himanshu Gaurav Singh, Mausam
- Abstract summary: We present JEEBench, a more challenging benchmark dataset for evaluating the problem-solving abilities of large language models (LLMs).
We curate 515 challenging pre-engineering mathematics, physics and chemistry problems from the highly competitive IIT JEE-Advanced exam.
Our evaluation on various open-source and proprietary models reveals that the highest performance, even after using techniques like self-consistency, self-refinement and chain-of-thought prompting, is less than 40%.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance of large language models (LLMs) on existing reasoning
benchmarks has significantly improved over the past years. In response, we
present JEEBench, a considerably more challenging benchmark dataset for
evaluating the problem solving abilities of LLMs. We curate 515 challenging
pre-engineering mathematics, physics and chemistry problems from the highly
competitive IIT JEE-Advanced exam. Long-horizon reasoning on top of deep
in-domain knowledge is essential for solving problems in this benchmark. Our
evaluation on various open-source and proprietary models reveals that the
highest performance, even after using techniques like self-consistency,
self-refinement and chain-of-thought prompting, is less than 40%. The typical
failure modes of GPT-4, the best model, are errors in algebraic manipulation,
difficulty in grounding abstract concepts into mathematical equations
accurately and failure in retrieving relevant domain-specific concepts. We also
observe that by mere prompting, GPT-4 is unable to assess risk introduced by
negative marking for incorrect answers. For this, we develop a post-hoc
confidence-thresholding method over self-consistency, which enables effective
response selection. We hope that our challenging benchmark will guide future
research in problem-solving using LLMs.
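The post-hoc confidence-thresholding idea described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function name, the threshold value, and the abstain-on-low-confidence policy are assumptions. The intuition is that under negative marking, the model should answer only when the majority vote among self-consistency samples is sufficiently dominant.

```python
from collections import Counter

def select_response(sampled_answers, threshold=0.5):
    """Confidence-thresholded self-consistency (illustrative sketch).

    Given multiple answers sampled for the same question, return the
    majority answer only if its vote share meets `threshold`; otherwise
    return None to signal abstention (avoiding the negative-marking
    penalty for a likely-wrong answer).
    """
    if not sampled_answers:
        return None
    counts = Counter(sampled_answers)
    best_answer, votes = counts.most_common(1)[0]
    confidence = votes / len(sampled_answers)
    return best_answer if confidence >= threshold else None

# A dominant majority clears the threshold, so the model answers:
select_response(["B", "B", "B", "C"])      # -> "B"
# A four-way split does not, so the model abstains:
select_response(["A", "B", "C", "D"])      # -> None
```

In this sketch, the threshold trades off expected gain from correct answers against the expected loss from negative marking; its optimal value would depend on the exam's specific marking scheme.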
Related papers
- HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics [1.5716764919736026]
We introduce HARDMath, a dataset featuring challenging applied mathematics problems that require analytical approximation techniques.
Our framework auto-generates a large number of problems with solutions validated against numerical ground truths.
We evaluate both open- and closed-source LLMs on HARDMath-mini, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts.
arXiv Detail & Related papers (2024-10-13T20:09:41Z)
- BEATS: Optimizing LLM Mathematical Capabilities with BackVerify and Adaptive Disambiguate based Efficient Tree Search [22.672130194493793]
Large Language Models (LLMs) have exhibited exceptional performance across a broad range of tasks and domains.
However, they still encounter difficulties in solving mathematical problems due to the rigorous and logical nature of mathematics.
We propose a novel approach, BEATS, to enhance mathematical problem-solving abilities.
arXiv Detail & Related papers (2024-09-26T15:47:42Z)
- We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? [11.858791083851447]
We introduce WE-MATH, the first benchmark designed to explore the problem-solving principles beyond end-to-end performance.
We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity.
We conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between solving steps and problem-specific performance.
arXiv Detail & Related papers (2024-07-01T13:39:08Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
- Can Language Models Solve Olympiad Programming? [40.54366634332231]
This paper introduces the USACO benchmark with 307 problems from the USA Computing Olympiad.
We construct and test a range of LM inference methods for competitive programming for the first time.
We find GPT-4 achieves only an 8.7% pass@1 accuracy with zero-shot chain-of-thought prompting.
arXiv Detail & Related papers (2024-04-16T23:27:38Z)
- GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently observed failure is that when math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
- Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models [81.27391252152199]
Large language models (LLMs) have achieved impressive performance across various natural language benchmarks.
We propose to automate dataset updating and provide systematic analysis regarding its effectiveness.
There are two updating strategies: 1) a mimicking strategy that generates similar samples based on the original data, and 2) an extending strategy that further expands existing samples.
arXiv Detail & Related papers (2024-02-19T07:15:59Z)
- CHAMP: A Competition-level Dataset for Fine-Grained Analyses of LLMs' Mathematical Reasoning Capabilities [25.857946070979576]
Concept and Hint-Annotated Math Problems (CHAMP) consists of high school math competition problems annotated with concepts.
This benchmark is difficult, with the best model only scoring 58.1% in standard settings.
We find that models often arrive at the correct final answer through wrong reasoning steps.
arXiv Detail & Related papers (2024-01-13T03:18:16Z)
- Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces.
We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering aspects such as problems' release time, difficulty, and the types of errors encountered.
Surprisingly, the perceived performance of GPT-4 has experienced a cliff-like decline on problems released after September 2021, consistently across all difficulties and problem types.
arXiv Detail & Related papers (2023-12-04T18:58:57Z)
- SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce SciBench, an expansive benchmark suite for Large Language Models (LLMs).
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z)