Related papers: HARP: A challenging human-annotated math reasoning benchmark

HARP: A challenging human-annotated math reasoning benchmark

URL: http://arxiv.org/abs/2412.08819v1
Date: Wed, 11 Dec 2024 23:31:06 GMT
Title: HARP: A challenging human-annotated math reasoning benchmark
Authors: Albert S. Yue, Lovish Madaan, Ted Moskovitz, DJ Strouse, Aaditya K. Singh,
Abstract summary: We introduce HARP, Human Annotated Reasoning Problems (for Math), consisting of 5,409 problems from the US national math competitions (A(J)HSME, AMC, AIME, USA(J)MO).<n>Of these, 4,780 have answers that are automatically check-able (with libraries such as SymPy).<n>These problems range six difficulty levels, with frontier models performing relatively poorly on the hardest bracket of 197 problems (average accuracy 41.1% for o1-mini, and 9.6% for Gemini 1.5 Pro).<n>Our dataset also features multiple choices (for 4,110 problems) and an average of two human-written
Score: 7.691786865279827
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Math reasoning is becoming an ever increasing area of focus as we scale large language models. However, even the previously-toughest evals like MATH are now close to saturated by frontier models (90.0% for o1-mini and 86.5% for Gemini 1.5 Pro). We introduce HARP, Human Annotated Reasoning Problems (for Math), consisting of 5,409 problems from the US national math competitions (A(J)HSME, AMC, AIME, USA(J)MO). Of these, 4,780 have answers that are automatically check-able (with libraries such as SymPy). These problems range six difficulty levels, with frontier models performing relatively poorly on the hardest bracket of 197 problems (average accuracy 41.1% for o1-mini, and 9.6% for Gemini 1.5 Pro). Our dataset also features multiple choices (for 4,110 problems) and an average of two human-written, ground-truth solutions per problem, offering new avenues of research that we explore briefly. We report evaluations for many frontier models and share some interesting analyses, such as demonstrating that frontier models across families intrinsically scale their inference-time compute for more difficult problems. Finally, we open source all code used for dataset construction (including scraping) and all code for evaluation (including answer checking) to enable future research at: https://github.com/aadityasingh/HARP.

Related papers

BeyondBench: Benchmark-Free Evaluation of Reasoning in Language Models [13.380359214677176]
We introduce BeyondBench, an evaluation framework that avoids contamination from internet-scale training data.<n>Our framework covers 44 algorithmic tasks with a total of 117 variations, grouped into three difficulty levels.<n>We evaluate 101 language models, including 85 open-source and 16 closed-source models, spanning sizes from 0.5B to 141B parameters.
arXiv Detail & Related papers (2025-09-29T02:49:01Z)
WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning [51.13280433665446]
Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics.<n>In wireless communications, where problems require precise manipulation of information-theoretic bounds, even state-of-the-art models struggle to achieve competent performance.<n>We present WirelessMathLM, demonstrating that compact models (0.5B-7B parameters) can match or exceed much larger models through domain-specific reinforcement learning.
arXiv Detail & Related papers (2025-09-27T09:58:03Z)
STORM-BORN: A Challenging Mathematical Derivations Dataset Curated via a Human-in-the-Loop Multi-Agent Framework [25.911928883289512]
STORM-BORN is an ultra-challenging dataset of mathematical derivations sourced from cutting-edge academic papers.<n>We curated a set of 2,000 synthetic samples and deliberately selected the 100 most difficult problems.<n>Fine-tuning on STORM-BORN boosts accuracy by 7.84% (LLaMA3-8B) and 9.12% (Qwen2.5-7B)
arXiv Detail & Related papers (2025-06-02T10:48:49Z)
Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? [59.418994222096885]
We conduct a detailed analysis of model performance on the AIME24 dataset. We categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard) We find that progression from Easy to Medium tier requires adopting an R1 reasoning style with minimal SFT-1K instances. Exh-level questions present a fundamentally different challenge; they require unconventional problem-solving skills.
arXiv Detail & Related papers (2025-04-16T03:39:38Z)
SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers [17.326575243638437]
Large Language Models (LLMs) have achieved human-level proficiency across diverse tasks, but their ability to perform rigorous mathematical problem solving remains an open challenge. In this work, we investigate a fundamental yet intractable problem: determining whether a given 1.8% is nonnegative. Our findings highlight the potential of LLMs to push the boundaries of mathematical reasoning and tackle NP-hard problems.
arXiv Detail & Related papers (2025-02-27T21:41:43Z)
MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [90.07275414500154]
We observe significant performance drops on MATH-P-Hard across various models. We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills.
arXiv Detail & Related papers (2025-02-10T13:31:46Z)
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models. We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z)
ProcessBench: Identifying Process Errors in Mathematical Reasoning [62.80402845414901]
We introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning.<n>ProcessBench consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems.<n>We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models.
arXiv Detail & Related papers (2024-12-09T15:11:40Z)
UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts [7.856746367263317]
This paper introduces the UTMath Benchmark, a robust evaluation framework designed to assess Large Language Models. It comprises 1,053 cutting-edge problems spanning nine mathematical domains, with an average of 68 test cases per problem. The best-performing model, o1-mini, solving only 32.57% of the problems, followed by o1-preview at 27.16%, and GPT-4o at 26.93%.
arXiv Detail & Related papers (2024-11-11T18:59:02Z)
Uncovering Weaknesses in Neural Code Generation [21.552898575210534]
We assess the quality of generated code using match-based and execution-based metrics, then conduct thematic analysis to develop a taxonomy of nine types of weaknesses. In the CoNaLa dataset, inaccurate prompts are a notable problem, causing all large models to fail in 26.84% of cases. Missing pivotal semantics is a pervasive issue across benchmarks, with one or more large models omitting key semantics in 65.78% of CoNaLa tasks. All models struggle with proper API usage, a challenge amplified by vague or complex prompts.
arXiv Detail & Related papers (2024-07-13T07:31:43Z)
Can Language Models Solve Olympiad Programming? [40.54366634332231]
This paper introduces the USACO benchmark with 307 problems from the USA Computing Olympiad. We construct and test a range of LM inference methods for competitive programming for the first time. We find GPT-4 only achieves a 8.7% pass@1 accuracy with zero-shot chain-of-thought prompting.
arXiv Detail & Related papers (2024-04-16T23:27:38Z)
Common 7B Language Models Already Possess Strong Math Capabilities [61.61442513067561]
This paper shows that the LLaMA-2 7B model with common pre-training already exhibits strong mathematical abilities. The potential for extensive scaling is constrained by the scarcity of publicly available math questions.
arXiv Detail & Related papers (2024-03-07T18:00:40Z)
GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving [40.46491587796371]
We introduce the GeoEval benchmark, a comprehensive collection that includes a main subset of 2,000 problems, a 750 problems subset focusing on backward reasoning, an augmented subset of 2,000 problems, and a hard subset of 300 problems. Our evaluation of ten LLMs and MMs across these varied subsets reveals that the WizardMath model excels, achieving a 55.67% accuracy rate on the main subset but only a 6.00% accuracy on the hard subset.
arXiv Detail & Related papers (2024-02-15T16:59:41Z)
TinyGSM: achieving >80% on GSM8k with small language models [49.21136294791747]
Small-scale models offer various computational advantages, and yet to which extent size is critical for problem-solving abilities remains an open question. Specifically for solving grade school math, the smallest model size so far required to break the 80% barrier on the GSM8K benchmark remains to be 34B. Our work studies how high-quality datasets may be the key for small language models to acquire mathematical reasoning.
arXiv Detail & Related papers (2023-12-14T18:58:28Z)
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning [52.97768001837269]
We present a method to fine-tune open-source language models, enabling them to use code for modeling and deriving math equations. We propose a method of generating novel and high-quality datasets with math problems and their code-based solutions. This approach yields the MathCoder models, a family of models capable of generating code-based solutions for solving challenging math problems.
arXiv Detail & Related papers (2023-10-05T17:52:09Z)
GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning [172.36214872466707]
We focus on solving geometric problems, which requires a comprehensive understanding of textual descriptions, visual diagrams, and theorem knowledge. We propose a Geometric Question Answering dataset GeoQA, containing 5,010 geometric problems with corresponding annotated programs.
arXiv Detail & Related papers (2021-05-30T12:34:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.