EvolMathEval: Towards Evolvable Benchmarks for Mathematical Reasoning via Evolutionary Testing
- URL: http://arxiv.org/abs/2508.13003v2
- Date: Sun, 05 Oct 2025 08:41:52 GMT
- Title: EvolMathEval: Towards Evolvable Benchmarks for Mathematical Reasoning via Evolutionary Testing
- Authors: Shengbo Wang, Mingwei Liu, Zike Li, Anji Li, Yanlin Wang, Xin Peng, Zibin Zheng
- Abstract summary: EvolMathEval is an automated mathematical benchmark generation and evolution framework based on evolutionary testing. It can generate a large volume of high-difficulty problems through continuous self-iteration. It can also significantly enhance the complexity of public datasets like GSM8K through evolution, reducing model accuracy by an average of 48%.
- Score: 45.89558878854675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of Large Language Models (LLMs) poses a significant challenge to existing mathematical reasoning benchmarks: they tend to become easier over time as LLMs learn from published benchmark data, which hinders the precise evaluation of the true capabilities of SOTA models. To address this challenge, this paper introduces EvolMathEval, an automated mathematical benchmark generation and evolution framework based on evolutionary testing. Experimental results demonstrate that EvolMathEval can not only generate a large volume of high-difficulty problems through continuous self-iteration, but also significantly enhance the complexity of public datasets like GSM8K through evolution, reducing model accuracy by an average of 48%. Deeper investigation reveals that when solving these evolved problems, LLMs tend to bypass complex multi-step logical reasoning by relying on simplistic and fuzzy conditions, consequently arriving at incorrect solutions. We define this phenomenon as the "Pseudo Aha Moment", which we find accounts for 77% to 100% of errors on targeted problems. Code and resources are available at: https://anonymous.4open.science/r/EvolMathEval
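The abstract does not spell out the evolution loop, but evolutionary testing typically follows a mutate-and-select cycle. A minimal Python sketch of that shape, where `mutate` (e.g., an LLM rewriting step that adds variables or distractor conditions) and `fitness` (e.g., the target model's failure rate on a problem) are hypothetical callbacks rather than the paper's actual API:

```python
def evolve_benchmark(seed_problems, mutate, fitness, generations=5, pop_size=50):
    """Generic evolutionary-testing loop (a sketch, not EvolMathEval's code):
    repeatedly mutate problems and keep the hardest survivors, where
    'hardest' is whatever the caller's fitness function measures."""
    population = list(seed_problems)
    for _ in range(generations):
        # Mutation: rewrite each problem into a candidate harder variant.
        offspring = [mutate(p) for p in population]
        # Selection: rank parents and offspring by difficulty, keep the top-k.
        population = sorted(population + offspring, key=fitness, reverse=True)[:pop_size]
    return population
```

Continuous self-iteration then amounts to feeding each generation's survivors back in as the next generation's seeds.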
Related papers
- DéjàQ: Open-Ended Evolution of Diverse, Learnable and Verifiable Problems [19.381443841718596]
We introduce DéjàQ, a framework that evolves a diverse set of synthetic mathematical problems alongside model training. This evolutionary process adapts to the model's ability throughout training, optimising problems for learnability. We find that the model can generate novel and meaningful problems, and that these LLM-driven mutations improve RL training.
arXiv Detail & Related papers (2026-01-05T09:27:49Z) - RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning [26.91583214616048]
Large language models (LLMs) achieve high performance on mathematical reasoning. Current rule-based perturbation methods often generate ill-posed questions. We propose RIDE, a novel adversarial question-rewriting framework.
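The summary only names Item Response Theory; for reference, the standard two-parameter logistic (2PL) IRT model that such difficulty estimation typically builds on looks like this (a sketch, not RIDE's actual estimator):

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL IRT model: probability that a solver with ability `theta`
    answers an item with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Raising the item difficulty b lowers a fixed model's success probability:
print(p_correct(theta=1.0, a=1.0, b=0.0))  # ~0.73
print(p_correct(theta=1.0, a=1.0, b=2.0))  # ~0.27
```

Under this model, "difficulty evolving" perturbation can be read as searching for rewrites that push an item's estimated b upward.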
arXiv Detail & Related papers (2025-11-06T07:10:17Z) - MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems. We build a benchmark consisting of 1,260 samples across 42 challenging synthetic tasks. We generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z) - Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding [59.60915947702282]
Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in enhancing the reasoning capabilities of large language models (LLMs). Existing RLVR methods often suffer from exploration inefficiency due to mismatches between the training data's difficulty and the model's capability. We propose SEELE, a novel supervision-aided RLVR framework that dynamically adjusts problem difficulty to stay within the high-efficiency region.
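One way to picture capability-adaptive scaffolding is a feedback controller on how much of a reference solution is revealed as a hint. The update rule below is an illustrative sketch with hypothetical values, not SEELE's actual mechanism:

```python
def adjust_hint_ratio(hint_ratio: float, success_rate: float,
                      target: float = 0.5, step: float = 0.1) -> float:
    """Reveal more of the solution when the model fails too often and less
    when the problem is already easy, keeping the observed success rate
    near a high-efficiency target."""
    if success_rate < target:
        return min(1.0, hint_ratio + step)   # too hard: scaffold more
    if success_rate > target:
        return max(0.0, hint_ratio - step)   # too easy: scaffold less
    return hint_ratio
```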
arXiv Detail & Related papers (2025-09-08T17:36:21Z) - OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization [88.76091817642963]
Recent large-scale language models (LLMs) with long Chain-of-Thought reasoning, such as DeepSeek-R1, have achieved impressive results on Olympiad-level mathematics. We introduce OMEGA (Out-of-distribution Math Problems Evaluation with 3 Generalization Axes), a benchmark designed to evaluate three axes of out-of-distribution generalization.
arXiv Detail & Related papers (2025-06-23T17:51:40Z) - RV-Syn: Rational and Verifiable Mathematical Reasoning Data Synthesis based on Structured Function Library [58.404895570822184]
RV-Syn is a novel rational and verifiable mathematical reasoning data synthesis approach. It generates graphs as solutions by combining Python-formatted functions from a structured function library. Based on the constructed graph, we achieve solution-guided, logic-aware problem generation.
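As a rough illustration of the idea (hypothetical library entries and composition rule, not RV-Syn's), chaining callable functions yields a solution whose every step is executable and therefore checkable:

```python
import random

# Hypothetical stand-ins for entries in a structured function library.
LIBRARY = {
    "scale": lambda x, k: x * k,
    "add_fee": lambda x, fee: x + fee,
}

def sample_solution_chain(depth: int = 2):
    """Toy sketch: feed each step's output into the next library call,
    producing a verifiable multi-step solution trace."""
    value = random.randint(2, 9)
    steps = []
    for _ in range(depth):
        name = random.choice(list(LIBRARY))
        arg = random.randint(2, 9)
        value = LIBRARY[name](value, arg)
        steps.append((name, arg, value))
    return steps, value

print(sample_solution_chain())
```

A problem statement would then be generated backward from such a trace, which is presumably what "solution-guided" refers to.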
arXiv Detail & Related papers (2025-04-29T04:42:02Z) - Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z) - SuperARC: An Agnostic Test for Narrow, General, and Super Intelligence Based On the Principles of Recursive Compression and Algorithmic Probability [0.14061979259370275]
We introduce an open-ended test grounded in algorithmic probability. It can avoid benchmark contamination in the quantitative evaluation of frontier models. We prove that compression is equivalent and directly proportional to a system's predictive power.
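The compression-prediction link can be made concrete with an off-the-shelf compressor standing in for (uncomputable) algorithmic probability; this is a standard approximation, not SuperARC's exact procedure:

```python
import zlib

def predict_next(history: bytes, alphabet: bytes = b"01") -> bytes:
    """Predict the continuation that gives the shortest compressed
    description of history + symbol: better compression of a sequence
    implies better prediction of its next element."""
    def clen(s: bytes) -> int:
        return len(zlib.compress(s, 9))
    return min((bytes([c]) for c in alphabet), key=lambda s: clen(history + s))

print(predict_next(b"01010101010101010101"))  # b'0', continuing the period-2 pattern
```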
arXiv Detail & Related papers (2025-03-20T23:11:30Z) - MetaLadder: Ascending Mathematical Solution Quality via Analogical-Problem Reasoning Transfer [37.81465564673498]
Large Language Models (LLMs) have demonstrated promising capabilities in solving mathematical reasoning tasks. We propose MetaLadder, a framework that explicitly prompts LLMs to recall and reflect on meta-problems. Our experiments on mathematical benchmarks demonstrate that MetaLadder significantly boosts LLMs' problem-solving accuracy.
arXiv Detail & Related papers (2025-03-19T04:36:35Z) - PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [59.920971312822736]
We introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems. The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction. Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods.
arXiv Detail & Related papers (2025-03-04T06:32:30Z) - FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving [90.88021670297664]
FINEREASON is a logic-puzzle benchmark for evaluating large language models' reasoning capabilities. We introduce two tasks, state checking and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move. We show that models trained on our state checking and transition data demonstrate gains in math reasoning by up to 5.1% on GSM8K.
arXiv Detail & Related papers (2025-02-27T16:23:25Z) - Iterative Deepening Sampling as Efficient Test-Time Scaling [27.807695570974644]
Recent reasoning models, such as OpenAI's O1 series, have demonstrated exceptional performance on complex reasoning tasks. We propose a novel iterative deepening sampling framework designed to enhance self-correction and generate higher-quality samples.
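The name suggests an escalation schedule analogous to iterative deepening search. A minimal sketch under that reading (`generate` and `verify` are hypothetical callbacks; the paper's actual schedule is not given in the summary):

```python
def iterative_deepening_sample(generate, verify, budgets=(1, 2, 4, 8)):
    """Draw a small batch of samples first, and escalate the sampling
    budget only when no sample passes self-verification."""
    candidates = []
    for k in budgets:
        candidates = [generate() for _ in range(k)]
        passing = [c for c in candidates if verify(c)]
        if passing:
            return passing[0]      # easy problems exit at a small budget
    return candidates[-1]          # otherwise fall back to the last attempt
```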
arXiv Detail & Related papers (2025-02-08T04:39:51Z) - An Evolutionary Large Language Model for Hallucination Mitigation [0.0]
We propose EvoLLMs, which automates the generation of high-quality question-answering datasets while minimizing hallucinations. EvoLLMs consistently outperforms human-generated datasets in key metrics such as Depth, Relevance, and Coverage. These results highlight EvoLLMs as a robust and efficient solution for QA dataset generation, significantly reducing the time and resources required for manual curation.
arXiv Detail & Related papers (2024-12-03T19:40:13Z) - ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection.
ErrorRadar evaluates two sub-tasks: error step identification and error categorization.
It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions.
Results indicate that significant challenges remain: even the best-performing model, GPT-4o, still trails human performance by around 10%.
arXiv Detail & Related papers (2024-10-06T14:59:09Z) - DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
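Concretely, one can picture a GSM8K-style problem as a small DAG of arithmetic steps; inserting a node then deepens the required reasoning chain. The toy sketch below illustrates that shape only; DARG's actual graph extraction and perturbation operators are richer:

```python
import random

def perturb_graph(graph: dict) -> dict:
    """Toy DARG-style perturbation: add one intermediate reasoning step
    that consumes an existing node, increasing problem complexity while
    keeping the graph well-defined."""
    g = dict(graph)
    parent = random.choice(list(g))
    g[f"v{len(g)}"] = {"op": "add", "args": [parent, 3]}
    return g

# Each node names an operation over earlier nodes or constants.
seed = {"v0": {"op": "const", "args": [5]},
        "v1": {"op": "mul", "args": ["v0", 2]}}
print(perturb_graph(seed))
```

A new problem text with matching numbers would then be generated from the perturbed graph, preserving linguistic style while raising complexity.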
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models [23.344490944210456]
We present 515Bench, a more challenging benchmark dataset for evaluating the problem-solving abilities of large language models (LLMs).
We curate challenging pre-engineering mathematics, physics and chemistry problems from the highly competitive IIT JEE-Advanced exam.
Our evaluation on various open-source and proprietary models reveals that the highest performance, even after using techniques like self-consistency, self-refinement and chain-of-thought prompting, is less than 40%.
arXiv Detail & Related papers (2023-05-24T11:55:59Z) - Generalization of Neural Combinatorial Solvers Through the Lens of Adversarial Robustness [68.97830259849086]
Most datasets only capture a simpler subproblem and likely suffer from spurious features.
We study adversarial robustness - a local generalization property - to reveal hard, model-specific instances and spurious features.
Unlike in other applications, where perturbation models are designed around subjective notions of imperceptibility, our perturbation models are efficient and sound.
Surprisingly, with such perturbations, a sufficiently expressive neural solver does not suffer from the limitations of the accuracy-robustness trade-off common in supervised learning.
arXiv Detail & Related papers (2021-10-21T07:28:11Z)