MATH-Beyond: A Benchmark for RL to Expand Beyond the Base Model
- URL: http://arxiv.org/abs/2510.11653v1
- Date: Mon, 13 Oct 2025 17:30:54 GMT
- Title: MATH-Beyond: A Benchmark for RL to Expand Beyond the Base Model
- Authors: Prasanna Mayilvahanan, Ricardo Dominguez-Olmedo, Thaddäus Wiedemer, Wieland Brendel
- Abstract summary: We introduce MATH-Beyond (MATH-B), a benchmark constructed to defeat common open-source models of up to 8B parameters under large sampling budgets. Since the problems are drawn from subsets of DAPO-Math-17K and DeepScaleR datasets, they remain topically equivalent to standard high-school math. We hope MATH-B will catalyze exploration-driven RL approaches that elicit deeper reasoning capabilities.
- Score: 30.62638603067356
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the advent of DeepSeek-R1, a new wave of reinforcement learning (RL) methods has emerged that seem to unlock stronger mathematical reasoning. However, a closer look at the open-source ecosystem reveals a critical limitation: with sufficiently many draws (e.g., $\texttt{pass@1024}$), many existing base models already solve nearly all questions on widely used math benchmarks such as MATH-500 and AIME 2024. This suggests that the RL fine-tuning methods prevalent in the LLM reasoning literature largely sharpen existing solution modes rather than discovering entirely new ones. Such sharpening stands in contrast to the broader promise of RL: to foster exploration and to acquire new skills. To move beyond this plateau, we introduce MATH-Beyond (MATH-B), a benchmark deliberately constructed to defeat common open-source models of up to 8B parameters even under large sampling budgets. Improving performance on our benchmark via RL requires methods that learn to reason in ways that go beyond base model capabilities in repeated sampling. Since the problems are drawn from subsets of DAPO-Math-17K and DeepScaleR datasets, they remain topically equivalent to standard high-school math. Validating our premise, RL fine-tuned models such as Nemotron-Research-Reasoning-Qwen-1.5B and DeepScaleR-1.5B-Preview perform poorly on MATH-B at $\texttt{pass@1024}$, showing how existing approaches fall short on tackling harder instances. We hope MATH-B will catalyze exploration-driven RL approaches that elicit deeper reasoning capabilities. We release MATH-B at https://huggingface.co/datasets/brendel-group/MATH-Beyond.
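The abstract's argument rests on pass@k under large sampling budgets (e.g., pass@1024). As a minimal sketch, the metric can be estimated with the commonly used unbiased combinatorial estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct. This is an assumption for illustration: the abstract does not say which estimator the authors use, and the function names below are hypothetical.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct.

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    Assumption: this is a common choice, not confirmed as the paper's method.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average the per-problem pass@k estimates over a benchmark."""
    if not correct_counts:
        raise ValueError("correct_counts must be non-empty")
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # A problem that is never solved in 1024 draws contributes 0 to pass@1024.
    print(pass_at_k(n=1024, c=0, k=1024))
    # Toy benchmark: three problems, 4 samples each, with 0, 1, and 4 correct.
    print(benchmark_pass_at_k([0, 1, 4], n=4, k=2))
```

Under this estimator, a problem contributes zero to pass@k exactly when none of the n samples is correct, which is the regime the abstract describes as defeating a model even under a large sampling budget.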
Related papers
- You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models [12.14455026524814]
We investigate the generalizability of label-free RL approaches to base models with limited reasoning capabilities. We find that label-free RL is highly dependent on the base model's pre-existing reasoning capability. We propose a simple yet effective method for label-free RL that utilizes curriculum learning to progressively introduce harder problems.
arXiv Detail & Related papers (2025-11-07T01:05:11Z)
- Reasoning with Sampling: Your Base Model is Smarter Than You Think [52.639108524651846]
We propose a simple iterative sampling algorithm leveraging the base models' own likelihoods. We show that our algorithm offers substantial boosts in reasoning that nearly match and even outperform those from RL. Our method does not require training, curated datasets, or a verifier.
arXiv Detail & Related papers (2025-10-16T17:18:11Z)
- QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation [27.56280364505776]
Reinforcement learning (RL) has emerged as a central paradigm for training large language models (LLMs) in reasoning tasks. Recent studies question RL's ability to incentivize reasoning capacity beyond the base model. We propose a simple yet effective strategy via Question Augmentation: introduce partial solutions during training to reduce problem difficulty.
arXiv Detail & Related papers (2025-07-17T16:21:47Z)
- Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning [75.31797502976802]
We evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks. We find that most models that succeed in math fail to transfer their gains to other domains. Our results suggest a need to rethink standard post-training recipes.
arXiv Detail & Related papers (2025-07-01T05:23:05Z)
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models [89.37819814048288]
We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in language models.
arXiv Detail & Related papers (2025-05-30T17:59:01Z)
- Maximizing Confidence Alone Improves Reasoning [48.83927980325788]
RENT: Reinforcement Learning via Entropy Minimization is a fully unsupervised RL method that requires no external reward or ground-truth answers. We find that by reinforcing the chains of thought that yield high model confidence on its generated answers, the model improves its reasoning ability.
arXiv Detail & Related papers (2025-05-28T17:59:37Z)
- AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong, small- and mid-sized models. We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z)
- DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning [95.31714779585272]
DeepMath-103K is a large-scale mathematical dataset designed with high difficulty (primarily levels 5-9). It includes rigorous decontamination against numerous benchmarks, and verifiable answers for rule-based RL reward. DeepMath-103K fosters the development of generalizable and advancing reasoning.
arXiv Detail & Related papers (2025-04-15T17:59:51Z)
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [65.2421542320293]
Reasoning abilities are crucial components of general intelligence. Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks.
arXiv Detail & Related papers (2025-02-10T18:57:29Z)