VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks
- URL: http://arxiv.org/abs/2507.12885v1
- Date: Thu, 17 Jul 2025 08:10:55 GMT
- Title: VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks
- Authors: Jian Yao, Ran Cheng, Kay Chen Tan
- Abstract summary: Benchmark contamination arises from the public availability of test problems. Evaluation fragility stems from the reliance on single-instance assessments. VAR-MATH is a symbolic evaluation framework designed to probe genuine reasoning ability.
- Score: 25.295071827427677
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in reinforcement learning (RL) have led to substantial improvements in the mathematical reasoning abilities of large language models (LLMs), as measured by standard benchmarks. However, these gains often persist even when models are trained with flawed signals, such as random or inverted rewards, raising a fundamental question: do such improvements reflect true reasoning, or are they merely artifacts of overfitting to benchmark-specific patterns? To address this question, we take an evaluation-centric perspective and identify two critical shortcomings in existing protocols. First, benchmark contamination arises from the public availability of test problems, increasing the risk of data leakage. Second, evaluation fragility stems from the reliance on single-instance assessments, which are highly sensitive to stochastic outputs and fail to capture reasoning consistency. To overcome these limitations, we introduce VAR-MATH, a symbolic evaluation framework designed to probe genuine reasoning ability. By converting fixed numerical problems into symbolic templates and requiring models to solve multiple instantiations of each, VAR-MATH enforces consistent reasoning across structurally equivalent variants, thereby mitigating contamination and improving evaluation robustness. We apply VAR-MATH to transform two popular benchmarks, AMC23 and AIME24, into their symbolic counterparts, VAR-AMC23 and VAR-AIME24. Experimental results reveal substantial performance drops for RL-trained models on the variabilized versions, especially for smaller models, with average declines of 48.0% on AMC23 and 58.3% on AIME24. These findings suggest that many existing RL methods rely on superficial heuristics and fail to generalize beyond specific numerical forms. Overall, VAR-MATH offers a principled, contamination-resistant evaluation paradigm for mathematical reasoning.
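The core idea, replacing each fixed numerical problem with a symbolic template and crediting a model only when it answers every sampled instantiation correctly, can be illustrated with a minimal sketch. The toy problem, parameter ranges, and all-or-nothing scoring rule below are illustrative assumptions, not the authors' released implementation.

```python
import random

def make_instance(a: int, b: int) -> tuple[str, int]:
    """Instantiate a hypothetical templated problem and its exact answer."""
    question = f"A rectangle has integer side lengths {a} and {b}. What is its area?"
    return question, a * b

def var_math_score(model_answer, n_instances: int = 3, seed: int = 0) -> float:
    """All-or-nothing scoring: credit 1.0 only if every sampled variant is solved."""
    rng = random.Random(seed)
    for _ in range(n_instances):
        a, b = rng.randint(2, 50), rng.randint(2, 50)
        question, truth = make_instance(a, b)
        if model_answer(question) != truth:
            return 0.0
    return 1.0

def toy_model(question: str) -> int:
    """Stand-in for an LLM call: parse the two numbers and multiply them."""
    tokens = question.replace(".", " ").replace("?", " ").split()
    nums = [int(t) for t in tokens if t.isdigit()]
    return nums[0] * nums[1]

print(var_math_score(toy_model))  # 1.0 only if the reasoning holds on all variants
```

A solver that merely recalls the answer to one memorized instantiation would score 0.0 under this rule, which is the gap between memorization and reasoning that the benchmark aims to surface.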
Related papers
- REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once [33.049237516125146]
We present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes Large Reasoning Models to multiple problems simultaneously. Our evaluation reveals several striking findings: even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing.
arXiv Detail & Related papers (2025-07-14T17:58:47Z)
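A REST-style stress test amounts to packing several independent problems into a single query and requiring labeled answers to all of them. The numbered-prompt format below is an assumption for illustration; the paper's exact prompt template is not reproduced here.

```python
def build_stress_prompt(problems: list[str]) -> str:
    """Concatenate several independent problems into one numbered query."""
    header = "Solve all of the following problems. Label your answers (1), (2), ...\n"
    body = "\n".join(f"({i}) {p}" for i, p in enumerate(problems, start=1))
    return header + body

print(build_stress_prompt([
    "What is 17 * 24?",
    "Find the remainder when 2**10 is divided by 7.",
    "How many primes are less than 20?",
]))
```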
- Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination [68.54308967669795]
We show that Qwen2.5 achieves strong mathematical reasoning performance, but its pretraining on large-scale web corpora makes it vulnerable to data contamination. We introduce a generator that produces fully synthetic arithmetic problems of arbitrary length and difficulty, yielding a clean dataset we call RandomCalculation. Using these leakage-free datasets, we show that only accurate reward signals consistently improve performance, while noisy or incorrect signals do not.
arXiv Detail & Related papers (2025-07-14T17:55:15Z)
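A leakage-free arithmetic benchmark of this kind can be generated programmatically. The sketch below shows one plausible recipe; the difficulty scheme, operator set, and function names are assumptions, not the RandomCalculation release.

```python
import random

def random_expression(n_terms: int, max_value: int, rng: random.Random) -> str:
    """Build a random arithmetic expression with n_terms operands."""
    expr = str(rng.randint(1, max_value))
    for _ in range(n_terms - 1):
        expr += f" {rng.choice(['+', '-', '*'])} {rng.randint(1, max_value)}"
    return expr

def make_problem(difficulty: int, seed: int) -> tuple[str, int]:
    """Difficulty controls expression length and operand size; the answer is exact."""
    rng = random.Random(seed)
    expr = random_expression(n_terms=2 + difficulty, max_value=10 ** difficulty, rng=rng)
    return f"Compute {expr}.", eval(expr)  # safe: the expression string is built locally

question, answer = make_problem(difficulty=2, seed=42)
print(question, "->", answer)  # freshly generated, so it cannot appear in pretraining data
```

Because the ground-truth answer is computed exactly, such problems also provide a clean reward signal, which is how the paper isolates the effect of accurate versus noisy rewards.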
- RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation [15.205635488139043]
We introduce RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in Large Language Models (LLMs). By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations.
arXiv Detail & Related papers (2025-06-18T13:35:47Z)
- T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation [60.620408007636016]
We propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores. Our approach integrates Group Relative Policy Optimization into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains.
arXiv Detail & Related papers (2025-05-23T13:44:59Z)
- Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the large language model (LLM) era is the generalization issue. We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
- A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility [29.437125712259046]
Reasoning has emerged as the next major frontier for language models (LMs). We conduct a comprehensive empirical study and find that current mathematical reasoning benchmarks are highly sensitive to subtle implementation choices. We propose a standardized evaluation framework with clearly defined best practices and reporting standards.
arXiv Detail & Related papers (2025-04-09T17:58:17Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark, MR-Ben, that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- A comprehensive comparative evaluation and analysis of Distributional Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict-based models is more apparent than real, and surely not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z)