MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy
- URL: http://arxiv.org/abs/2508.05592v1
- Date: Thu, 07 Aug 2025 17:32:14 GMT
- Title: MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy
- Authors: Shaoxiong Zhan, Yanlin Lai, Ziyu Lu, Dahua Lin, Ziqing Yang, Fei Tang
- Abstract summary: MathSmith is a novel framework for synthesizing challenging mathematical problems to enhance LLM reasoning. Rather than modifying existing problems, MathSmith constructs new ones from scratch by randomly sampling concept-explanation pairs from PlanetMath. To increase difficulty, we design nine predefined strategies as soft constraints during rationale generation. Experiments show MathSmith consistently outperforms existing baselines under both short and long CoT settings.
- Score: 43.86485569038631
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have achieved substantial progress in mathematical reasoning, yet their advancement is limited by the scarcity of high-quality, high-difficulty training data. Existing synthesis methods largely rely on transforming human-written templates, limiting both diversity and scalability. We propose MathSmith, a novel framework for synthesizing challenging mathematical problems to enhance LLM reasoning. Rather than modifying existing problems, MathSmith constructs new ones from scratch by randomly sampling concept-explanation pairs from PlanetMath, ensuring data independence and avoiding contamination. To increase difficulty, we design nine predefined strategies as soft constraints during rationale generation. We further adopt reinforcement learning to jointly optimize structural validity, reasoning complexity, and answer consistency. The length of the reasoning trace generated under autoregressive prompting is used to reflect cognitive complexity, encouraging the creation of more demanding problems aligned with long-chain-of-thought reasoning. Experiments across five benchmarks, categorized as easy & medium (GSM8K, MATH-500) and hard (AIME2024, AIME2025, OlympiadBench), show that MathSmith consistently outperforms existing baselines under both short and long CoT settings. Additionally, a weakness-focused variant generation module enables targeted improvement on specific concepts. Overall, MathSmith exhibits strong scalability, generalization, and transferability, highlighting the promise of high-difficulty synthetic data in advancing LLM reasoning capabilities.
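The abstract sketches a reward that jointly optimizes structural validity, reasoning complexity (proxied by the length of the autoregressive reasoning trace), and answer consistency. Below is a minimal, hypothetical sketch of how such a composite reward could be combined; the helper checks, the weights, and the length normalization are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a MathSmith-style composite reward. The helper
# checks, weights, and length cap are assumptions, not the paper's code.

def check_structure(problem: str) -> bool:
    # Placeholder validity check; in practice this would be an LLM- or
    # rule-based judgment that the problem is well-formed and solvable.
    return bool(problem.strip())

def answers_agree(answer: str, reference: str) -> bool:
    # Placeholder consistency check; a real system would normalize
    # mathematical expressions (e.g., via a CAS) before comparing.
    return answer.strip() == reference.strip()

def composite_reward(problem: str, trace: str, answer: str, reference: str,
                     max_len: int = 8192, w_valid: float = 1.0,
                     w_complex: float = 1.0, w_consist: float = 1.0) -> float:
    """Jointly score structural validity, reasoning complexity, and
    answer consistency for one synthesized problem."""
    valid = 1.0 if check_structure(problem) else 0.0
    # Capped, normalized reasoning-trace length as a proxy for the
    # cognitive complexity the abstract describes.
    complexity = min(len(trace.split()), max_len) / max_len
    consistent = 1.0 if answers_agree(answer, reference) else 0.0
    return w_valid * valid + w_complex * complexity + w_consist * consistent
```

A policy trained against such a reward is pushed toward problems that are well-formed, demanding to solve, and internally consistent, matching the abstract's stated optimization targets.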
Related papers
- WarriorMath: Enhancing the Mathematical Ability of Large Language Models with a Defect-aware Framework [42.74246647841103]
WarriorMath is a defect-aware framework for mathematical problem solving.
We employ multiple expert LLMs in a collaborative process to generate, critique, and refine problems.
In the training stage, we introduce a progressive learning framework that iteratively fine-tunes the model using increasingly challenging data tailored to its weaknesses.
arXiv Detail & Related papers (2025-08-02T07:45:12Z)
- MathFusion: Enhancing Mathematical Problem-solving of LLM through Instruction Fusion [48.443460251524776]
MathFusion is a novel framework that enhances mathematical reasoning through cross-problem instruction synthesis.
MathFusion achieves substantial improvements in mathematical reasoning while maintaining high data efficiency.
arXiv Detail & Related papers (2025-03-20T15:00:41Z)
- PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [59.920971312822736]
We introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems.
The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction.
Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods.
arXiv Detail & Related papers (2025-03-04T06:32:30Z)
- MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [90.07275414500154]
We observe significant performance drops on MATH-P-Hard across various models.
We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills.
arXiv Detail & Related papers (2025-02-10T13:31:46Z)
- MathCAMPS: Fine-grained Synthesis of Mathematical Problems From Human Curricula [33.5782208232163]
We propose MathCAMPS: a method to synthesize high-quality mathematical problems at scale.
We encode each standard in a formal grammar, allowing us to sample diverse symbolic problems and their answers.
We derive follow-up questions from symbolic structures and convert them into follow-up word problems (a toy sketch of such grammar-based sampling appears after this list).
arXiv Detail & Related papers (2024-07-01T01:56:28Z)
- GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently observed failure mode is that when math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations (a sketch of one such perturbation appears after this list).
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
- MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning [63.80739044622555]
We introduce MuSR, a dataset for evaluating language models on soft reasoning tasks specified in a natural language narrative.
This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm.
Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning.
arXiv Detail & Related papers (2023-10-24T17:59:20Z)
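As referenced in the MathCAMPS entry above, that method encodes each curriculum standard in a formal grammar and samples symbolic problems together with their answers. The following toy sketch shows grammar-based sampling for a single invented "two-step arithmetic" standard; the grammar, number ranges, and naturalization template are illustrative assumptions, not the MathCAMPS grammars.

```python
import random

# Toy grammar for one hypothetical standard: two-step arithmetic.
GRAMMAR = {
    "expr": [["num", "+", "num", "*", "num"],
             ["num", "*", "num", "-", "num"]],
}

def sample_symbolic(rng: random.Random) -> str:
    """Sample a symbolic expression by expanding the toy grammar."""
    production = rng.choice(GRAMMAR["expr"])
    return " ".join(str(rng.randint(2, 12)) if tok == "num" else tok
                    for tok in production)

def to_word_problem(expr: str) -> str:
    # Placeholder naturalization step; MathCAMPS converts symbolic
    # structures into word problems and follow-up questions.
    return f"Compute the value of {expr}."

if __name__ == "__main__":
    rng = random.Random(0)
    expr = sample_symbolic(rng)
    answer = eval(expr)  # the symbolic form yields the ground-truth answer
    print(to_word_problem(expr), "->", answer)
```

Because the answer is derived from the symbolic structure rather than from a model, every sampled problem comes with a verifiable ground truth.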
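The GSM-Plus entry above probes robustness through question variations; one variation family GSM-Plus uses is numerical substitution. The sketch below applies a simple regex-based version of it to an invented seed question, so the exact substitution rule and the seed problem are illustrative assumptions.

```python
import random
import re

def numerical_substitution(question: str, seed: int = 0) -> str:
    """Swap each integer in a question for a nearby fresh value, so a
    memorized answer to the original wording no longer applies."""
    rng = random.Random(seed)
    return re.sub(r"\d+",
                  lambda m: str(int(m.group()) + rng.randint(1, 5)),
                  question)

if __name__ == "__main__":
    seed_q = "Alice has 12 apples and buys 5 more. How many does she have?"
    for s in range(3):
        print(numerical_substitution(seed_q, seed=s))
```

Checking a model's answers across such variants, rather than on the seed question alone, is the kind of robustness evaluation the benchmark's summary describes.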