Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision
- URL: http://arxiv.org/abs/2512.15489v1
- Date: Wed, 17 Dec 2025 14:37:41 GMT
- Title: Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision
- Authors: Wei Du, Shubham Toshniwal, Branislav Kisacanin, Sadegh Mahdavi, Ivan Moshkov, George Armstrong, Stephen Ge, Edgar Minasyan, Feng Chen, Igor Gitman, et al.
- Abstract summary: Nemotron-Math is a large-scale mathematical reasoning dataset containing 7.5M solution traces. The dataset integrates 85K curated AoPS problems with 262K community-sourced StackExchange-Math problems. Nemotron-Math consistently outperforms the original OpenMathReasoning on matched AoPS problems.
- Score: 15.319195064020393
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-quality mathematical reasoning supervision requires diverse reasoning styles, long-form traces, and effective tool integration, capabilities that existing datasets provide only in limited form. Leveraging the multi-mode generation ability of gpt-oss-120b, we introduce Nemotron-Math, a large-scale mathematical reasoning dataset containing 7.5M solution traces across high, medium, and low reasoning modes, each available both with and without Python tool-integrated reasoning (TIR). The dataset integrates 85K curated AoPS problems with 262K community-sourced StackExchange-Math problems, combining structured competition tasks with diverse real-world mathematical queries. We conduct controlled evaluations to assess the dataset quality. Nemotron-Math consistently outperforms the original OpenMathReasoning on matched AoPS problems. Incorporating StackExchange-Math substantially improves robustness and generalization, especially on HLE-Math, while preserving accuracy on math competition benchmarks. To support efficient long-context training, we develop a sequential bucketed strategy that accelerates 128K context-length fine-tuning by 2-3× without significant accuracy loss. Overall, Nemotron-Math enables state-of-the-art performance, including 100% maj@16 accuracy on AIME 2024 and 2025 with Python TIR.
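The abstract names two techniques concretely enough to sketch: the sequential bucketed schedule for 128K-context fine-tuning, and the maj@16 metric. The sketch below is one illustrative reading under stated assumptions, not the paper's released implementation; the bucket boundaries, the `train_epoch` hook, and the shortest-first ordering are all assumptions filled in for the example.

```python
# Illustrative sketch of sequential bucketed long-context fine-tuning.
# Assumptions (not from the paper): buckets are defined by token-length
# caps, visited shortest-first, and each bucket pads only to its own cap
# rather than to the global 128K limit.
from collections import Counter, defaultdict

BUCKET_CAPS = [4_096, 16_384, 65_536, 131_072]  # assumed boundaries

def assign_buckets(samples):
    """Group samples under the smallest length cap that fits them."""
    buckets = defaultdict(list)
    for sample in samples:
        cap = next((c for c in BUCKET_CAPS
                    if len(sample["input_ids"]) <= c), BUCKET_CAPS[-1])
        buckets[cap].append(sample)
    return buckets

def sequential_bucketed_training(samples, train_epoch):
    """One pass over the data, one bucket at a time, shortest first.

    train_epoch(bucket, max_len=...) is a stand-in for the caller's
    trainer. Padding each bucket only to its own cap, instead of to
    131_072 tokens for every batch, is where a 2-3x wall-clock saving
    could plausibly come from, since attention cost grows with length.
    """
    for cap, bucket in sorted(assign_buckets(samples).items()):
        train_epoch(bucket, max_len=cap)

def majority_at_k(final_answers):
    """maj@k as used in the abstract: the most frequent final answer
    among k sampled solutions counts as the model's prediction."""
    return Counter(final_answers).most_common(1)[0][0]
```

Under this reading, "100% maj@16 on AIME 2024 and 2025" means that for every problem, the modal answer across 16 sampled TIR solutions was the correct one.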
Related papers
- WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning [51.13280433665446]
Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics. In wireless communications, where problems require precise manipulation of information-theoretic bounds, even state-of-the-art models struggle to achieve competent performance. We present WirelessMathLM, demonstrating that compact models (0.5B-7B parameters) can match or exceed much larger models through domain-specific reinforcement learning.
arXiv Detail & Related papers (2025-09-27T09:58:03Z) - SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers [13.763623961742391]
We introduce SAND-Math (Synthetic, Augmented, Novel and Difficult mathematics problems and solutions), a pipeline that generates high-quality problems from scratch. We demonstrate the effectiveness of our approach through two key findings.
arXiv Detail & Related papers (2025-07-28T05:17:48Z) - DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning [95.31714779585272]
DeepMath-103K is a large-scale mathematical dataset designed with high difficulty (primarily levels 5-9). It includes rigorous decontamination against numerous benchmarks, and verifiable answers for rule-based RL rewards (a minimal sketch of such a reward check appears after this list). DeepMath-103K fosters the development of generalizable, advanced reasoning.
arXiv Detail & Related papers (2025-04-15T17:59:51Z) - Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z) - PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [59.920971312822736]
We introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems. The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction. Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods.
arXiv Detail & Related papers (2025-03-04T06:32:30Z) - UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models [11.964085209696051]
UGMathBench comprises 5,062 problems across 16 subjects and 111 topics, featuring 10 distinct answer types. Each problem includes three randomized versions, with additional versions planned for release as leading open-source LLMs become saturated on UGMathBench. Our evaluation of 23 leading LLMs reveals that the highest EAcc robustness achieved is 56.3% by OpenAI-o1-mini, with large Δ values observed across different models.
arXiv Detail & Related papers (2025-01-23T15:46:43Z) - UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts [7.856746367263317]
This paper introduces the UTMath Benchmark, a robust evaluation framework designed to assess Large Language Models. It comprises 1,053 cutting-edge problems spanning nine mathematical domains, with an average of 68 test cases per problem. The best-performing model, o1-mini, solves only 32.57% of the problems, followed by o1-preview at 27.16% and GPT-4o at 26.93%.
arXiv Detail & Related papers (2024-11-11T18:59:02Z) - InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning [13.728595670907136]
We introduce InfinityMATH, a scalable instruction tuning dataset for programmatic mathematical reasoning.
Fine-tuning experiments with open-source language and code models, such as Llama2 and CodeLlama, demonstrate the practical benefits of InfinityMATH.
arXiv Detail & Related papers (2024-08-09T08:18:20Z) - DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning [24.68321102981711]
We introduce a series of large language models (LLMs) that employ the Decomposition of thought with code assistance and self-correction for mathematical reasoning, dubbed DotaMath.
DotaMath models tackle complex mathematical tasks by decomposing them into simpler logical subtasks, leveraging code to solve these subtasks, and engaging in self-reflection and correction.
We train a series of base LLMs using imitation learning on DotaMathQA, resulting in DotaMath models that achieve remarkable performance compared to open-source LLMs.
arXiv Detail & Related papers (2024-07-04T17:39:16Z) - Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models [62.815222721144636]
We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned with MathV360K.
This novel approach significantly improves the multimodal mathematical reasoning capabilities of LLaVA-1.5.
Math-LLaVA demonstrates enhanced generalizability, showing substantial improvements on the MMMU benchmark.
arXiv Detail & Related papers (2024-06-25T05:43:21Z) - MathScale: Scaling Instruction Tuning for Mathematical Reasoning [70.89605383298331]
Large language models (LLMs) have demonstrated remarkable capabilities in problem-solving.
However, their proficiency in solving mathematical problems remains inadequate.
We propose MathScale, a simple and scalable method to create high-quality mathematical reasoning data.
arXiv Detail & Related papers (2024-03-05T11:42:59Z)
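Several entries in the list above lean on automatically checkable answers; the DeepMath-103K summary in particular cites verifiable answers for rule-based RL rewards. The sketch below shows the general shape of such a check under assumed conventions (a final \boxed{...} answer and naive string normalization); it is not that dataset's actual grader.

```python
# Minimal sketch of a rule-based reward over verifiable final answers.
# The \boxed{} extraction and the normalization rules are illustrative
# assumptions, not DeepMath-103K's actual grading logic.
import re

def extract_boxed(solution_text):
    """Return the contents of the last \\boxed{...} span, if any."""
    spans = re.findall(r"\\boxed\{([^{}]*)\}", solution_text)
    return spans[-1] if spans else None

def normalize(answer):
    """Naive normalization: drop whitespace and case differences."""
    return answer.replace(" ", "").strip().lower() if answer else None

def rule_based_reward(model_output, gold_answer):
    """Binary RL reward: 1.0 on a normalized exact match, else 0.0."""
    predicted = normalize(extract_boxed(model_output))
    return 1.0 if predicted is not None and predicted == normalize(gold_answer) else 0.0
```

Production graders typically layer symbolic equivalence checks (e.g., via a computer algebra system) on top of string matching; the binary exact-match form above is only the simplest verifiable variant.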