Fugu-MT Paper Translation (Abstract): ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning

Paper abstract: ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning

arxiv url: http://arxiv.org/abs/2509.21070v1
Date: Thu, 25 Sep 2025 12:22:44 GMT
Status: Translation complete
System update date: 2025-09-26 20:58:12.895822
Title: ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning
Title (reference translation): ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning
Authors: Qizhi Pei, Zhuoshi Pan, Honglin Lin, Xin Gao, Yu Li, Zinan Tang, Conghui He, Rui Yan, Lijun Wu
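Abstract summary: Large reasoning models (LRMs) have shown remarkable capabilities in complex problem solving. We propose ScaleDiff, a pipeline designed to scale the creation of difficult problems. We show that our pipeline can effectively transfer advanced reasoning capabilities without relying on larger, more expensive teacher models.
Reference score (independently computed attention score): 51.946959481392064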
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Reasoning Models (LRMs) have shown impressive capabilities in complex problem-solving, often benefiting from training on difficult mathematical problems that stimulate intricate reasoning. Recent efforts have explored automated synthesis of mathematical problems by prompting proprietary models or large-scale open-source models from seed data or inherent mathematical concepts. However, scaling up these methods remains challenging due to their high computational/API cost, complexity of prompting, and limited difficulty level of the generated problems. To overcome these limitations, we propose ScaleDiff, a simple yet effective pipeline designed to scale the creation of difficult problems. We efficiently identify difficult problems from existing datasets with only a single forward pass using an adaptive thinking model, which can perceive problem difficulty and automatically switch between "Thinking" and "NoThinking" modes. We then train a specialized difficult problem generator (DiffGen-8B) on this filtered difficult data, which can produce new difficult problems in large scale, eliminating the need for complex, per-instance prompting and its associated high API costs. Fine-tuning Qwen2.5-Math-7B-Instruct on the ScaleDiff-Math dataset yields a substantial performance increase of 11.3% compared to the original dataset and achieves a 65.9% average accuracy on AIME'24, AIME'25, HMMT-Feb'25, BRUMO'25, and MATH500, outperforming recent strong LRMs like OpenThinker3. Notably, this performance is achieved using the cost-efficient Qwen3-8B model as a teacher, demonstrating that our pipeline can effectively transfer advanced reasoning capabilities without relying on larger, more expensive teacher models. Furthermore, we observe a clear scaling phenomenon in model performance on difficult benchmarks as the quantity of difficult problems increases. Code: https://github.com/QizhiPei/ScaleDiff.
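Abstract (reference translation): Large reasoning models (LRMs) have shown remarkable capabilities in complex problem solving, often benefiting from training on difficult mathematical problems that stimulate intricate reasoning. Recent work has explored the automated synthesis of mathematical problems by prompting proprietary models or large-scale open-source models from seed data or inherent mathematical concepts. However, scaling up these methods remains difficult because of their high computational/API cost, the complexity of prompting, and the limited difficulty of the generated problems. To overcome these limitations, we propose ScaleDiff, a simple yet effective pipeline designed to scale the creation of difficult problems. Using an adaptive thinking model that can perceive problem difficulty and automatically switch between "Thinking" and "NoThinking" modes, we identify difficult problems in existing datasets with only a single forward pass. We then train a specialized difficult-problem generator (DiffGen-8B) on this filtered difficult data; it can produce new difficult problems at scale, eliminating the need for complex per-instance prompting and its associated high API costs. Fine-tuning Qwen2.5-Math-7B-Instruct on the ScaleDiff-Math dataset yields a substantial performance gain of 11.3% over the original dataset and achieves 65.9% average accuracy on AIME'24, AIME'25, HMMT-Feb'25, BRUMO'25, and MATH500. Notably, this performance is achieved using the cost-efficient Qwen3-8B model as the teacher, demonstrating that our pipeline can effectively transfer advanced reasoning capabilities without relying on larger, more expensive teacher models. Furthermore, we observe a clear scaling phenomenon in model performance as the number of difficult problems increases. Code: https://github.com/QizhiPei/ScaleDiff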
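
The difficulty-filtering step described in the abstract can be illustrated with a minimal sketch. It assumes that a problem counts as difficult when the adaptive thinking model selects the "Thinking" mode, which the abstract implies but does not state. The names `filter_difficult`, `predict_mode`, and `toy_predict_mode` are hypothetical, and the toy classifier is only a stand-in for a real model call, not the authors' implementation.

```python
from typing import Callable, Iterable, List

THINKING = "Thinking"
NO_THINKING = "NoThinking"


def filter_difficult(
    problems: Iterable[str],
    predict_mode: Callable[[str], str],
) -> List[str]:
    """Keep the problems for which the mode predictor returns "Thinking".

    `predict_mode` stands in for one forward pass of an adaptive thinking
    model; its interface here is an assumption made for illustration.
    """
    return [p for p in problems if predict_mode(p) == THINKING]


def toy_predict_mode(problem: str) -> str:
    # Illustrative stand-in only: treats problems longer than eight words
    # as difficult. A real pipeline would query the model instead.
    return THINKING if len(problem.split()) > 8 else NO_THINKING


problems = [
    "Compute 2 + 3.",
    "Find the number of ordered pairs of positive integers (a, b) "
    "with a + b = 100 and gcd(a, b) = 1.",
]

difficult = filter_difficult(problems, toy_predict_mode)
for p in difficult:
    print(p)
```

Passing the mode predictor as a callable keeps the filter independent of any particular model or inference library, so the stub can be swapped for a real model call without changing the filtering logic.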


Related papers