Fugu-MT Paper Translation (Abstract): The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning

Paper summary: The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning

arxiv url: http://arxiv.org/abs/2606.16152v1
Date: Mon, 15 Jun 2026 03:13:07 GMT
Status: Translation complete
Last updated in system: 2026-06-16 16:21:34.047962
Title: The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning
Authors: Haolong Qian, Xianliang Yang, Yinuo Ma, Lirong Che, Feng Lu, Ye Guo, Lei Song, Jiang Bian, Chun Yuan
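Abstract summary: Data refined or synthesized by a stronger Oracle obtains higher perceived quality according to reward models. The analysis shows that Oracle refinement couples logical repair with distributional drift away from the SLM's native reasoning distribution. These findings suggest that effective mathematical reasoning distillation should jointly optimize perceived solution quality and learner-data compatibility.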
Reference score (independently computed attention score): 54.477658074293885
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Knowledge distillation from powerful reasoning models is widely used to improve Small Language Models (SLMs) on mathematical reasoning, often assuming that traces with higher reward model scores provide more useful supervision. We identify a counterintuitive Quality-Utility Paradox in mathematical reasoning distillation. Data refined or synthesized by a stronger Oracle obtains higher perceived quality according to reward models, yet consistently underperforms traces generated by the SLM itself and selected through rejection sampling across Qwen2.5, LLaMA-3, and DeepSeek families. Our analysis shows that Oracle refinement couples logical repair with distributional drift away from the SLM's native reasoning distribution. This drift increases the learner's adaptation cost and can outweigh the benefit of improved reasoning logic. To test this mechanism, we introduce Style-Aligned Refinement, which preserves the native trajectory of the SLM while retaining logical repair from the Oracle. This intervention lowers adaptation cost and restores downstream utility. These findings suggest that effective mathematical reasoning distillation should jointly optimize perceived solution quality and learner-data compatibility, rather than relying solely on reward-model scores. The datasets and code are available at https://github.com/Dracoqhl/Quality-Utility-Paradox.
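The abstract's baseline, traces generated by the SLM itself and selected through rejection sampling, can be illustrated with a minimal sketch. This is not the authors' code. It assumes that a trace is accepted when its final answer matches a reference answer (the abstract does not state the acceptance criterion), and the names `rejection_sample`, `extract_final_answer`, and `toy_slm`, as well as the `Answer:` tag convention, are hypothetical stand-ins.

```python
from typing import Callable, List


def extract_final_answer(trace: str) -> str:
    """Hypothetical convention: the final answer follows the last 'Answer:' tag."""
    return trace.rsplit("Answer:", 1)[-1].strip()


def rejection_sample(
    problem: str,
    reference_answer: str,
    generate: Callable[[str, int], str],
    num_samples: int = 4,
) -> List[str]:
    """Sample traces from the learner and keep only those with a correct answer.

    Assumption: correctness of the final answer is the acceptance test.
    """
    kept = []
    for seed in range(num_samples):
        trace = generate(problem, seed)
        if extract_final_answer(trace) == reference_answer:
            kept.append(trace)
    return kept


def toy_slm(problem: str, seed: int) -> str:
    """Stand-in for sampling from an SLM: even seeds answer correctly."""
    answer = "4" if seed % 2 == 0 else "5"
    return f"{problem} Add the two numbers. Answer: {answer}"


if __name__ == "__main__":
    traces = rejection_sample("What is 2 + 2?", "4", toy_slm, num_samples=4)
    for trace in traces:
        print(trace)
```

Because the kept traces come from the learner's own sampler, they stay within its native reasoning distribution, which is the property the abstract contrasts with Oracle-refined data.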


Related papers