Fugu-MT Paper Translation (Abstract): Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning

Paper abstract: Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning

arxiv url: http://arxiv.org/abs/2601.04805v1
Date: Thu, 08 Jan 2026 10:38:41 GMT
Status: Translation complete
System update date: 2026-01-09 17:01:53.162282
Title: Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning
Authors: Siyuan Gan, Jiaheng Liu, Boyan Wang, Tianpei Yang, Runqing Miao, Yuyao Zhang, Fanyu Meng, Junlan Feng, Linjian Meng, Jing Huo, Yang Gao
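Abstract summary: Thinking-Based Non-Thinking (TNT) sets a different maximum token usage for non-thinking responses depending on the query. Experiments on five mathematical benchmarks show that TNT reduces token usage by around 50%. The probability of reward hacking in TNT responses classified as not using thinking remains below 10%.
Reference score (independently calculated attention score): 57.57084309580296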
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large reasoning models (LRMs) have attracted much attention due to their exceptional performance. However, their performance mainly stems from thinking, a long Chain of Thought (CoT), which significantly increases computational overhead. To address this overthinking problem, existing work focuses on using reinforcement learning (RL) to train hybrid reasoning models that automatically decide whether to engage in thinking or not based on the complexity of the query. Unfortunately, RL training suffers from the reward hacking problem, e.g., the model engages in thinking but is judged as not doing so, resulting in incorrect rewards. To mitigate this problem, existing works either employ supervised fine-tuning (SFT), which incurs high computational costs, or enforce uniform token limits on non-thinking responses, which mitigates the problem only partially. In this paper, we propose Thinking-Based Non-Thinking (TNT). It does not employ SFT, and it sets a different maximum token usage for non-thinking responses to each query by leveraging information from the solution component of the responses that use thinking. Experiments on five mathematical benchmarks demonstrate that TNT reduces token usage by around 50% compared to DeepSeek-R1-Distill-Qwen-1.5B/7B and DeepScaleR-1.5B, while significantly improving accuracy. In fact, TNT achieves the best trade-off between accuracy and efficiency among all tested methods. Additionally, the probability of reward hacking in TNT's responses that are classified as not using thinking remains below 10% across all tested datasets.
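The abstract describes the per-query token limit only at a high level, so the following Python sketch is an illustration under stated assumptions, not the authors' implementation. It assumes that thinking responses mark the end of the reasoning with a `</think>` delimiter, that whitespace splitting stands in for a real tokenizer, and that the limit is the longest solution part among the thinking responses sampled for the same query. The function names and the `default_limit` fallback are hypothetical.

```python
THINK_END = "</think>"  # assumed delimiter between the thinking and solution parts


def solution_part(response: str) -> str:
    """Return the text after the thinking segment, or the whole response if none."""
    _, sep, tail = response.partition(THINK_END)
    return tail.strip() if sep else response.strip()


def count_tokens(text: str) -> int:
    """Whitespace token count; a stand-in for a real tokenizer (assumption)."""
    return len(text.split())


def non_thinking_limit(thinking_responses: list[str], default_limit: int = 256) -> int:
    """Per-query token limit for non-thinking responses.

    Assumption: use the longest solution part among the thinking responses
    sampled for this query. The paper may aggregate differently.
    """
    lengths = [count_tokens(solution_part(r)) for r in thinking_responses]
    return max(lengths) if lengths else default_limit


def exceeds_limit(non_thinking_response: str, limit: int) -> bool:
    """Flag a response labeled non-thinking that is longer than the query's limit."""
    return count_tokens(non_thinking_response) > limit
```

Under these assumptions, a response labeled as non-thinking that runs longer than the solution parts of the thinking responses for the same query is treated as a likely case of hidden reasoning. A short query therefore gets a tight limit and a long one gets a looser limit, in contrast to the uniform limit the abstract attributes to earlier work.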


Related papers