Fugu-MT Paper Translation (Abstract): Wait, Wait, Wait... Why Do Reasoning Models Loop?

Paper abstract: Wait, Wait, Wait... Why Do Reasoning Models Loop?

arxiv url: http://arxiv.org/abs/2512.12895v1
Date: Mon, 15 Dec 2025 00:44:54 GMT
Status: Translation complete
Last updated in system: 2025-12-16 17:54:56.486384
Title: Wait, Wait, Wait... Why Do Reasoning Models Loop?
Title (reference translation): Wait, Wait, Wait... Why Do Reasoning Models Loop?
Authors: Charilaos Pipis, Shivam Garg, Vasilis Kontonis, Vaishnavi Shrivastava, Akshay Krishnamurthy, Dimitris Papailiopoulos
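Abstract summary: Reasoning models often loop, repeating the same text at low temperatures or with greedy decoding. With open reasoning models, looping is common at low temperature. This points to mismatches between the training distribution and the learned model, which the authors call errors in learning.
Reference score (attention score computed independently by the site): 38.291893062636035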
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reasoning models (e.g., DeepSeek-R1) generate long chains of thought to solve harder problems, but they often loop, repeating the same text at low temperatures or with greedy decoding. We study why this happens and what role temperature plays. With open reasoning models, we find that looping is common at low temperature. Larger models tend to loop less, and distilled students loop significantly even when their teachers rarely do. This points to mismatches between the training distribution and the learned model, which we refer to as errors in learning, as a key cause. To understand how such errors cause loops, we introduce a synthetic graph reasoning task and demonstrate two mechanisms. First, risk aversion caused by hardness of learning: when the correct progress-making action is hard to learn but an easy cyclic action is available, the model puts relatively more probability on the cyclic action and gets stuck. Second, even when there is no hardness, Transformers show an inductive bias toward temporally correlated errors, so the same few actions keep being chosen and loops appear. Higher temperature reduces looping by promoting exploration, but it does not fix the errors in learning, so generations remain much longer than necessary at high temperature; in this sense, temperature is a stopgap rather than a holistic solution. We end with a discussion of training-time interventions aimed at directly reducing errors in learning.
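Abstract (reference translation): Reasoning models (e.g., DeepSeek-R1) generate long chains of thought to solve hard problems, but they often loop, repeating the same text at low temperatures or with greedy decoding. We study why this happens and what role temperature plays. With open reasoning models, looping is common at low temperature. Larger models tend to loop less, and distilled students loop significantly even when their teachers rarely do. This points to mismatches between the training distribution and the learned model, which we call errors in learning, as a key cause. To understand how such errors cause loops, we introduce a synthetic graph reasoning task and demonstrate two mechanisms. First, risk aversion caused by hardness of learning: when the correct progress-making action is hard to learn but an easy cyclic action is available, the model puts relatively more probability on the cyclic action and gets stuck. Second, even without hardness, Transformers show an inductive bias toward temporally correlated errors, so the same few actions keep being chosen and loops appear. Higher temperature reduces looping by promoting exploration, but it does not fix the errors in learning, so generations remain longer than necessary at high temperature. We end with a discussion of training-time interventions aimed at directly reducing errors in learning.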
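
The role of temperature described in the abstract can be illustrated with a minimal sketch. This is not the paper's synthetic graph reasoning task. It is a hypothetical two-action toy in which the learned model slightly prefers a cyclic action over the progress-making one. The logit values, the names `TOY_LOGITS`, `softmax_with_temperature`, and `expected_steps_to_progress`, and the assumption of independent draws at each step are all assumptions made for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for logits divided by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; greedy decoding is the argmax limit")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy state: index 0 is the cyclic action, index 1 is the
# progress-making action. The model slightly prefers the cyclic action,
# standing in for an "error in learning".
TOY_LOGITS = [1.0, 0.0]


def expected_steps_to_progress(logits, temperature):
    """Mean number of independent draws until the progress action is sampled."""
    p_progress = softmax_with_temperature(logits, temperature)[1]
    return 1.0 / p_progress  # mean of a geometric distribution


if __name__ == "__main__":
    for t in (0.1, 0.5, 1.0, 2.0, 100.0):
        p_cyclic, _ = softmax_with_temperature(TOY_LOGITS, t)
        steps = expected_steps_to_progress(TOY_LOGITS, t)
        print(f"T={t}: p(cyclic)={p_cyclic:.4f}, expected steps={steps:.2f}")
```

In this toy, lowering the temperature pushes nearly all probability onto the cyclic action, so the expected number of steps before progress grows very large. Raising the temperature shortens the wait, but the expected number of steps stays above 2 because the cyclic action keeps the larger logit, whereas a model that had learned the progress action correctly would need about one step. That matches the abstract's point that temperature reduces looping without fixing the underlying errors.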

Related papers