Fugu-MT Paper Translation (Abstract): The Markovian Thinker

Paper abstract: The Markovian Thinker

arxiv url: http://arxiv.org/abs/2510.06557v1
Date: Wed, 08 Oct 2025 01:18:13 GMT
Status: Translation complete
Last updated in system: 2025-10-09 16:41:20.252568
Title: The Markovian Thinker
Title (reference translation): The Markovian Thinker
Authors: Milad Aghajohari, Kamran Chitsaz, Amirhossein Kazemnejad, Sarath Chandar, Alessandro Sordoni, Aaron Courville, Siva Reddy
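Abstract summary: Reinforcement learning (RL) has become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). However, the standard RL "thinking environment", in which the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state.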
Reference score (independently computed attention measure): 70.4118072391945
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.
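The scaling claim can be illustrated with a simplified count that considers only attention cost and ignores constant factors. The symbols $n$ (number of thinking tokens), $p$ (prompt length) and $C$ (chunk size) are notation introduced here for illustration and do not appear in the abstract. When every new token attends to the prompt and all earlier thinking tokens, the per-token cost grows with the trace; when the context is capped at $p + C$ tokens, it does not:

\[
\underbrace{\sum_{t=1}^{n} O(p + t) = O\!\left(np + n^{2}\right)}_{\text{LongCoT: unbounded state}}
\qquad\text{vs.}\qquad
\underbrace{\sum_{t=1}^{n} O(p + C) = O\!\left(n\,(p + C)\right)}_{\text{Markovian: constant-size state}}
\]

Under the same assumptions, the key-value cache holds $O(p + n)$ entries in the first case and $O(p + C)$ in the second, so for fixed $p$ and $C$ the total compute is linear in $n$ and the memory is constant. This is only an asymptotic sketch; the 27 versus 7 H100-month figures in the abstract are the authors' empirical estimates and are not derived from it.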
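Abstract (reference translation): Reinforcement learning (RL) has become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). However, the standard RL "thinking environment", in which the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts grow longer. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As a consequence, this yields linear compute with constant memory. We instantiate the idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. At each chunk boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write, near the end of each chunk, a textual state sufficient to continue reasoning seamlessly after the reset. An R1-Distill 1.5B model trained in this environment reasons in 8K-token chunks yet thinks for up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink keeps improving where LongCoT plateaus. The effect of linear compute is substantial: at an average thinking length of 96K, LongCoT-RL is estimated to cost 27 H100-months versus 7 for Delethink. Analysis at RL initialization shows that off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. The results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.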


Related papers list