Fugu-MT paper translation (abstract): Rethinking Thinking Tokens: LLMs as Improvement Operators

Paper summary: Rethinking Thinking Tokens: LLMs as Improvement Operators

arxiv url: http://arxiv.org/abs/2510.01123v1
Date: Wed, 01 Oct 2025 17:08:59 GMT
Status: translation complete
System update date: 2025-10-03 16:59:20.685107
Title: Rethinking Thinking Tokens: LLMs as Improvement Operators
Title (reference translation): Rethinking Thinking Tokens - LLMs as Improvement Operators
Authors: Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, Anirudh Goyal
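Abstract summary: Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which lets them explore solution strategies with self-checking. This yields higher accuracy but inflates context length, token/compute cost, and answer latency. Can current models leverage their metacognition to provide other combinations on this Pareto frontier? Parallel-Distill-Refine (PDR) (i) generates diverse drafts in parallel, (ii) distills them into a bounded textual workspace, and (iii) refines conditioned on this workspace.
Reference score (independently computed attention score): 80.12087211785949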
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which, among other things, allows them to explore solution strategies with self-checking. This results in higher accuracy, but inflates context length, token/compute cost, and answer latency. We ask: Can current models leverage their metacognition to provide other combinations on this Pareto frontier, e.g., better accuracy with lower context length and/or latency? Abstractly, we view the model as an improvement operator on its own "thoughts" with a continuum of possible strategies. We identify an interesting inference family, Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace, producing an output that seeds the next round. Importantly, context length (hence compute cost) is controllable via degree of parallelism, and is no longer conflated with the total number of generated tokens. We report PDR instantiations of current models that give better accuracy than long CoT while incurring lower latency. Setting degree of parallelism to 1 yields an interesting subcase, Sequential Refinement (SR) (iteratively improve a single candidate answer), which provides performance superior to long CoT. Success of such model orchestrations raises the question of whether further training could shift the Pareto frontier. To this end, we train an 8B thinking model with Reinforcement Learning (RL) to make it consistent with PDR as the inference method. On math tasks with verifiable answers, iterative pipelines surpass single-pass baselines at matched sequential budgets, with PDR delivering the largest gains (e.g., +11% on AIME 2024 and +9% on AIME 2025).
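Abstract (reference translation): Reasoning training gives LLMs an incentive to produce long chains of thought (long CoT). This raises accuracy but inflates context length, token/compute cost, and answer latency. Can current models use their metacognition to offer other combinations on this Pareto frontier? Abstractly, we regard the model as an improvement operator on its own "thoughts", with a continuum of possible strategies. We identify an interesting inference family, Parallel-Distill-Refine (PDR), which proceeds as follows: (i) generate diverse drafts in parallel; (ii) distill them into a bounded textual workspace; and (iii) refine conditioned on this workspace, producing an output that seeds the next round. Importantly, context length (and hence compute cost) can be controlled through the degree of parallelism and is no longer conflated with the total number of generated tokens. We report PDR instantiations of current models that are more accurate than long CoT while incurring lower latency. Setting the degree of parallelism to 1 yields an interesting subcase, Sequential Refinement (SR), which outperforms long CoT. The success of such model orchestrations raises the question of whether further training could shift the Pareto frontier. To this end, we train an 8B thinking model with Reinforcement Learning (RL) so that it is consistent with PDR as the inference method. On math tasks with verifiable answers, iterative pipelines surpass single-pass baselines at matched sequential budgets, with PDR delivering the largest gains (e.g., +11% on AIME 2024 and +9% on AIME 2025).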
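
The three PDR steps described in the abstract can be sketched as a short control loop. This is only an illustrative sketch, not the authors' implementation: the `generate` callable, the prompt wording, the character-based budget, and the truncation-based `distill` helper are all assumptions introduced here, since the abstract does not specify how drafts are distilled or prompted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical model interface: takes a prompt string, returns generated text.
Generate = Callable[[str], str]


def distill(drafts: List[str], budget_chars: int) -> str:
    """Compress drafts into a bounded textual workspace.

    Assumption: truncated concatenation stands in for the real
    distillation step, which the abstract does not describe.
    """
    return "\n---\n".join(drafts)[:budget_chars]


def pdr(problem: str, generate: Generate, parallelism: int = 4,
        rounds: int = 2, budget_chars: int = 2000) -> str:
    """Run parallel-distill-refine rounds and return the last refined output."""
    workspace = ""  # the first round starts from an empty workspace
    answer = ""
    for _ in range(rounds):
        draft_prompt = (f"Problem: {problem}\nWorkspace: {workspace}\n"
                        "Draft a solution.")
        # (i) generate drafts in parallel from the same bounded prompt
        with ThreadPoolExecutor(max_workers=parallelism) as pool:
            drafts = list(pool.map(generate, [draft_prompt] * parallelism))
        # (ii) distill the drafts into a bounded workspace
        workspace = distill(drafts, budget_chars)
        # (iii) refine conditioned on that workspace
        answer = generate(f"Problem: {problem}\nWorkspace: {workspace}\n"
                          "Refine into a final answer.")
        # the refined output, still bounded, seeds the next round
        workspace = answer[:budget_chars]
    return answer
```

In this sketch every prompt contains at most `budget_chars` characters of workspace, so the context seen by any single call stays bounded no matter how many drafts or rounds are run. Calling `pdr(..., parallelism=1)` reduces the loop to iteratively improving a single candidate, which corresponds to the Sequential Refinement (SR) subcase mentioned in the abstract.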

Related papers