Fugu-MT Paper Translation (Abstract): Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR

Paper abstract: Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR

arxiv url: http://arxiv.org/abs/2510.08892v1
Date: Fri, 10 Oct 2025 01:11:42 GMT
Status: Translation complete
Last updated in system: 2025-10-14 00:38:47.909314
Title: Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR
Title (reference translation): Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR
Authors: Haomin Zhuang, Yujun Zhou, Taicheng Guo, Yue Huang, Fangxu Liu, Kai Song, Xiangliang Zhang
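Abstract summary: This paper proposes a complementary approach that explicitly promotes exploration during sampling by applying different temperature settings to different tokens. Specifically, it uses higher temperatures for reasoning tokens to actively encourage exploration, while keeping lower temperatures for knowledge tokens to maintain factual correctness.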
Reference score (independently computed attention metric): 32.766524277613826
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning has demonstrated substantial improvements in the reasoning abilities of Large Language Models (LLMs), exhibiting significant applicability across various domains. Recent research has identified that tokens within LLMs play distinct roles during reasoning tasks, categorizing them into high-entropy reasoning tokens and low-entropy knowledge tokens. Prior approaches have typically focused on restricting updates to indirectly encourage exploration, yet they do not explicitly facilitate exploratory behavior during the token generation stage itself. In this work, we introduce a complementary approach that explicitly promotes exploration during sampling by applying distinct temperature settings for different token types. Specifically, our method employs higher temperatures for reasoning tokens to actively encourage exploration, while retaining lower temperatures for knowledge tokens to maintain factual correctness. Furthermore, we systematically investigate various multi-temperature scheduling strategies and their impacts within reinforcement learning contexts. Empirical evaluations on several reasoning benchmarks demonstrate that our approach significantly enhances the reasoning performance of LLMs. The code is available at https://github.com/zhmzm/Multi_Temperature_Verl.git.
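Abstract (reference translation): Reinforcement learning has substantially improved the reasoning abilities of Large Language Models (LLMs) and has been shown to be applicable across various domains. Recent research finds that tokens within LLMs play different roles in reasoning tasks and classifies them into high-entropy reasoning tokens and low-entropy knowledge tokens. Prior approaches have typically focused on restricting updates to encourage exploration indirectly, but they do not explicitly promote exploratory behavior in the token generation stage itself. In this work, we propose a complementary approach that explicitly promotes exploration during sampling by applying different temperature settings to different tokens. Specifically, it uses higher temperatures for reasoning tokens to actively encourage exploration, while keeping lower temperatures for knowledge tokens to maintain factual correctness. Furthermore, we systematically investigate multi-temperature scheduling strategies and their effects in reinforcement learning settings. The proposed method is shown to significantly improve the reasoning performance of LLMs. The code is available at https://github.com/zhmzm/Multi_Temperature_Verl.git.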
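
The token-level idea in the abstract (a higher temperature for high-entropy reasoning tokens, a lower one for low-entropy knowledge tokens) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the entropy threshold, the two temperature values, and the function names `pick_temperature` and `sample_token` are all hypothetical choices made for illustration, and the abstract does not say how the paper actually separates the two token types.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pick_temperature(logits, entropy_threshold=1.0,
                     t_reasoning=1.2, t_knowledge=0.8):
    """Choose a per-token temperature from the entropy at temperature 1.

    Assumption: a position whose distribution has entropy above the
    threshold is treated as a "reasoning" token, otherwise as a
    "knowledge" token. Threshold and temperatures are illustrative only.
    """
    h = entropy(softmax(logits))
    return t_reasoning if h > entropy_threshold else t_knowledge


def sample_token(logits, rng, **temperature_kwargs):
    """Sample one token index using the entropy-dependent temperature."""
    t = pick_temperature(logits, **temperature_kwargs)
    probs = softmax(logits, temperature=t)
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, t
```

For example, `sample_token([0.0] * 8, random.Random(0))` sees a flat distribution (entropy ln 8, above the assumed threshold) and samples with the higher temperature, whereas a sharply peaked logit vector such as `[10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]` falls below the threshold and is sampled with the lower one.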

Related papers