Fugu-MT Paper Translation (Summary): Entropy Centroids as Intrinsic Rewards for Test-Time Scaling

Paper summary: Entropy Centroids as Intrinsic Rewards for Test-Time Scaling

arxiv url: http://arxiv.org/abs/2604.26173v2
Date: Fri, 01 May 2026 11:48:14 GMT
Status: Translation complete
Last updated in system: 2026-05-04 13:37:10.916498
Title: Entropy Centroids as Intrinsic Rewards for Test-Time Scaling
Title (reference translation): Entropy Centroids as Intrinsic Rewards in Test-Time Scaling
Authors: Wenshuo Zhao, Qi Zhu, Xingshan Zeng, Fei Mi, Lifeng Shang, Yi R. Fung
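Abstract summary: An effective way to scale up the test-time compute of large language models is to sample multiple responses and select the best one. Earlier approaches have explored intrinsic signals such as confidence and entropy, but these signals are noisy under naive aggregation. The proposed method selects the response with the lowest entropy centroid among multiple candidates.
Reference score (independently computed attention score): 40.3805653951238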
License: http://creativecommons.org/licenses/by/4.0/
Abstract: An effective way to scale up test-time compute of large language models is to sample multiple responses and then select the best one, as in Grok Heavy and Gemini Deep Think. Existing selection methods often rely on external reward models, which requires training a strong reward model and introduces additional computation overhead. As an alternative, previous approaches have explored intrinsic signals, such as confidence and entropy, but these signals are noisy with naive aggregation. In this work, we observe that high-entropy tokens tend to cluster into consecutive groups during inference, providing a more stable notion of model uncertainty than individual tokens. Together, these clusters reveal temporal patterns of model uncertainty throughout the inference process. Motivated by this observation, we propose to use the temporal structure of uncertainty as an intrinsic reward. To this end, we first formalize the basic unit of segment-level uncertainty as the High Entropy Phase (HEP), a variable-length segment that begins at a high-entropy token and ends when consecutive low-entropy tokens appear. We then define the Entropy Centroid, inspired by the concept of the center of mass in physics, as the weighted average position of all HEPs along the trajectory. Intuitively, a lower centroid indicates early exploration followed by confident generation, which we find often corresponds to higher response quality. Based on this insight, we propose the Lowest Centroid method, which selects the response with the lowest entropy centroid among multiple candidates. Experiments on mathematics, code generation, logical reasoning, and agentic tasks, across model scales ranging from 14B to 480B, show that Lowest Centroid consistently outperforms existing baselines and delivers stable gains as model size increases. Code is available at https://github.com/hkust-nlp/entropy-centroid.
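Abstract (reference translation): An effective way to scale up the test-time compute of large language models is to sample multiple responses and select the best one, as Grok Heavy and Gemini Deep Think do. Existing selection methods often depend on external reward models, which means training a strong reward model and incurring additional computational overhead. As an alternative, earlier approaches have explored intrinsic signals such as confidence and entropy, but these signals are noisy under naive aggregation. This work observes that high-entropy tokens tend to cluster into consecutive groups during inference, giving a more stable notion of model uncertainty than individual tokens do. Taken together, these clusters reveal temporal patterns of model uncertainty across the inference process. Motivated by this observation, the authors propose using the temporal structure of uncertainty as an intrinsic reward. To that end, they first formalize the basic unit of segment-level uncertainty as the High Entropy Phase (HEP), a variable-length segment that begins at a high-entropy token and ends when consecutive low-entropy tokens appear. They then define the Entropy Centroid, inspired by the concept of the center of mass in physics, as the weighted average position of all HEPs along the trajectory. Intuitively, a lower centroid indicates early exploration followed by confident generation, which often corresponds to higher response quality. Building on this insight, they propose the Lowest Centroid method, which selects the response with the lowest entropy centroid among multiple candidates. Experiments on mathematics, code generation, logical reasoning, and agentic tasks, across model scales from 14B to 480B, show that Lowest Centroid consistently outperforms existing baselines and delivers stable gains as model size increases. Code is available at https://github.com/hkust-nlp/entropy-centroid.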
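The abstract describes three steps: segment the per-token entropies into High Entropy Phases, take a weighted average of the phase positions, and pick the candidate with the lowest value. The Python sketch below illustrates that pipeline under assumptions the abstract does not specify: the entropy `threshold`, the number of consecutive low-entropy tokens (`patience`) that closes a phase, the use of summed entropy as the phase weight, the phase midpoint as its position, and normalization by response length are all hypothetical choices, as are the function names. It is not the authors' implementation; the linked repository is the place to look for the actual definitions.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Phase:
    start: int    # index of the first high-entropy token in the phase
    end: int      # index one past the last high-entropy token in the phase
    mass: float   # summed token entropy in [start, end) (assumed weight)


def find_heps(entropies: Sequence[float],
              threshold: float = 1.0,
              patience: int = 3) -> List[Phase]:
    """Split a trajectory into high-entropy phases.

    A phase opens at a token with entropy >= threshold and closes once
    `patience` consecutive tokens fall below it (both values are assumptions).
    """
    phases: List[Phase] = []
    start = None      # start index of the currently open phase, if any
    last_high = -1    # most recent high-entropy index inside the open phase
    low_run = 0       # length of the current run of low-entropy tokens

    def close() -> None:
        end = last_high + 1
        phases.append(Phase(start, end, sum(entropies[start:end])))

    for i, h in enumerate(entropies):
        if h >= threshold:
            if start is None:
                start = i
            last_high = i
            low_run = 0
        elif start is not None:
            low_run += 1
            if low_run >= patience:
                close()
                start, low_run = None, 0
    if start is not None:  # trajectory ended while a phase was still open
        close()
    return phases


def entropy_centroid(entropies: Sequence[float],
                     threshold: float = 1.0,
                     patience: int = 3) -> float:
    """Mass-weighted mean of phase midpoints, normalized by response length."""
    n = len(entropies)
    phases = find_heps(entropies, threshold, patience)
    total = sum(p.mass for p in phases)
    if n == 0 or total == 0.0:
        return 0.0  # assumption: a response with no phase gets the lowest value
    weighted = sum(p.mass * (p.start + p.end - 1) / 2 for p in phases)
    return weighted / total / n


def select_lowest_centroid(candidates: Sequence[Sequence[float]],
                           threshold: float = 1.0,
                           patience: int = 3) -> int:
    """Return the index of the candidate with the lowest entropy centroid."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [entropy_centroid(c, threshold, patience) for c in candidates]
    return min(range(len(scores)), key=scores.__getitem__)
```

Each candidate here is a list of per-token entropies for one sampled response. A response whose uncertainty is concentrated near the start gets a centroid near 0, one whose uncertainty appears late gets a centroid near 1, and `select_lowest_centroid` returns the index of the former.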

Related papers