Fugu-MT Paper Translation (Summary): HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

Paper summary: HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

arxiv url: http://arxiv.org/abs/2605.08283v1
Date: Fri, 08 May 2026 07:38:35 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.528797
Title: HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
Authors: Xincheng Yao, Ruoqi Li, Cheng Chen, Daoxin Zhang, Yi Wu, Yao Hu, Chongyang Zhang
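Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key technique for enhancing the reasoning capabilities of Large Language Models (LLMs). In Chain-of-Thought (CoT) reasoning, different tokens usually play distinct roles, yet current RL algorithms lack an effective mechanism to dynamically balance the exploration-exploitation trade-off during learning. This paper proposes HTPO (Hierarchical Token-level Objective Control Policy Optimization).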
Reference score (independently computed attention score): 26.21217251968049
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a pivotal technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, the de facto practice of mainstream RL algorithms is to treat all tokens of one response equally and assign the same optimization objective to each token, failing to provide granular guidance for the reasoning process. In Chain-of-Thought (CoT) reasoning, by contrast, different tokens usually play distinct roles. As a result, current RL algorithms lack an effective mechanism to dynamically balance the exploration-exploitation trade-off during learning. To this end, we propose Hierarchical Token-level Objective Control Policy Optimization (HTPO), a novel RL algorithm that takes a divide-and-conquer approach, hierarchically partitioning the response tokens into specific functional groups along three aspects (i.e., prompt difficulty, answer correctness, and token entropy). Within each group, we design specialized optimization objectives according to the tokens' contributions to exploration or exploitation, so that each token can carry out its expected function effectively. In this way, HTPO can achieve a more balanced exploration-exploitation trade-off. Extensive experiments on challenging reasoning benchmarks validate the superiority of our HTPO algorithm, which significantly outperforms the strong DAPO baseline (e.g., +8.6% and +6.7% on AIME'24 and AIME'25, respectively). When scaling test-time compute, the HTPO-trained model maintains a consistent performance advantage over the DAPO baseline, and the gap widens as the sampling budget increases, validating that our adaptive token-level control method fosters effective exploration without sacrificing exploitation performance. Code will be available at https://github.com/xcyao00/HTPO.
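The three-way partition described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the abstract gives no thresholds, group definitions, or objective formulas, so the names `token_entropy`, `TokenGroup`, and `partition_tokens`, the use of a per-prompt pass rate as the difficulty signal, and the default values `difficulty_threshold=0.5` and `entropy_quantile=0.8` are all hypothetical choices made only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


def token_entropy(logits: List[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


@dataclass(frozen=True)
class TokenGroup:
    """Hypothetical group label built from the three aspects named in the abstract."""
    hard_prompt: bool    # prompt difficulty
    correct: bool        # answer correctness
    high_entropy: bool   # token entropy


def partition_tokens(
    pass_rate: float,
    is_correct: bool,
    entropies: List[float],
    difficulty_threshold: float = 0.5,  # assumption, not from the paper
    entropy_quantile: float = 0.8,      # assumption, not from the paper
) -> List[TokenGroup]:
    """Assign each token of one response to a (difficulty, correctness, entropy) group."""
    if not entropies:
        return []
    # Assumed difficulty signal: a low pass rate over sampled responses means a hard prompt.
    hard = pass_rate < difficulty_threshold
    # Tokens at or above the chosen entropy quantile count as high-entropy.
    ranked = sorted(entropies)
    cutoff = ranked[min(len(ranked) - 1, int(entropy_quantile * len(ranked)))]
    return [TokenGroup(hard, is_correct, h >= cutoff) for h in entropies]


if __name__ == "__main__":
    for group in partition_tokens(0.25, True, [0.1, 0.2, 0.3, 0.4, 2.0]):
        print(group)
```

In a training loop, each resulting group would then receive its own optimization objective; which objective goes with which group is specified in the paper itself and is not reproduced here.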


Related papers