Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL
- URL: http://arxiv.org/abs/2602.13035v1
- Date: Fri, 13 Feb 2026 15:42:59 GMT
- Title: Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL
- Authors: Yixiao Zhou, Yang Li, Dongzhou Cheng, Hehe Fan, Yu Cheng
- Abstract summary: We propose a hierarchical reinforcement learning framework that learns to control sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. Temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme.
- Score: 30.357975264905978
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) trains large language models (LLMs) from sampled trajectories, making decoding strategy a core component of learning rather than a purely inference-time choice. Sampling temperature directly controls the exploration-exploitation trade-off by modulating policy entropy, yet existing methods rely on static values or heuristic adaptations that are decoupled from task-level rewards. We propose Introspective LLM, a hierarchical reinforcement learning framework that learns to control sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. Temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme. Experiments on mathematical reasoning benchmarks show that learned temperature policies outperform fixed and heuristic baselines, while exhibiting interpretable exploration behaviors aligned with reasoning uncertainty.
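The per-step mechanism is easy to picture in code. Below is a minimal sketch, assuming a hypothetical `TemperatureHead` module and illustrative temperature bounds; the paper's actual architecture, ranges, and training details are not reproduced here.

```python
import torch
import torch.nn as nn

class TemperatureHead(nn.Module):
    """Maps a decoder hidden state to a temperature in [t_min, t_max]."""
    def __init__(self, hidden_dim: int, t_min: float = 0.3, t_max: float = 1.5):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 1)
        self.t_min, self.t_max = t_min, t_max

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.proj(hidden))          # squash to (0, 1)
        return self.t_min + (self.t_max - self.t_min) * gate

def sample_step(logits, hidden, temp_head):
    """One decoding step: choose a temperature from the hidden state,
    then sample the next token from the tempered distribution."""
    tau = temp_head(hidden)                              # (batch, 1)
    probs = torch.softmax(logits / tau, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Toy usage with random tensors standing in for a real LLM's outputs.
head = TemperatureHead(hidden_dim=64)
next_tokens = sample_step(torch.randn(2, 1000), torch.randn(2, 64), head)
```

Under the paper's coordinate ascent scheme, this temperature policy and the token policy would then be updated alternately from the downstream verifiable reward.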
Related papers
- Temperature as a Meta-Policy: Adaptive Temperature in LLM Reinforcement Learning [47.83947232413507]
Temperature controls the trade-off between exploration and exploitation in large language models (LLMs). High temperatures encourage diverse but noisy outputs, while low temperatures produce focused outputs but may cause premature convergence. We propose Temperature Adaptive Meta Policy Optimization (TAMPO), a new framework that recasts temperature control as a learnable meta-policy.
arXiv Detail & Related papers (2026-02-12T09:59:58Z)
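As a rough illustration of recasting temperature as a learnable meta-policy (this is a bandit-style toy, not TAMPO's actual objective; the temperature grid, reward shape, and learning rate are invented):

```python
import torch

temps = torch.tensor([0.4, 0.7, 1.0, 1.3])            # candidate temperatures
logits = torch.zeros(len(temps), requires_grad=True)   # meta-policy parameters
opt = torch.optim.Adam([logits], lr=0.1)

def rollout_reward(temperature: float) -> float:
    # Stand-in for "generate a rollout at this temperature and score it
    # with the task reward"; this toy reward is peaked near 0.7.
    return -(temperature - 0.7) ** 2

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    idx = dist.sample()
    reward = rollout_reward(temps[idx].item())
    loss = -dist.log_prob(idx) * reward                 # REINFORCE update
    opt.zero_grad(); loss.backward(); opt.step()

print("preferred temperature:", temps[logits.argmax()].item())
```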
- Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs [49.66344956133349]
Reasoning capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a latent variable for strategic contextualization.
arXiv Detail & Related papers (2025-12-19T03:32:53Z)
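The abstract gives few mechanics, but one standard way to realize such a latent variable, offered here only as a hedged guess at the flavor of the method, is to project a sampled latent code into soft prefix embeddings that condition generation:

```python
import torch
import torch.nn as nn

class LatentPrefix(nn.Module):
    """Projects a latent code into soft prefix embeddings (hypothetical)."""
    def __init__(self, latent_dim: int = 16, prefix_len: int = 4, model_dim: int = 64):
        super().__init__()
        self.prefix_len, self.model_dim = prefix_len, model_dim
        self.proj = nn.Linear(latent_dim, prefix_len * model_dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.proj(z).view(-1, self.prefix_len, self.model_dim)

z = torch.randn(2, 16)                    # one latent "palette" per sequence
token_emb = torch.randn(2, 10, 64)        # token embeddings from a toy model
inputs = torch.cat([LatentPrefix()(z), token_emb], dim=1)  # prepend prefix
```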
- Policy Gradient-Based EMT-in-the-Loop Learning to Mitigate Sub-Synchronous Control Interactions [0.2609784101826761]
This paper explores the development of learning-based control gains to address sub-synchronous oscillations. We employ a learning-based framework that considers the grid conditions responsible for such sub-synchronous oscillations. Our experiments in a real-world event setting demonstrate that a policy trained with deep policy gradients can adaptively compute gain settings.
arXiv Detail & Related papers (2025-11-08T03:12:29Z)
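A toy stand-in for the idea of learning control gains with a policy gradient (the paper runs EMT-in-the-loop grid simulations; the oscillator, reward, and hyperparameters here are all invented):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, lr, baseline = 0.1, 0.2, 0.05, 0.0   # Gaussian policy over gain k

def oscillation_energy(k: float) -> float:
    x, v = 1.0, 0.0                              # initial disturbance
    for _ in range(200):                         # explicit-Euler toy sim
        a = -4.0 * x - k * v                     # spring + tunable damping
        v += 0.01 * a
        x += 0.01 * v
    return x * x + v * v                         # residual energy

for _ in range(500):
    k = mu + sigma * rng.standard_normal()
    reward = -oscillation_energy(abs(k))
    baseline += 0.1 * (reward - baseline)        # running-average baseline
    mu += lr * (reward - baseline) * (k - mu) / sigma**2  # REINFORCE on mean
print("learned damping gain:", round(mu, 2))
```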
- LLM-Oriented Token-Adaptive Knowledge Distillation [64.08412563818662]
We propose a novel framework that adapts the distillation process to the real-time learning state of each token. AdaKD consists of two synergistic modules driven by a unified token difficulty metric. As a plug-and-play framework, AdaKD can consistently improve the performance of various distillation methods on multiple model architectures and benchmarks.
arXiv Detail & Related papers (2025-10-13T16:55:07Z)
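AdaKD's two modules are not reproduced here, but a token-difficulty-weighted distillation loss conveys the gist; the difficulty metric (the student's own per-token cross-entropy) and the softmax weighting are assumptions made for this sketch:

```python
import torch
import torch.nn.functional as F

def token_adaptive_kd(student_logits, teacher_logits, targets, tau=2.0):
    # Per-token difficulty: the student's cross-entropy on the labels.
    ce = F.cross_entropy(student_logits.transpose(1, 2), targets,
                         reduction="none")                 # (batch, seq)
    weights = torch.softmax(ce.detach(), dim=-1)           # hard tokens weigh more
    # Temperature-scaled KL between teacher and student, per token.
    kl = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="none").sum(-1)                # (batch, seq)
    return (weights * kl).sum(-1).mean() * tau * tau

student = torch.randn(2, 8, 100, requires_grad=True)
teacher = torch.randn(2, 8, 100)
labels = torch.randint(0, 100, (2, 8))
token_adaptive_kd(student, teacher, labels).backward()
```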
- Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning [29.277754405630205]
Reinforcement learning with verifiable rewards (RLVR) is a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Standard fixed-temperature sampling is simple, but it struggles to balance the competing demands of exploration and exploitation, as high temperatures degrade sample quality and low temperatures limit discovery. We propose a simpler and more effective strategy, Exploratory Annealed Decoding (EAD), grounded in the insight that exploration is most impactful on early tokens.
arXiv Detail & Related papers (2025-10-06T18:15:43Z)
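The annealing idea is simple to sketch: sample hot for early tokens, then cool down. The linear schedule and endpoint temperatures below are illustrative, not EAD's published settings:

```python
import torch

def annealed_temperature(step, max_steps, t_start=1.4, t_end=0.6):
    frac = min(step / max_steps, 1.0)
    return t_start + (t_end - t_start) * frac   # linear anneal, hot to cold

def decode(logits_fn, max_steps=64):
    tokens = []
    for step in range(max_steps):
        logits = logits_fn(tokens)              # next-token logits
        tau = annealed_temperature(step, max_steps)
        probs = torch.softmax(logits / tau, dim=-1)
        tokens.append(torch.multinomial(probs, 1).item())
    return tokens

# Toy stand-in for a language model's next-token logits.
sample = decode(lambda toks: torch.randn(100))
```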
- One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient [16.05489579792086]
We introduce one-token rollout (OTR), a novel fine-tuning algorithm that guides SFT with the policy gradient method. OTR reframes the autoregressive learning process by treating each token generation as a single-step reinforcement learning trajectory. Our findings establish OTR as a powerful and practical alternative for fine-tuning LLMs.
arXiv Detail & Related papers (2025-09-30T14:25:56Z)
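A hedged sketch of the one-token-rollout idea, treating each position as a single-step RL problem with a simple match/mismatch reward (OTR's actual reward and weighting may differ):

```python
import torch

def otr_style_loss(logits, labels):
    dist = torch.distributions.Categorical(logits=logits)
    sampled = dist.sample()                          # one-token rollouts
    reward = (sampled == labels).float() * 2 - 1     # +1 match, -1 miss
    return -(reward * dist.log_prob(sampled)).mean()

logits = torch.randn(2, 8, 100, requires_grad=True)  # (batch, seq, vocab)
labels = torch.randint(0, 100, (2, 8))               # SFT ground truth
otr_style_loss(logits, labels).backward()
```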
- Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling [70.8832906871441]
We study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity.
arXiv Detail & Related papers (2025-07-11T08:00:47Z)
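The core move of particle Gibbs is a conditional SMC sweep in which one particle stays pinned to the current reference trajectory. The toy below runs that sweep on a scalar random-walk stand-in rather than a diffusion LM's denoising trajectories:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 10, 8                                   # trajectory length, particles

def propose(x):                                # toy transition kernel
    return x + rng.standard_normal(x.shape)

def potential(x):                              # toy reward: prefer x near 3
    return np.exp(-0.5 * (x - 3.0) ** 2)

def csmc_sweep(reference):
    particles = np.zeros((N, T))
    particles[0] = reference                   # reference particle is pinned
    particles[1:, 0] = propose(np.zeros(N - 1))
    for t in range(1, T):
        w = potential(particles[:, t - 1]); w /= w.sum()
        idx = rng.choice(N, size=N - 1, p=w)   # resample free ancestors
        particles[1:, :t] = particles[idx, :t]
        particles[1:, t] = propose(particles[idx, t - 1])
    w = potential(particles[:, -1]); w /= w.sum()
    return particles[rng.choice(N, p=w)]       # next reference trajectory

traj = np.zeros(T)
for _ in range(20):                            # outer Gibbs iterations
    traj = csmc_sweep(traj)
```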
- Fine-Tuning Language Models with Reward Learning on Policy [68.70065254564642]
Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) with human preferences.
Despite its popularity, (fixed) reward models may become inaccurate off-distribution as policy optimization continuously shifts the LLM's data distribution.
We propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution.
arXiv Detail & Related papers (2024-03-28T10:02:10Z)
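The outer loop is the clarifying part: periodically refresh the reward model on fresh policy samples with an unsupervised objective. The reconstruction loss below is a stub for RLP's actual representation-learning step, and every module here is a toy:

```python
import torch
import torch.nn as nn

dim = 32
reward_model = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
encoder = reward_model[:2]                     # shared representation layers
decoder = nn.Linear(64, dim)                   # stub auxiliary head
opt = torch.optim.Adam(list(reward_model.parameters())
                       + list(decoder.parameters()), lr=1e-3)

def policy_samples(n):                         # stand-in for fresh generations
    return torch.randn(n, dim)

for _ in range(100):   # in practice, interleaved with RLHF policy updates
    x = policy_samples(16)
    loss = ((decoder(encoder(x)) - x) ** 2).mean()   # on-distribution refresh
    opt.zero_grad(); loss.backward(); opt.step()
```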
- Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
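In spirit, ETPO optimizes token-level log-probabilities with an entropy bonus; the loss below shows that shape, with illustrative advantages and coefficient rather than ETPO's exact formulation:

```python
import torch

def entropy_regularized_loss(logits, actions, advantages, beta=0.01):
    dist = torch.distributions.Categorical(logits=logits)
    pg = -(advantages * dist.log_prob(actions)).mean()   # token-level PG
    return pg - beta * dist.entropy().mean()             # entropy bonus

logits = torch.randn(2, 8, 100, requires_grad=True)      # (batch, seq, vocab)
actions = torch.randint(0, 100, (2, 8))                  # generated tokens
advantages = torch.randn(2, 8)                           # per-token credit
entropy_regularized_loss(logits, actions, advantages).backward()
```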
- Data Assimilation in Chaotic Systems Using Deep Reinforcement Learning [0.5999777817331317]
Data assimilation plays a pivotal role in diverse applications, ranging from climate predictions and weather forecasts to trajectory planning for autonomous vehicles.
Recent advancements have seen the emergence of deep learning approaches in this domain, primarily within a supervised learning framework.
In this study, we introduce a novel DA strategy that utilizes reinforcement learning (RL) to apply state corrections using full or partial observations of the state variables.
arXiv Detail & Related papers (2024-01-01T06:53:36Z)
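A toy sketch of the assimilation loop on a scalar chaotic map; here a fixed gain applies the correction, where a trained RL policy would instead output it from the forecast and observation:

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x):                                  # chaotic logistic map
    return 3.9 * x * (1 - x)

truth, est, gain = 0.4, 0.6, 0.5
for t in range(50):
    truth, est = step(truth), step(est)       # nature run vs. forecast
    if t % 5 == 0:                            # partial, noisy observations
        obs = truth + 0.01 * rng.standard_normal()
        est += gain * (obs - est)             # corrective "action"
print("final error:", abs(truth - est))
```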
- Is Inverse Reinforcement Learning Harder than Standard Reinforcement Learning? A Theoretical Perspective [55.36819597141271]
Inverse Reinforcement Learning (IRL), the problem of learning reward functions from demonstrations of an expert policy, plays a critical role in developing intelligent systems.
This paper provides the first line of results on efficient IRL in vanilla offline and online settings, using polynomially many samples and polynomial runtime.
As an application, we show that the learned rewards can transfer to another target MDP with suitable guarantees.
arXiv Detail & Related papers (2023-11-29T00:09:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.