Temperature as a Meta-Policy: Adaptive Temperature in LLM Reinforcement Learning
- URL: http://arxiv.org/abs/2602.11779v1
- Date: Thu, 12 Feb 2026 09:59:58 GMT
- Title: Temperature as a Meta-Policy: Adaptive Temperature in LLM Reinforcement Learning
- Authors: Haoran Dang, Cuiling Lan, Hai Wan, Xibin Zhao, Yan Lu
- Abstract summary: Temperature controls the trade-off between exploration and exploitation in large language models (LLMs). High temperatures encourage diverse but noisy outputs, while low temperatures produce focused outputs but may cause premature convergence. We propose Temperature Adaptive Meta Policy Optimization (TAMPO), a new framework that recasts temperature control as a learnable meta-policy.
- Score: 47.83947232413507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temperature is a crucial hyperparameter in large language models (LLMs), controlling the trade-off between exploration and exploitation during text generation. High temperatures encourage diverse but noisy outputs, while low temperatures produce focused outputs but may cause premature convergence. Yet static or heuristic temperature schedules fail to adapt to the dynamic demands of reinforcement learning (RL) throughout training, often limiting policy improvement. We propose Temperature Adaptive Meta Policy Optimization (TAMPO), a new framework that recasts temperature control as a learnable meta-policy. TAMPO operates through a hierarchical two-loop process. In the inner loop, the LLM policy is updated (e.g., using GRPO) with trajectories sampled at the temperature selected by the meta-policy. In the outer loop, the meta-policy updates the distribution over candidate temperatures by rewarding those that maximize the likelihood of high-advantage trajectories. This trajectory-guided, reward-driven mechanism enables online adaptation without additional rollouts, directly aligning exploration with policy improvement. On five mathematical reasoning benchmarks, TAMPO outperforms baselines using fixed or heuristic temperatures, establishing temperature as an effective learnable meta-policy for adaptive exploration in LLM reinforcement learning. Accepted at ICLR 2026.
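To make the two-loop structure concrete, here is a minimal Python sketch of the idea under stated assumptions: the candidate temperature set, the REINFORCE-style meta-update with a running baseline, and the use of mean rollout reward as a proxy for the paper's trajectory-likelihood meta-reward are all illustrative choices, not TAMPO's actual implementation; `rollout` and `grpo_update` are toy stand-ins for the real sampling and GRPO steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Candidate temperatures and a learnable categorical meta-policy over them.
CANDIDATE_TEMPS = np.array([0.6, 0.8, 1.0, 1.2])   # assumed candidate set
temp_logits = np.zeros(len(CANDIDATE_TEMPS))
META_LR, baseline = 0.5, 0.0

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def rollout(temperature, n=8):
    # Toy stand-in for sampling trajectories from the LLM policy: reward
    # peaks near a latent "good" temperature (0.9 here, purely illustrative).
    return rng.normal(loc=-(temperature - 0.9) ** 2, scale=0.05, size=n)

def grpo_update(rewards):
    pass  # placeholder for the inner-loop GRPO update on the same rollouts

for step in range(300):
    # Inner loop: the meta-policy picks a temperature, trajectories are
    # sampled at it, and the LLM policy is updated on them.
    probs = softmax(temp_logits)
    k = rng.choice(len(CANDIDATE_TEMPS), p=probs)
    rewards = rollout(CANDIDATE_TEMPS[k])
    grpo_update(rewards)

    # Outer loop: REINFORCE-style update of the temperature distribution,
    # reusing the same rollouts (no extra sampling). Mean reward stands in
    # for the paper's "likelihood of high-advantage trajectories" signal.
    meta_reward = rewards.mean()
    baseline = 0.9 * baseline + 0.1 * meta_reward
    grad = -probs
    grad[k] += 1.0                      # d log p(k) / d logits
    temp_logits += META_LR * (meta_reward - baseline) * grad

print(softmax(temp_logits))  # mass should concentrate near 0.8 and 1.0
```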
Related papers
- Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL [30.357975264905978]
We propose a hierarchical reinforcement learning framework that learns to control sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. Temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme.
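A minimal sketch of the per-step mechanism this summary describes, assuming a small learned head over the decoder's hidden state; the head architecture, the bounded (0.1, 2.0) temperature range, and a continuous (rather than discrete) temperature output are my assumptions, and the joint RL training is omitted.

```python
import torch
import torch.nn as nn

class TemperatureHead(nn.Module):
    """Maps a decoder hidden state to a bounded sampling temperature."""
    def __init__(self, hidden_dim, t_min=0.1, t_max=2.0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden_dim, 64), nn.Tanh(), nn.Linear(64, 1))
        self.t_min, self.t_max = t_min, t_max

    def forward(self, h):
        # squash the head output into a bounded temperature range
        return self.t_min + (self.t_max - self.t_min) * torch.sigmoid(self.net(h))

def decode_step(logits, hidden, temp_head):
    tau = temp_head(hidden)                      # temperature from internal state
    probs = torch.softmax(logits / tau, dim=-1)  # reshape the token distribution
    return torch.multinomial(probs, 1), tau      # sample the next token

# toy usage with random logits and hidden state
head = TemperatureHead(hidden_dim=16)
token, tau = decode_step(torch.randn(1, 100), torch.randn(1, 16), head)
```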
arXiv Detail & Related papers (2026-02-13T15:42:59Z)
- Making Tunable Parameters State-Dependent in Weather and Climate Models with Reinforcement Learning [0.5131152350448099]
This study presents a framework that learns components of parametrisation schemes online. It evaluates the resulting RL-driven parameter updates across a hierarchy of idealised testbeds. Results highlight RL's ability to deliver skilful, state-dependent, and regime-aware parametrisations.
arXiv Detail & Related papers (2026-01-07T11:19:16Z)
- Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning [29.277754405630205]
Reinforcement learning with verifiable rewards (RLVR) is a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Standard fixed-temperature sampling is simple, but it struggles to balance the competing demands of exploration and exploitation, as high temperatures degrade sample quality and low temperatures limit discovery. We propose a simpler and more effective strategy, Exploratory Annealed Decoding (EAD), grounded in the insight that exploration is most impactful on early tokens.
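The annealing idea lends itself to a one-line schedule; the exponential form and endpoint values below are assumptions for illustration, not EAD's published schedule.

```python
import math

def annealed_temperature(token_idx, t_start=1.2, t_end=0.3, decay=0.05):
    # High temperature for early tokens (where exploration pays off),
    # decaying toward a low temperature as the sequence commits.
    return t_end + (t_start - t_end) * math.exp(-decay * token_idx)

# tokens 0, 10, 50, 200 sample at roughly 1.2, 0.85, 0.37, 0.30
print([round(annealed_temperature(i), 2) for i in (0, 10, 50, 200)])
```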
arXiv Detail & Related papers (2025-10-06T18:15:43Z)
- Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning [77.92320830700797]
Reinforcement Learning has played a central role in enabling the reasoning capabilities of Large Language Models. We propose a tractable computational framework that tracks and leverages curvature information during policy updates. The algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out.
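One plausible reading of the masking step, sketched below: score each sample's contribution to update instability and drop the worst offenders before averaging the loss. Using the per-sample gradient norm as the score is my simplification; the summary only says curvature information is tracked, without specifying the estimator.

```python
import torch

def masked_mean_loss(per_sample_losses, params, mask_frac=0.25):
    # Score each sample by its gradient norm (a crude instability proxy).
    scores = []
    for loss in per_sample_losses:
        grads = torch.autograd.grad(loss, params, retain_graph=True)
        scores.append(torch.cat([g.flatten() for g in grads]).norm())
    scores = torch.stack(scores)

    # Mask out the top mask_frac of samples, average the rest.
    k = int(len(scores) * mask_frac)
    cutoff = scores.topk(k).values.min() if k > 0 else float("inf")
    keep = scores < cutoff
    return torch.stack(per_sample_losses)[keep].mean()

# toy usage: 8 per-sample losses from a linear model
w = torch.randn(5, requires_grad=True)
x = torch.randn(8, 5)
losses = [(x[i] @ w).pow(2) for i in range(8)]
masked_mean_loss(losses, [w]).backward()
```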
arXiv Detail & Related papers (2025-10-01T12:29:32Z)
- TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs [67.55973229034319]
This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We show that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2025-09-22T17:30:15Z)
- Optimizing Temperature for Language Models with Multi-Sample Inference [47.14991144052361]
This paper addresses the challenge of automatically identifying the (near-)optimal temperature for different large language models. We provide a comprehensive analysis of temperature's role in performance optimization, considering variations in model architectures, datasets, task types, model sizes, and predictive accuracy. We propose a novel entropy-based metric for automated temperature optimization, which consistently outperforms fixed-temperature baselines.
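A hedged sketch of what an entropy-guided selection could look like: pick the candidate temperature whose average output entropy is closest to a target. The target-entropy criterion and the candidate grid are assumptions; the paper's actual metric may differ.

```python
import numpy as np

def entropy(logits, tau):
    # Shannon entropy of softmax(logits / tau), computed stably.
    z = logits / tau
    p = np.exp(z - z.max())
    p /= p.sum()
    return -(p * np.log(p + 1e-12)).sum()

def pick_temperature(logits_batch, candidates=(0.4, 0.7, 1.0, 1.3), target=2.0):
    avg = {t: np.mean([entropy(l, t) for l in logits_batch]) for t in candidates}
    return min(avg, key=lambda t: abs(avg[t] - target))

# toy usage on random logits over a 50-token vocabulary
rng = np.random.default_rng(0)
print(pick_temperature([rng.normal(size=50) for _ in range(4)]))
```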
arXiv Detail & Related papers (2025-02-07T19:35:25Z)
- Adaptive Decoding via Latent Preference Optimization [55.70602730588745]
We introduce Adaptive Decoding, a layer added to the model to select the sampling temperature dynamically at inference time.
Our method outperforms all fixed decoding temperatures across a range of tasks that require different temperatures.
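A discrete counterpart of the per-step head sketched earlier, as one plausible shape for such a layer: a linear scorer over the hidden state picks among candidate temperatures. The candidate set is assumed, and the latent-preference training is omitted.

```python
import torch
import torch.nn as nn

class AdaptiveTemperature(nn.Module):
    """Selects one of a few candidate temperatures from the hidden state."""
    def __init__(self, hidden_dim, candidates=(0.5, 0.8, 1.0, 1.3)):
        super().__init__()
        self.register_buffer("candidates", torch.tensor(candidates))
        self.scorer = nn.Linear(hidden_dim, len(candidates))

    def forward(self, h):
        # sample a discrete temperature choice from the scorer's logits
        choice = torch.distributions.Categorical(logits=self.scorer(h)).sample()
        return self.candidates[choice]

layer = AdaptiveTemperature(hidden_dim=16)
tau = layer(torch.randn(1, 16))  # temperature for the current decoding step
```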
arXiv Detail & Related papers (2024-11-14T18:31:39Z)
- Extremum-Seeking Action Selection for Accelerating Policy Optimization [18.162794442835413]
Reinforcement learning for control over continuous spaces typically uses high-entropy policies, such as Gaussian distributions, for local exploration and for estimating policy gradients to optimize performance.
We propose to improve action selection in this model-free RL setting by introducing additional adaptive control steps based on Extremum-Seeking Control (ESC).
Our methods can be easily added to standard policy optimization to improve learning efficiency, which we demonstrate in various control learning environments.
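Extremum-seeking in its classic form probes the objective with a small sinusoidal dither and demodulates the response to climb the gradient; the toy 1-D loop below illustrates that mechanism (the objective and all constants are my choices, not the paper's).

```python
import math

def reward(a):
    return -(a - 1.5) ** 2   # unknown-to-the-agent objective, optimum at 1.5

a_hat, AMP, OMEGA, GAIN, DT = 0.0, 0.2, 5.0, 2.0, 0.05
for k in range(2000):
    dither = AMP * math.sin(OMEGA * k * DT)
    r = reward(a_hat + dither)       # probe around the current estimate
    a_hat += GAIN * r * dither * DT  # r * dither demodulates the local slope
print(round(a_hat, 2))               # drifts to ~1.5 without explicit gradients
```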
arXiv Detail & Related papers (2024-04-02T02:39:17Z)
- Global Convergence of Policy Gradient for Linear-Quadratic Mean-Field Control/Game in Continuous Time [109.06623773924737]
We study the policy gradient method for linear-quadratic mean-field control and games.
We show that it converges to the optimal solution at a linear rate, which is verified by a synthetic simulation.
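The flavor of the result can be seen on a toy discrete-time scalar LQ problem (a simplification of the paper's continuous-time mean-field setting; all constants are arbitrary): plain gradient descent on the gain of the linear policy u = -k x drives the quadratic cost to its optimum.

```python
A, B, Q, R, X0, H = 0.9, 0.5, 1.0, 0.1, 1.0, 200  # toy scalar LQ problem

def cost(k):
    x, c = X0, 0.0
    for _ in range(H):
        u = -k * x                    # linear state-feedback policy
        c += Q * x * x + R * u * u
        x = A * x + B * u
    return c

k, eps, lr = 0.0, 1e-4, 0.05
for _ in range(200):
    grad = (cost(k + eps) - cost(k - eps)) / (2 * eps)  # finite-difference gradient
    k -= lr * grad
print(round(k, 3), round(cost(k), 3))  # settles near the optimal gain
```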
arXiv Detail & Related papers (2020-08-16T06:34:11Z)