ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism
- URL: http://arxiv.org/abs/2508.11356v2
- Date: Fri, 29 Aug 2025 08:04:20 GMT
- Title: ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism
- Authors: Jia Liu, ChangYi He, YingQiao Lin, MingMin Yang, FeiYang Shen, ShaoGuo Liu
- Abstract summary: We introduce an entropy-based mechanism to enhance the exploration-exploitation balance in test-time reinforcement learning. Compared with the baseline, our approach enables Llama3.1-8B to achieve a 68 percent relative improvement in the Pass@1 metric.
- Score: 10.913346263482786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in Large Language Models have yielded significant improvements in complex reasoning tasks such as mathematics and programming. However, these models remain heavily dependent on annotated data and exhibit limited adaptability in unsupervised scenarios. To address these limitations, test-time reinforcement learning (TTRL) has been proposed, which enables self-optimization by leveraging model-generated pseudo-labels. Despite its promise, TTRL faces several key challenges, including high inference costs due to parallel rollouts and early-stage estimation bias that fosters overconfidence, reducing output diversity and causing performance plateaus. To address these challenges, we introduce an entropy-based mechanism to enhance the exploration-exploitation balance in test-time reinforcement learning through two strategies: Entropy-fork Tree Majority Rollout (ETMR) and Entropy-based Advantage Reshaping (EAR). Compared with the baseline, our approach enables Llama3.1-8B to achieve a 68 percent relative improvement in the Pass@1 metric on the AIME 2024 benchmark, while consuming only 60 percent of the rollout token budget. This highlights our method's ability to effectively optimize the trade-off between inference efficiency, diversity, and estimation robustness, thereby advancing unsupervised reinforcement learning for open-domain reasoning tasks.
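The abstract describes the two strategies only at a high level, and the paper's code is not reproduced here. As a rough illustration of the underlying ideas, the sketch below combines a TTRL-style majority-vote pseudo-reward with a hypothetical entropy bonus on GRPO-style group-normalized advantages. All function names and the `beta`/`top_k` hyperparameters are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter

def majority_vote_reward(answers):
    """TTRL-style pseudo-labeling: the most common final answer among
    parallel rollouts serves as the pseudo-label; each rollout earns
    reward 1.0 if it matches that majority answer, else 0.0."""
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in answers]

def fork_points(step_entropies, top_k=2):
    """Hypothetical ETMR helper: rather than launching fully parallel
    rollouts, fork extra branches only at the top_k highest-entropy
    decoding steps, concentrating the token budget where the model
    is most uncertain."""
    ranked = sorted(range(len(step_entropies)),
                    key=lambda i: step_entropies[i], reverse=True)
    return sorted(ranked[:top_k])

def reshape_advantages(rewards, rollout_entropies, beta=0.1):
    """Hypothetical EAR: start from a GRPO-style group-normalized
    advantage, then add an entropy bonus so exploratory (high-entropy)
    rollouts are penalized less for disagreeing with the early, and
    possibly biased, majority pseudo-label. beta is an assumed knob."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0
    return [(r - mean) / std + beta * h
            for r, h in zip(rewards, rollout_entropies)]

# Toy usage: four rollouts, two agreeing on the answer "42".
rewards = majority_vote_reward(["42", "42", "17", "A"])
advs = reshape_advantages(rewards, rollout_entropies=[0.2, 0.3, 1.1, 0.9])
print(rewards, [round(a, 2) for a in advs])
```

The intent mirrors the abstract's framing: the ETMR-like step spends the rollout budget at high-uncertainty forks instead of on fully parallel rollouts, while the EAR-like step keeps high-entropy rollouts from being suppressed by an early, possibly wrong, majority pseudo-label.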
Related papers
- Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability [129.1296673737603]
Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution. We propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity.
arXiv Detail & Related papers (2026-02-02T18:54:54Z) - Revisiting Entropy in Reinforcement Learning for Large Reasoning Models [54.96908589622163]
We investigate the entropy dynamics of large language models trained with Reinforcement Learning with Verifiable Rewards (RLVR). Our findings reveal that the number of off-policy updates, the diversity of training data, and the clipping thresholds in the optimization objective are critical factors influencing the entropy of LLMs trained with RLVR.
arXiv Detail & Related papers (2025-11-08T12:50:41Z) - DARO: Difficulty-Aware Reweighting Policy Optimization [18.07946696398167]
Group Relative Policy Optimization (GRPO) has emerged as the de facto approach for Reinforcement Learning with Verifiable Rewards (RLVR). We provide a unified view, demonstrating that existing methods' reliance on static or overly simplistic weighting schemes tied to sample difficulty prevents adaptation to a model's evolving capabilities. We introduce Difficulty-Aware Reweighting Policy Optimization (DARO), a method that dynamically adjusts the loss contribution of each difficulty group based on the model's learning state.
arXiv Detail & Related papers (2025-10-10T04:57:15Z) - The Path of Self-Evolving Large Language Models: Achieving Data-Efficient Learning via Intrinsic Feedback [51.144727949988436]
Reinforcement learning (RL) has demonstrated potential to enhance the reasoning capabilities of large language models (LLMs). In this work, we explore improving LLMs through RL with minimal data. To minimize data dependency, we introduce two novel mechanisms grounded in self-awareness.
arXiv Detail & Related papers (2025-10-03T06:32:10Z) - CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs [53.749193998004166]
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models. We propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead.
arXiv Detail & Related papers (2025-10-01T15:41:27Z) - RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [86.30192066451256]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z) - Hierarchical Budget Policy Optimization for Adaptive Reasoning [49.621779447691665]
We present Hierarchical Budget Policy Optimization (HBPO), a reinforcement learning framework that enables models to learn problem-specific reasoning depths without sacrificing capability. HBPO partitions the exploration space into budget-constrained hierarchies (512-2560 tokens), each with differentiated reward structures that preserve both efficiency incentives and reasoning capabilities. Extensive experiments demonstrate that HBPO reduces average token usage by up to 60.6% while improving accuracy by 3.14% across four reasoning benchmarks.
arXiv Detail & Related papers (2025-07-21T17:52:34Z) - LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization [48.91511514636768]
Length-Adaptive Policy Optimization (LAPO) transforms reasoning length control from an external constraint into an intrinsic model capability. LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9% while improving accuracy by 2.3%.
arXiv Detail & Related papers (2025-07-21T16:14:41Z) - SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning [20.442971494407896]
Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. We propose Supervised Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms. Extensive experiments show that SRFT achieves 59.1% average accuracy, outperforming zero-RL methods by 9.0% on five mathematical reasoning benchmarks and 10.9% on three out-of-distribution benchmarks.
arXiv Detail & Related papers (2025-06-24T16:31:37Z) - Exploring and Exploiting the Inherent Efficiency within Large Reasoning Models for Self-Guided Efficiency Enhancement [101.77467538102924]
Large reasoning models (LRMs) exhibit overthinking, which hinders efficiency and inflates inference cost. We propose two lightweight methods to enhance LRM efficiency. First, we introduce Efficiency Steering, a training-free activation steering technique that modulates reasoning behavior via a single direction. Second, we develop Self-Rewarded Efficiency RL, a reinforcement learning framework that dynamically balances task accuracy and brevity.
arXiv Detail & Related papers (2025-06-18T17:18:12Z) - Navigate the Unknown: Enhancing LLM Reasoning with Intrinsic Motivation Guided Exploration [39.460202867967006]
We propose an Intrinsic Motivation guidEd exploratioN meThOd foR LLM Reasoning (i-MENTOR) to deliver dense rewards and amplify exploration in the RL-based paradigm. Experiments across 4 public datasets demonstrate i-MENTOR's effectiveness, achieving a 22.23% improvement on AIME 2024.
arXiv Detail & Related papers (2025-05-23T08:30:28Z) - The Role of Deductive and Inductive Reasoning in Large Language Models [35.43513487137371]
We propose the Deductive and InDuctive (DID) method to enhance the reasoning of Large Language Models (LLMs). DID implements a dual-metric complexity evaluation system that combines Littlestone dimension and information entropy. Our results demonstrate significant improvements in reasoning quality and solution accuracy.
arXiv Detail & Related papers (2024-10-03T18:30:47Z) - Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation [16.350747493026432]
The Chain-of-Thought (CoT) paradigm has emerged as a critical approach for enhancing the reasoning capabilities of large language models (LLMs).
We propose Strategic Chain-of-Thought (SCoT) to refine LLM performance by integrating strategic knowledge prior to generating intermediate reasoning steps.
SCoT employs a two-stage approach within a single prompt: first eliciting an effective problem-solving strategy, which is then used to guide the generation of high-quality CoT paths and final answers.
arXiv Detail & Related papers (2024-09-05T06:28:05Z) - Variational Delayed Policy Optimization [25.668512485348952]
In environments with delayed observation, state augmentation by including actions within the delay window is adopted to recover the Markov property and enable reinforcement learning (RL).
State-of-the-art (SOTA) RL techniques with Temporal-Difference (TD) learning frameworks often suffer from learning inefficiency, due to the significant expansion of the augmented state space with the delay.
This work introduces a novel framework called Variational Delayed Policy Optimization (VDPO), which reformulates delayed RL as a variational inference problem.
arXiv Detail & Related papers (2024-05-23T06:57:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.