Entropy-guided sequence weighting for efficient exploration in RL-based LLM fine-tuning
- URL: http://arxiv.org/abs/2503.22456v2
- Date: Mon, 31 Mar 2025 10:13:48 GMT
- Title: Entropy-guided sequence weighting for efficient exploration in RL-based LLM fine-tuning
- Authors: Abdullah Vanlioglu
- Abstract summary: Entropy-Guided Sequence Weighting (EGSW) is a novel approach that enhances the exploration-exploitation tradeoff. EGSW integrates entropy regularization with advantage-based weighting to balance policy updates.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Entropy-Guided Sequence Weighting (EGSW), a novel approach that enhances the exploration-exploitation tradeoff by dynamically assigning weights to generated outputs based on their advantage and entropy for Reinforcement Learning-based Large Language Model fine-tuning. EGSW integrates entropy regularization with advantage-based weighting to balance policy updates, enabling efficient exploration in high-dimensional state spaces. By employing temperature-scaled softmax weighting over sequences, EGSW prioritizes high-reward, high-uncertainty steps while maintaining training stability. Although originally developed to improve Group Relative Policy Optimization (GRPO) during large language model (LLM) fine-tuning, EGSW is generalizable to other reinforcement learning (RL) algorithms and can be implemented in both step-wise and trajectory-wise settings. Empirical evaluations demonstrate that EGSW enhances GRPO reasoning ability, yielding improvements in sample efficiency. Future work will explore the application of EGSW to advanced RL methodologies.
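The temperature-scaled softmax weighting described in the abstract can be illustrated with a minimal sketch. The exact way EGSW combines advantage and entropy is not specified here, so the additive combination, the coefficient alpha, and the function name below are assumptions for illustration, not the paper's published formulation.

```python
import numpy as np

def egsw_weights(advantages, entropies, alpha=0.1, temperature=1.0):
    """Entropy-guided sequence weights (illustrative sketch).

    advantages : per-sequence advantage estimates, shape (N,)
    entropies  : mean token-level entropy of each generated sequence, shape (N,)
    alpha      : assumed coefficient trading off entropy against advantage
    temperature: softmax temperature controlling how peaked the weights are
    """
    scores = advantages + alpha * entropies          # favor high-reward, high-uncertainty sequences
    scores = scores / temperature                    # temperature scaling
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over the sampled group
    return weights

# Example: in a GRPO-style update, these weights would rescale each sampled
# sequence's contribution to the policy-gradient loss.
adv = np.array([0.8, -0.2, 0.5, 0.1])
ent = np.array([1.2, 2.0, 0.7, 1.5])
print(egsw_weights(adv, ent, alpha=0.1, temperature=0.5))
```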
Related papers
- Evolutionary Policy Optimization [47.30139909878251]
Current on-policy methods fail to fully leverage the benefits of parallelized environments.
EPO is a novel policy gradient algorithm that combines the strengths of evolutionary algorithms (EA) and policy gradients.
We show that EPO significantly improves performance across diverse and challenging environments.
arXiv Detail & Related papers (2025-03-24T18:08:54Z) - Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation [29.579349371114702]
Direct Preference Optimization (DPO) is a cost-effective alternative to reinforcement learning (RL) for large language models (LLMs).
We show that a single round of DPO with coarse filtering significantly enhances mathematical reasoning performance.
With simple verifiable rewards, our model achieves RL-level performance with significantly lower computational overhead.
arXiv Detail & Related papers (2025-03-17T06:28:25Z) - Adaptive Augmentation Policy Optimization with LLM Feedback [3.038642416291856]
Data augmentation is a critical component of deep learning pipelines, enhancing model generalization by increasing dataset diversity.
Traditional augmentation strategies rely on manually designed transformations, sampling, or automated search-based approaches.
We propose a Large Language Model (LLM)-guided augmentation optimization strategy that refines augmentation policies based on model performance feedback.
arXiv Detail & Related papers (2024-10-17T11:26:10Z) - TRAWL: Tensor Reduced and Approximated Weights for Large Language Models [11.064868044313855]
We introduce TRAWL (Tensor Reduced and Approximated Weights for Large Language Models), a technique that applies tensor decomposition across multiple weight matrices to effectively denoise LLMs by capturing global structural patterns. Our experiments show that TRAWL improves model performance by up to 16% over baseline models on benchmark datasets, without requiring additional data, training, or fine-tuning.
arXiv Detail & Related papers (2024-06-25T04:01:32Z) - Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z) - Let's reward step by step: Step-Level reward model as the Navigators for Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate a step-level reward dataset for coding tasks and observe similar performance improvements on code generation.
arXiv Detail & Related papers (2023-10-16T05:21:50Z) - Active RIS-aided EH-NOMA Networks: A Deep Reinforcement Learning Approach [66.53364438507208]
An active reconfigurable intelligent surface (RIS)-aided multi-user downlink communication system is investigated.
Non-orthogonal multiple access (NOMA) is employed to improve spectral efficiency, and the active RIS is powered by energy harvesting (EH).
An advanced LSTM-based algorithm is developed to predict users' dynamic communication state.
A DDPG-based algorithm is proposed to jointly control the amplification matrix and phase-shift matrix of the RIS.
arXiv Detail & Related papers (2023-04-11T13:16:28Z) - Latent Variable Representation for Reinforcement Learning [131.03944557979725]
It remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of model-based reinforcement learning.
We provide a representation view of the latent variable models for state-action value functions, which allows both a tractable variational learning algorithm and an effective implementation of the optimism/pessimism principle.
In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models.
arXiv Detail & Related papers (2022-12-17T00:26:31Z) - Progressive extension of reinforcement learning action dimension for asymmetric assembly tasks [7.4642148614421995]
In this paper, a progressive extension of action dimension (PEAD) mechanism is proposed to optimize the convergence of RL algorithms.
The results demonstrate that the PEAD method enhances the data- and time-efficiency of RL algorithms and increases the stable reward.
arXiv Detail & Related papers (2021-04-06T11:48:54Z) - Weighted Entropy Modification for Soft Actor-Critic [95.37322316673617]
We generalize the existing principle of maximum Shannon entropy in reinforcement learning (RL) to a weighted entropy by characterizing state-action pairs with qualitative weights.
We propose an algorithm motivated by self-balancing exploration with the introduced weight function, which leads to state-of-the-art performance on MuJoCo tasks despite its simple implementation (a minimal sketch of the weighted-entropy idea follows this list).
arXiv Detail & Related papers (2020-11-18T04:36:03Z)
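To make the weighted-entropy idea from the last entry concrete, here is a minimal sketch of how a state-action weight function modifies the Shannon entropy bonus used by maximum-entropy methods such as Soft Actor-Critic. The specific weight values and the small stabilizing constant are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def weighted_entropy(pi, weights):
    """Weighted Shannon entropy H_w(pi) = -sum_a w(s, a) * pi(a|s) * log pi(a|s).

    pi      : action probabilities for one state, shape (A,)
    weights : qualitative per-action weights w(s, a), shape (A,)
    """
    return -np.sum(weights * pi * np.log(pi + 1e-12))

# With uniform weights this reduces to the standard Shannon entropy bonus of
# the maximum-entropy objective; non-uniform weights bias exploration toward
# the state-action pairs the weight function favors.
pi = np.array([0.7, 0.2, 0.1])
print(weighted_entropy(pi, np.ones(3)))                  # standard entropy
print(weighted_entropy(pi, np.array([0.5, 2.0, 2.0])))   # exploration-biased variant
```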