Diversity Actor-Critic: Sample-Aware Entropy Regularization for Sample-Efficient Exploration
- URL: http://arxiv.org/abs/2006.01419v2
- Date: Wed, 9 Jun 2021 03:05:50 GMT
- Title: Diversity Actor-Critic: Sample-Aware Entropy Regularization for Sample-Efficient Exploration
- Authors: Seungyul Han, Youngchul Sung
- Abstract summary: Exploiting the sample distribution obtainable from the replay buffer, the proposed sample-aware entropy regularization maximizes the entropy of the weighted sum of the policy action distribution and the sample action distribution from the replay buffer for sample-efficient exploration.
A practical algorithm named diversity actor-critic (DAC) is developed by applying policy iteration to the objective function with the proposed sample-aware entropy regularization.
- Score: 22.539300644593936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, sample-aware policy entropy regularization is proposed to
enhance the conventional policy entropy regularization for better exploration.
Exploiting the sample distribution obtainable from the replay buffer, the
proposed sample-aware entropy regularization maximizes the entropy of the
weighted sum of the policy action distribution and the sample action
distribution from the replay buffer for sample-efficient exploration. A
practical algorithm named diversity actor-critic (DAC) is developed by applying
policy iteration to the objective function with the proposed sample-aware
entropy regularization. Numerical results show that DAC significantly
outperforms existing recent reinforcement learning algorithms.
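To make the regularizer concrete, below is a minimal illustrative sketch (not the authors' implementation) of the sample-aware entropy for a discrete action space. Here `q` stands for an estimate of the action distribution induced by the replay buffer and `w` is an assumed fixed mixture weight; the paper's DAC algorithm obtains the required quantities within policy iteration rather than in this closed form.

```python
import numpy as np

def sample_aware_entropy(pi: np.ndarray, q: np.ndarray, w: float) -> float:
    """Entropy of the mixture w*pi + (1-w)*q over a discrete action set.

    pi -- policy action probabilities at a state, shape (num_actions,)
    q  -- estimated action distribution from the replay buffer, same shape
    w  -- mixture weight in [0, 1]; w = 1 recovers plain policy entropy
    """
    m = w * pi + (1.0 - w) * q
    m = np.clip(m, 1e-12, 1.0)           # numerical safety for the log
    return float(-(m * np.log(m)).sum())

# Toy usage: a peaked policy mixed with a buffer that favors other actions.
pi = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])            # actions already common in the buffer
print(sample_aware_entropy(pi, q, w=0.5))
```

Maximizing this mixture entropy rewards actions that are underrepresented in the buffer, which is the exploration effect the abstract describes.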
Related papers
- AEGPO: Adaptive Entropy-Guided Policy Optimization for Diffusion Models [54.56296715999545]
Reinforcement learning from human feedback shows promise for aligning diffusion and flow models.
Policy optimization methods such as GRPO suffer from inefficient and static sampling strategies.
We propose Adaptive Entropy-Guided Policy Optimization (AEGPO), a novel dual-signal, dual-level adaptive optimization strategy.
arXiv Detail & Related papers (2026-02-06T16:09:50Z)
- Boosting RL-Based Visual Reasoning with Selective Adversarial Entropy Intervention [29.17663968152743]
We propose entropy-guided adversarial sampling (EgAS) that formulates the entropy of sampled responses as an adversarial objective.
The corresponding adversarial gradient can be used to attack the visual input for producing adversarial samples, allowing the policy model to explore a larger answer space during RL sampling.
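As a rough, hypothetical illustration of treating response entropy as an adversarial objective on the visual input (the single gradient-sign step, the `eps` budget, and the classifier-style `model` interface are assumptions, not the paper's exact procedure):

```python
import torch

def entropy_adversarial_image(model, image: torch.Tensor, eps: float = 8 / 255):
    """One gradient-sign step that perturbs `image` to raise output entropy."""
    x = image.clone().requires_grad_(True)
    log_p = torch.log_softmax(model(x), dim=-1)      # (batch, num_outputs)
    entropy = -(log_p.exp() * log_p).sum(dim=-1).mean()
    entropy.backward()                               # d(entropy) / d(input)
    # Move the input in the direction that increases entropy, then re-clip.
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()
```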
arXiv Detail & Related papers (2025-12-11T08:27:02Z)
- Reinforced sequential Monte Carlo for amortised sampling [49.92678178064033]
We state a connection between sequential Monte Carlo (SMC) and neural sequential samplers trained by maximum-entropy reinforcement learning (MaxEnt RL).
We describe techniques for stable joint training of proposals and twist functions and an adaptive weight tempering scheme to reduce training signal variance.
arXiv Detail & Related papers (2025-10-13T17:59:11Z)
- G$^2$RPO: Granular GRPO for Precise Reward in Flow Models [74.21206048155669]
We propose a novel Granular-GRPO (G$^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions.
We introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales.
Our G$^2$RPO significantly outperforms existing flow-based GRPO baselines.
arXiv Detail & Related papers (2025-10-02T12:57:12Z)
- Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift.
Current approaches typically address this issue through online sampling from the target policy.
We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
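For context, here is a minimal sketch of the importance-sampled policy gradient estimate such methods build on; the per-sample ratio form below is a simplification of the trajectory-level estimator, and all array arguments are placeholders:

```python
import numpy as np

def is_policy_gradient(grad_logp_target, logp_target, logp_behavior, returns):
    """Reweight off-policy returns by the ratio pi_target / pi_behavior.

    grad_logp_target -- per-sample grad log pi_theta(a|s), shape (N, dim)
    logp_target, logp_behavior, returns -- per-sample values, shape (N,)
    """
    w = np.exp(logp_target - logp_behavior)   # importance ratios
    return ((w * returns)[:, None] * grad_logp_target).mean(axis=0)
```

The variance of this estimate grows with the spread of the ratios `w`, which is why actively choosing the behavioral policy, as the paper proposes, can pay off.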
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Discrete Probabilistic Inference as Control in Multi-path Environments [84.67055173040107]
We consider the problem of sampling from a discrete and structured distribution as a sequential decision problem.
We show that GFlowNets learn a policy that samples objects proportionally to their reward by enforcing a conservation of flows.
We also prove that some flow-matching objectives found in the GFlowNet literature are in fact equivalent to well-established MaxEnt RL algorithms with a corrected reward.
arXiv Detail & Related papers (2024-02-15T20:20:35Z)
- Improved off-policy training of diffusion samplers [93.66433483772055]
We study the problem of training diffusion models to sample from a distribution with an unnormalized density or energy function.
We benchmark several diffusion-structured inference methods, including simulation-based variational approaches and off-policy methods.
Our results shed light on the relative advantages of existing algorithms while bringing into question some claims from past work.
arXiv Detail & Related papers (2024-02-07T18:51:49Z)
- Frugal Actor-Critic: Sample Efficient Off-Policy Deep Reinforcement Learning Using Unique Experiences [8.983448736644382]
Efficient utilization of the replay buffer plays a significant role in the off-policy actor-critic reinforcement learning (RL) algorithms.
We propose a method for achieving sample efficiency, which focuses on selecting unique samples and adding them to the replay buffer.
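One hypothetical way to realize "unique samples" is to hash a coarse discretization of each transition before insertion; the rounding resolution and hash-based test below are illustrative assumptions, not the paper's selection criterion:

```python
from collections import deque
import numpy as np

class UniqueReplayBuffer:
    """Replay buffer that skips near-duplicate (state, action) pairs.

    States and actions are assumed to be NumPy arrays. Simplification: keys
    are never pruned when old transitions are evicted from the deque.
    """
    def __init__(self, capacity: int, decimals: int = 3):
        self.buffer = deque(maxlen=capacity)
        self.seen = set()
        self.decimals = decimals

    def add(self, state, action, reward, next_state):
        key = (np.round(state, self.decimals).tobytes(),
               np.round(action, self.decimals).tobytes())
        if key not in self.seen:          # only store unseen experiences
            self.seen.add(key)
            self.buffer.append((state, action, reward, next_state))
```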
arXiv Detail & Related papers (2024-02-05T10:04:00Z)
- Sample Dropout: A Simple yet Effective Variance Reduction Technique in Deep Policy Optimization [18.627233013208834]
We show that the use of importance sampling could introduce high variance in the objective estimate.
We propose a technique called sample dropout to bound the estimation variance by dropping out samples when their ratio deviation is too high.
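A hedged sketch of the stated mechanism, dropping samples whose importance ratio deviates too far from 1 before averaging the surrogate loss (the deviation threshold is an assumed value, not taken from the paper):

```python
import numpy as np

def sample_dropout_loss(ratios, advantages, max_dev: float = 0.3):
    """Mask out samples with |ratio - 1| > max_dev, then average the rest."""
    keep = np.abs(ratios - 1.0) <= max_dev
    if not keep.any():
        return 0.0                        # every sample dropped this batch
    return float(-(ratios[keep] * advantages[keep]).mean())
```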
arXiv Detail & Related papers (2023-02-05T04:44:35Z)
- Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z)
- Optimal Off-Policy Evaluation from Multiple Logging Policies [77.62012545592233]
We study off-policy evaluation (OPE) from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling.
We find the OPE estimator for multiple loggers with minimum variance for any instance, i.e., the efficient one.
arXiv Detail & Related papers (2020-10-21T13:43:48Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Adaptive Experience Selection for Policy Gradient [8.37609145576126]
Experience replay is a commonly used approach to improve sample efficiency.
Gradient estimators using past trajectories typically have high variance.
Existing sampling strategies for experience replay like uniform sampling or prioritised experience replay do not explicitly try to control the variance of the gradient estimates.
We propose an online learning algorithm, adaptive experience selection (AES), to adaptively learn an experience sampling distribution that explicitly minimises this variance.
arXiv Detail & Related papers (2020-02-17T13:16:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.