An Entropy Regularization Free Mechanism for Policy-based Reinforcement
Learning
- URL: http://arxiv.org/abs/2106.00707v1
- Date: Tue, 1 Jun 2021 18:04:19 GMT
- Title: An Entropy Regularization Free Mechanism for Policy-based Reinforcement
Learning
- Authors: Changnan Xiao, Haosen Shi, Jiajun Fan, Shihong Deng
- Abstract summary: Policy-based reinforcement learning methods suffer from the policy collapse problem.
We propose an entropy regularization free mechanism that is designed for policy-based methods.
- Score: 1.4566990078034239
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy-based reinforcement learning methods suffer from the policy collapse
problem. We find that value-based reinforcement learning methods with the
ε-greedy mechanism exhibit three characteristics, Closed-form Diversity,
Objective-invariant Exploration and Adaptive Trade-off, which help value-based
methods avoid the policy collapse problem. However,
there does not exist a parallel mechanism for policy-based methods that
achieves all three characteristics. In this paper, we propose an entropy
regularization free mechanism that is designed for policy-based methods, which
achieves Closed-form Diversity, Objective-invariant Exploration and Adaptive
Trade-off. Our experiments show that our mechanism substantially improves the
sample efficiency of policy-based methods and boosts a policy-based baseline to
a new state of the art on the Arcade Learning Environment.
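The abstract contrasts ε-greedy exploration in value-based methods with the entropy-regularization workaround commonly used by policy-based methods. The minimal sketch below illustrates those two background ideas only; the function names are illustrative, and the paper's proposed mechanism is not spelled out in the abstract.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    """Value-based exploration: with probability epsilon pick a uniform
    random action, otherwise the greedy one.  The action distribution is
    available in closed form and does not change the learning objective."""
    rng = rng or np.random.default_rng()
    n = len(q_values)
    probs = np.full(n, epsilon / n)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return rng.choice(n, p=probs), probs

def entropy_regularized_pg_loss(policy_probs, action, advantage, beta=0.01):
    """Common policy-based workaround: subtract an entropy bonus (weighted
    by beta) from the policy-gradient loss at a single state.  Unlike
    epsilon-greedy, this modifies the optimization objective itself, which
    is the issue an entropy-regularization-free mechanism aims to avoid."""
    log_probs = np.log(policy_probs + 1e-8)
    entropy = -np.sum(policy_probs * log_probs)
    pg_loss = -log_probs[action] * advantage
    return pg_loss - beta * entropy

# Toy usage
q = np.array([0.2, 1.5, 0.3])
action, _ = epsilon_greedy(q, epsilon=0.1)
pi = np.exp(q) / np.exp(q).sum()   # softmax policy for the toy example
loss = entropy_regularized_pg_loss(pi, action=1, advantage=1.0)
```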
Related papers
- SelfBC: Self Behavior Cloning for Offline Reinforcement Learning [14.573290839055316]
We propose a novel dynamic policy constraint that restricts the learned policy on the samples generated by the exponential moving average of previously learned policies.
Our approach results in a nearly monotonically improved reference policy.
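The constraint above is built from an exponential moving average (EMA) of previously learned policies. A minimal sketch of such an EMA over policy parameters is shown below; the names are hypothetical and this is not the authors' code.

```python
import copy
import torch

@torch.no_grad()
def ema_update(reference_policy, current_policy, tau=0.005):
    """Exponential moving average of policy parameters: the reference
    policy slowly tracks the learned policy and supplies the samples that
    constrain the next update (illustrative sketch)."""
    for ref_p, cur_p in zip(reference_policy.parameters(),
                            current_policy.parameters()):
        ref_p.mul_(1.0 - tau).add_(cur_p, alpha=tau)

# Usage (hypothetical): the reference starts as a frozen copy of the policy,
#   reference = copy.deepcopy(policy)
# and after each gradient step on `policy`:
#   ema_update(reference, policy)
```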
arXiv Detail & Related papers (2024-08-04T23:23:48Z)
- POTEC: Off-Policy Learning for Large Action Spaces via Two-Stage Policy Decomposition [40.851324484481275]
We study off-policy learning of contextual bandit policies in large discrete action spaces.
We propose a novel two-stage algorithm, called Policy Optimization via Two-Stage Policy Decomposition.
We show that POTEC provides substantial improvements in OPL effectiveness particularly in large and structured action spaces.
arXiv Detail & Related papers (2024-02-09T03:01:13Z)
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
- Learning Control Policies for Variable Objectives from Offline Data [2.7174376960271154]
We introduce a conceptual extension for model-based policy search methods, called variable objective policy (VOP).
We demonstrate that by altering the objectives passed as input to the policy, users gain the freedom to adjust its behavior or re-balance optimization targets at runtime.
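One way to realize the idea of objectives passed as input to the policy is to condition the policy network on an objective vector; the sketch below is an assumption-laden illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VariableObjectivePolicy(nn.Module):
    """Illustrative sketch: the policy receives the objective weights as an
    extra input, so its behavior can be re-balanced at runtime by changing
    that input rather than retraining."""
    def __init__(self, state_dim, objective_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + objective_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, objective):
        # Concatenate state and objective weights along the last dimension.
        return self.net(torch.cat([state, objective], dim=-1))

# Usage (hypothetical dimensions): the same network, two different objectives.
policy = VariableObjectivePolicy(state_dim=4, objective_dim=2, action_dim=2)
s = torch.randn(1, 4)
a_fast = policy(s, torch.tensor([[1.0, 0.0]]))   # prioritize objective 1
a_safe = policy(s, torch.tensor([[0.2, 0.8]]))   # re-balance toward objective 2
```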
arXiv Detail & Related papers (2023-08-11T13:33:59Z)
- Policy Dispersion in Non-Markovian Environment [53.05904889617441]
This paper learns diverse policies from the history of state-action pairs in a non-Markovian environment.
We first adopt a transformer-based method to learn policy embeddings.
Then, we stack the policy embeddings to construct a dispersion matrix to induce a set of diverse policies.
arXiv Detail & Related papers (2023-02-28T11:58:39Z)
- A Regularized Implicit Policy for Offline Reinforcement Learning [54.7427227775581]
Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment.
We propose a framework that supports learning a flexible yet well-regularized fully-implicit policy.
Experiments and an ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.
arXiv Detail & Related papers (2022-02-19T20:22:04Z)
- MPC-based Reinforcement Learning for Economic Problems with Application to Battery Storage [0.0]
We focus on policy approximations based on Model Predictive Control (MPC).
We observe that the policy gradient method can struggle to produce meaningful steps in the policy parameters when the policy has a (nearly) bang-bang structure.
We propose a homotopy strategy based on the interior-point method, providing a relaxation of the policy during learning.
arXiv Detail & Related papers (2021-04-06T10:37:14Z)
- On Imitation Learning of Linear Control Policies: Enforcing Stability and Robustness Constraints via LMI Conditions [3.296303220677533]
We formulate the imitation learning of linear policies as a constrained optimization problem.
We show that one can guarantee the closed-loop stability and robustness by posing linear matrix inequality (LMI) constraints on the fitted policy.
arXiv Detail & Related papers (2021-03-24T02:43:03Z)
- State Augmented Constrained Reinforcement Learning: Overcoming the Limitations of Learning with Rewards [88.30521204048551]
A common formulation of constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds.
We show a simple example in which the desired optimal policy cannot be induced by any weighted linear combination of rewards.
This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods.
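A minimal sketch of the state-augmentation idea is given below: the policy conditions on the current Lagrange multipliers alongside the environment state, and the multipliers are updated by a primal-dual step. The function names and the exact update rule are assumptions for illustration, not the paper's construction.

```python
import numpy as np

def augment_state(state, lagrange_multipliers):
    """Augmented state: the policy sees the environment state together with
    the current Lagrange multipliers (illustrative sketch)."""
    return np.concatenate([state, lagrange_multipliers])

def dual_update(lagrange_multipliers, constraint_returns, thresholds, lr=0.01):
    """Primal-dual step (assumed form): raise a multiplier when its
    accumulated constraint reward falls short of its threshold, and keep
    multipliers non-negative."""
    violation = thresholds - constraint_returns
    return np.maximum(lagrange_multipliers + lr * violation, 0.0)

# Toy usage with two constraints.
s = np.array([0.3, -1.2])
lam = np.zeros(2)
s_aug = augment_state(s, lam)                       # fed to the policy
lam = dual_update(lam, constraint_returns=np.array([0.4, 1.0]),
                  thresholds=np.array([0.5, 0.8]))  # first constraint violated
```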
arXiv Detail & Related papers (2021-02-23T21:07:35Z)
- Evolutionary Stochastic Policy Distillation [139.54121001226451]
We propose a new method called Evolutionary Stochastic Policy Distillation (ESPD) to solve Goal-Conditioned Reward Sparse (GCRS) tasks.
ESPD enables a target policy to learn from a series of its variants through the technique of policy distillation (PD).
Experiments on the MuJoCo control suite show the high learning efficiency of the proposed method.
arXiv Detail & Related papers (2020-04-27T16:19:25Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
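For context, the snippet below shows the standard PPO-style proximity mechanism (a clipped probability ratio between consecutive policies); the cited paper instead regularizes a divergence between the discounted state-action visitation distributions of consecutive policies, which this sketch does not implement.

```python
import torch

def ppo_clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Baseline proximity idea: clip the probability ratio between the new
    and old policies so a single update cannot move the policy too far.
    Shown only as background for the visitation-divergence regularizer
    described above."""
    ratio = torch.exp(new_logp - old_logp)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Toy usage on a batch of 4 transitions.
new_logp = torch.tensor([-0.9, -1.1, -0.7, -1.3])
old_logp = torch.tensor([-1.0, -1.0, -1.0, -1.0])
adv = torch.tensor([0.5, -0.2, 1.0, 0.1])
loss = ppo_clipped_surrogate(new_logp, old_logp, adv)
```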
arXiv Detail & Related papers (2020-03-09T13:05:47Z)