Entropy Augmented Reinforcement Learning
- URL: http://arxiv.org/abs/2208.09322v1
- Date: Fri, 19 Aug 2022 13:09:32 GMT
- Title: Entropy Augmented Reinforcement Learning
- Authors: Jianfei Ma
- Abstract summary: We propose a shifted Markov decision process (MDP) to encourage exploration and reinforce the agent's ability to escape suboptima.
Our experiments evaluate entropy-augmented TRPO and PPO on MuJoCo benchmark tasks, indicating that the agent is encouraged towards higher-reward regions.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep reinforcement learning has gained a lot of success with the advent of
trust region policy optimization (TRPO) and proximal policy optimization (PPO),
owing to their scalability and efficiency. However, the pessimism of both
algorithms, which either constrains the policy update within a trust region or
strictly excludes all suspicious gradients, has been shown to suppress
exploration and harm the agent's performance. To address these issues, we
propose a shifted Markov decision process (MDP), that is, an entropy-augmented
MDP, to encourage exploration and reinforce the agent's ability to escape
suboptima. Our method is extensible and can be realized through either reward
shaping or bootstrapping. Our convergence analysis shows that controlling the
temperature coefficient is crucial; when it is tuned appropriately, the method,
being simple yet effective, achieves remarkable performance and also carries
over to other algorithms. Our experiments evaluate entropy-augmented TRPO and
PPO on MuJoCo benchmark tasks and indicate that the agent is encouraged towards
higher-reward regions while maintaining a balance between exploration and
exploitation. We verify the exploration bonus of our method on two grid-world
environments.
Related papers
- Efficient Reinforcement Learning via Decoupling Exploration and Utilization [6.305976803910899]
Reinforcement Learning (RL) has achieved remarkable success across multiple fields and applications, including gaming, robotics, and autonomous vehicles.
In this work, our aim is to train the agent efficiently by decoupling exploration and utilization, so that the agent can escape suboptimal solutions.
The above idea is implemented in the proposed OPARL (Optimistic and Pessimistic Actor Reinforcement Learning) algorithm.
arXiv Detail & Related papers (2023-12-26T09:03:23Z) - Mimicking Better by Matching the Approximate Action Distribution [48.95048003354255]
We introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations.
We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods.
arXiv Detail & Related papers (2023-06-16T12:43:47Z) - PACER: A Fully Push-forward-based Distributional Reinforcement Learning Algorithm [28.48626438603237]
PACER consists of a distributional critic, an actor and a sample-based encourager.
The push-forward operator is leveraged in both the critic and the actor to model return distributions and policies, respectively.
A sample-based utility value policy gradient is established for the push-forward policy update.
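A minimal sketch of the push-forward idea mentioned above, assuming an implicit policy that maps the state and base noise to an action; the architecture, noise dimension, and layer sizes are illustrative assumptions, not PACER's actual design.

```python
import torch
import torch.nn as nn

class PushForwardPolicy(nn.Module):
    """Illustrative push-forward (implicit) policy: actions are produced
    by pushing base noise through a state-conditioned network rather than
    by parameterizing an explicit density."""
    def __init__(self, obs_dim, act_dim, noise_dim=8, hidden=64):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs):
        eps = torch.randn(obs.shape[0], self.noise_dim)   # base noise
        return self.net(torch.cat([obs, eps], dim=-1))    # pushed-forward action
```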
arXiv Detail & Related papers (2023-06-11T09:45:31Z) - Local Optimization Achieves Global Optimality in Multi-Agent
Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z) - Efficient Exploration via Epistemic-Risk-Seeking Policy Optimization [8.867416300893577]
Exploration remains a key challenge in deep reinforcement learning (RL).
In this paper we propose a new, differentiable optimistic objective that when optimized yields a policy that provably explores efficiently.
Results show significant performance improvements even over other efficient exploration techniques.
arXiv Detail & Related papers (2023-02-18T14:13:25Z) - Robust Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
In continuous action domains, a parameterized action distribution allows easy control of exploration.
In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed distribution.
We evaluated our methods on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym.
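A minimal sketch of the perturbed-distribution idea, assuming the perturbation is uniform noise added to the mean of a Gaussian action distribution; the parameter alpha and the exact perturbation used by RPO are assumptions here.

```python
import torch
from torch.distributions import Normal

def perturbed_action_dist(mean, std, alpha=0.5):
    """Illustrative perturbed Gaussian policy: shift the mean by uniform
    noise in [-alpha, alpha] before sampling, so the policy retains some
    stochasticity even if std shrinks during training."""
    noise = (torch.rand_like(mean) * 2.0 - 1.0) * alpha   # U(-alpha, alpha)
    return Normal(mean + noise, std)

# usage sketch:
# dist = perturbed_action_dist(actor_mean(obs), actor_std(obs))
# action = dist.sample()
# log_prob = dist.log_prob(action).sum(-1)
```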
arXiv Detail & Related papers (2022-12-14T22:43:56Z) - Off-policy Reinforcement Learning with Optimistic Exploration and
Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
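One common way to form an approximate upper confidence bound over an ensemble of critics is sketched below: take the ensemble mean plus a multiple of its disagreement. The coefficient beta and this particular bound are illustrative assumptions, not necessarily the construction used in the paper.

```python
import torch

def approximate_q_upper_bound(q_ensemble, beta=1.0):
    """Illustrative optimistic value estimate: ensemble mean plus a
    beta-scaled measure of disagreement (standard deviation).

    q_ensemble : tensor of shape (n_critics, batch) with per-critic Q values
    """
    return q_ensemble.mean(dim=0) + beta * q_ensemble.std(dim=0)
```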
arXiv Detail & Related papers (2021-10-22T22:07:51Z) - On Reward-Free RL with Kernel and Neural Function Approximations:
Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - Proximal Policy Optimization via Enhanced Exploration Efficiency [6.2501569560329555]
The proximal policy optimization (PPO) algorithm is a deep reinforcement learning algorithm with outstanding performance.
This paper analyzes the assumptions behind the original Gaussian action exploration mechanism in PPO and clarifies the influence of exploration ability on performance.
We propose intrinsic exploration module (IEM-PPO) which can be used in complex environments.
arXiv Detail & Related papers (2020-11-11T03:03:32Z) - Implementation Matters in Deep Policy Gradients: A Case Study on PPO and
TRPO [90.90009491366273]
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms.
Specifically, we investigate the consequences of "code-level optimizations."
Our results show that these optimizations (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.
arXiv Detail & Related papers (2020-05-25T16:24:59Z)