Proximal Policy Optimization via Enhanced Exploration Efficiency
- URL: http://arxiv.org/abs/2011.05525v1
- Date: Wed, 11 Nov 2020 03:03:32 GMT
- Title: Proximal Policy Optimization via Enhanced Exploration Efficiency
- Authors: Junwei Zhang, Zhenghao Zhang, Shuai Han, Shuai Lü
- Abstract summary: The proximal policy optimization (PPO) algorithm is a deep reinforcement learning algorithm with outstanding performance.
This paper analyzes the assumption behind the original Gaussian action exploration mechanism in the PPO algorithm and clarifies the influence of exploration ability on performance.
We propose an intrinsic exploration module (IEM-PPO) that can be used in complex environments.
- Score: 6.2501569560329555
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The proximal policy optimization (PPO) algorithm is a deep reinforcement
learning algorithm with outstanding performance, especially in continuous control
tasks, but its performance is still limited by its exploration ability. For classical
reinforcement learning, there are schemes that make exploration fuller and better
balanced against data exploitation, but they cannot be applied in complex
environments because of their algorithmic complexity. Focusing on continuous control
tasks with dense rewards, this paper analyzes the assumption behind the original
Gaussian action exploration mechanism in the PPO algorithm and clarifies the
influence of exploration ability on performance. Aiming at the exploration problem,
we then design an exploration enhancement mechanism based on uncertainty estimation,
apply it to PPO, and propose the proximal policy optimization algorithm with an
intrinsic exploration module (IEM-PPO), which can be used in complex environments.
In the experiments, we evaluate our method on multiple tasks of the MuJoCo physics
simulator and compare IEM-PPO with the curiosity-driven exploration algorithm
(ICM-PPO) and the original algorithm (PPO). The experimental results demonstrate
that IEM-PPO requires longer training time but achieves better sample efficiency
and cumulative reward, and exhibits stability and robustness.
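The abstract does not describe how the intrinsic exploration module is computed, so the following is only a minimal sketch of the general idea: a standard PPO Gaussian policy head, plus a hypothetical uncertainty estimate (here, the disagreement of a small ensemble of forward-dynamics models) that is scaled and added to the extrinsic reward. All module names and constants are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Standard PPO actor: state -> mean of a diagonal Gaussian over actions."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent std

    def forward(self, obs):
        mean = self.body(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())

class EnsembleUncertainty(nn.Module):
    """Hypothetical uncertainty estimator: disagreement of K forward models."""
    def __init__(self, obs_dim, act_dim, k=5, hidden=64):
        super().__init__()
        self.models = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, obs_dim)) for _ in range(k))

    def forward(self, obs, act):
        preds = torch.stack([m(torch.cat([obs, act], dim=-1)) for m in self.models])
        return preds.var(dim=0).mean(dim=-1)  # high disagreement = high uncertainty

def shaped_reward(r_ext, obs, act, uncertainty, beta=0.01):
    """Assumed combination: extrinsic reward plus a scaled intrinsic bonus."""
    with torch.no_grad():
        r_int = uncertainty(obs, act)
    return r_ext + beta * r_int
```

The shaped reward would then be fed to the usual PPO advantage estimation and clipped surrogate objective in place of the raw environment reward.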
Related papers
- Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T}\log(T))$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z)
- Proximal Policy Optimization with Adaptive Exploration [0.0]
This paper investigates the exploration-exploitation tradeoff within the context of reinforcement learning.
The proposed adaptive exploration framework dynamically adjusts the exploration magnitude during training based on the recent performance of the agent.
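A rough sketch of the adaptive idea described above, under the assumption that "exploration magnitude" is the standard deviation of a Gaussian PPO policy and "recent performance" is a moving average of episode returns; the update rule and constants are illustrative, not taken from the paper.

```python
from collections import deque
import numpy as np

class AdaptiveStd:
    """Illustrative controller: widen exploration when recent returns stagnate."""
    def __init__(self, init_std=0.5, min_std=0.05, max_std=1.0, window=20):
        self.std, self.min_std, self.max_std = init_std, min_std, max_std
        self.returns = deque(maxlen=window)

    def update(self, episode_return):
        self.returns.append(episode_return)
        if len(self.returns) < self.returns.maxlen:
            return self.std
        half = len(self.returns) // 2
        old = np.mean(list(self.returns)[:half])
        new = np.mean(list(self.returns)[half:])
        # If performance is improving, narrow the exploration noise; otherwise widen it.
        self.std *= 0.95 if new > old else 1.05
        self.std = float(np.clip(self.std, self.min_std, self.max_std))
        return self.std
```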
arXiv Detail & Related papers (2024-05-07T20:51:49Z)
- Surpassing legacy approaches to PWR core reload optimization with single-objective Reinforcement learning [0.0]
We have developed methods based on Deep Reinforcement Learning (DRL) for both single- and multi-objective optimization.
In this paper, we demonstrate the advantage of our RL-based approach, specifically using Proximal Policy Optimization (PPO).
PPO adapts its search capability via a policy with learnable weights, allowing it to function as both a global and local search method.
arXiv Detail & Related papers (2024-02-16T19:35:58Z)
- Representation Learning with Multi-Step Inverse Kinematics: An Efficient and Optimal Approach to Rich-Observation RL [106.82295532402335]
Existing reinforcement learning algorithms suffer from computational intractability, strong statistical assumptions, and suboptimal sample complexity.
We provide the first computationally efficient algorithm that attains rate-optimal sample complexity with respect to the desired accuracy level.
Our algorithm, MusIK, combines systematic exploration with representation learning based on multi-step inverse kinematics.
arXiv Detail & Related papers (2023-04-12T14:51:47Z)
- Entropy Augmented Reinforcement Learning [0.0]
We propose a shifted Markov decision process (MDP) to encourage exploration and reinforce the agent's ability to escape from suboptimal solutions.
Our experiments test the augmented TRPO and PPO on MuJoCo benchmark tasks and indicate that the agent is encouraged toward higher-reward regions.
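As a loose illustration only: one simple way to realize an entropy-style augmentation on top of TRPO/PPO is to add an entropy bonus, and a constant offset for the "shifted" MDP, to the per-step reward. The exact shift used in the paper is not given in this summary, so the snippet below is an assumption.

```python
def augmented_reward(r, log_prob, alpha=0.01, shift=0.1):
    """Illustrative reward augmentation: entropy-style bonus plus a constant shift.

    r        : extrinsic reward for the transition
    log_prob : log-probability of the taken action under the current policy
    alpha    : entropy temperature (assumed)
    shift    : constant reward offset defining the shifted MDP (assumed)
    """
    entropy_bonus = -alpha * log_prob  # larger for less likely (more exploratory) actions
    return r + entropy_bonus + shift
```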
arXiv Detail & Related papers (2022-08-19T13:09:32Z)
- Generative Actor-Critic: An Off-policy Algorithm Using the Push-forward Model [24.030426634281643]
In continuous control tasks, the widely used policies with Gaussian distributions result in ineffective exploration of environments.
We propose a density-free off-policy algorithm, Generative Actor-Critic, using the push-forward model to increase the expressiveness of policies.
We show that push-forward policies possess desirable features, such as multi-modality, which can noticeably improve the exploration efficiency and performance of algorithms.
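A minimal sketch of a push-forward policy in the sense described above: actions are produced by pushing Gaussian noise through a network, so the policy can represent multi-modal action distributions and can be sampled without an explicit density. The architecture and names are illustrative, not the paper's exact model.

```python
import torch
import torch.nn as nn

class PushForwardPolicy(nn.Module):
    """Density-free policy: action = f(state, noise), sampled implicitly."""
    def __init__(self, obs_dim, act_dim, noise_dim=8, hidden=128):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh())  # actions squashed to [-1, 1]

    def sample(self, obs, n=1):
        # Push independent Gaussian noise through the network; different noise
        # draws can land in different action modes (multi-modality).
        obs = obs.unsqueeze(0).expand(n, -1) if obs.dim() == 1 else obs
        z = torch.randn(obs.shape[0], self.noise_dim)
        return self.net(torch.cat([obs, z], dim=-1))
```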
arXiv Detail & Related papers (2021-05-08T16:29:20Z)
- The Surprising Effectiveness of MAPPO in Cooperative, Multi-Agent Games [67.47961797770249]
Multi-Agent PPO (MAPPO) is a multi-agent PPO variant which adopts a centralized value function.
We show that MAPPO achieves performance comparable to the state-of-the-art in three popular multi-agent testbeds.
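The key structural point above, decentralized actors with a centralized value function, can be sketched as follows; the network sizes and the choice of feeding the concatenated joint observation to the critic are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Per-agent policy: conditions only on that agent's local observation."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class CentralCritic(nn.Module):
    """Centralized value function: sees the global (joint) state during training."""
    def __init__(self, global_state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(global_state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, global_state):
        return self.net(global_state).squeeze(-1)

# n_agents actors (decentralized execution) sharing one critic (centralized training)
n_agents, obs_dim, n_actions = 3, 10, 5
actors = [Actor(obs_dim, n_actions) for _ in range(n_agents)]
critic = CentralCritic(global_state_dim=n_agents * obs_dim)
```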
arXiv Detail & Related papers (2021-03-02T18:59:56Z)
- Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how iteration under a more standard notion of low inherent Bellman error, typically employed in least-square value-style algorithms, can provide strong PAC guarantees on learning a near optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
- Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO [90.90009491366273]
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms.
Specifically, we investigate the consequences of "code-level optimizations": algorithm augmentations found only in implementations or described as auxiliary details to the core algorithm.
Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.
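As one concrete illustration of such a code-level optimization, a commonly used PPO value-loss clipping trick looks roughly like the sketch below; the constants are typical defaults, not values taken from the paper.

```python
import torch

def clipped_value_loss(values, old_values, returns, clip_eps=0.2):
    """PPO-style value clipping: the value update is clipped around the old
    value prediction, mirroring the policy-ratio clipping, and the
    element-wise maximum of the two squared errors is minimized."""
    values_clipped = old_values + (values - old_values).clamp(-clip_eps, clip_eps)
    loss_unclipped = (values - returns).pow(2)
    loss_clipped = (values_clipped - returns).pow(2)
    return torch.max(loss_unclipped, loss_clipped).mean()
```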
arXiv Detail & Related papers (2020-05-25T16:24:59Z)
- Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it to correlated actions, and combine these critic-estimated action values to control the variance of gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
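In the same spirit as the variance-control idea above, a generic low-variance estimator for discrete actions weights the critic's value of every action by its probability instead of relying on a single sampled action. This is a standard construction given as an assumption-labeled sketch, not necessarily the paper's exact estimator.

```python
import torch

def all_action_policy_gradient_loss(logits, q_values):
    """Generic low-variance discrete-action policy-gradient loss.

    logits   : [batch, n_actions] policy logits
    q_values : [batch, n_actions] critic estimates, treated as constants here
    """
    probs = torch.softmax(logits, dim=-1)
    # Maximize E_{a~pi}[Q(s,a)] => minimize its negation; detach Q so only the
    # policy receives gradients through this loss.
    return -(probs * q_values.detach()).sum(dim=-1).mean()
```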
arXiv Detail & Related papers (2020-02-10T04:23:09Z)