Hindsight Experience Replay with Kronecker Product Approximate Curvature
- URL: http://arxiv.org/abs/2010.06142v1
- Date: Fri, 9 Oct 2020 20:25:14 GMT
- Title: Hindsight Experience Replay with Kronecker Product Approximate Curvature
- Authors: Dhuruva Priyan G M, Abhik Singla, Shalabh Bhatnagar
- Abstract summary: Hindsight Experience Replay (HER) is an efficient algorithm for solving Reinforcement Learning tasks.
However, due to its reduced sample efficiency and slower convergence, HER can fail to perform effectively.
Natural gradients address these challenges by driving the model parameters toward better convergence.
Our proposed method addresses the above challenges with better sample efficiency and faster convergence, along with an increased success rate.
- Score: 5.441932327359051
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hindsight Experience Replay (HER) is an efficient algorithm for solving
Reinforcement Learning tasks in sparsely rewarded environments. However, due to
its reduced sample efficiency and slower convergence, HER can fail to perform
effectively. Natural gradients address these challenges by driving the model
parameters toward better convergence and by avoiding bad actions that collapse
the training performance. However, such parameter updates in neural networks
require expensive computation and thus increase training time. Our proposed
method addresses the above challenges with better sample efficiency and faster
convergence, along with an increased success rate. A common failure mode for
DDPG is that the learned Q-function begins to dramatically overestimate
Q-values, which then leads to the policy breaking, because it exploits the
errors in the Q-function. We address this issue by including Twin Delayed Deep
Deterministic Policy Gradients (TD3) in HER. TD3 learns two Q-functions instead
of one and adds noise to the target action, making it harder for the policy to
exploit Q-function errors. The experiments are carried out in OpenAI's MuJoCo
environments. Results on these environments show that our algorithm
(TDHER+KFAC) performs better in most of the scenarios.
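To make the two ingredients named in the abstract concrete, the sketch below illustrates (i) a TD3-style target that takes the minimum of two Q-functions and adds clipped noise to the target action, and (ii) a per-layer K-FAC natural-gradient step in which the Fisher block is approximated by a Kronecker product of the input-activation and pre-activation-gradient second moments. This is a minimal sketch under standard TD3/K-FAC conventions, not the authors' implementation; all function and variable names (policy_target, q1_target, q2_target, kfac_layer_update, the hyperparameter defaults) are hypothetical placeholders.

```python
# Minimal, illustrative sketch (not the paper's code) of the two components
# described in the abstract: a TD3-style target and a per-layer K-FAC update.
import numpy as np

def td3_target(reward, next_obs, goal, done,
               policy_target, q1_target, q2_target,
               gamma=0.98, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Clipped double-Q target with target-policy smoothing (standard TD3)."""
    # Target action from the target policy, conditioned on the (hindsight) goal.
    a_next = policy_target(next_obs, goal)
    # Target-policy smoothing: add clipped Gaussian noise so the policy cannot
    # exploit sharp, erroneous peaks of the Q-function.
    noise = np.clip(noise_std * np.random.randn(*np.shape(a_next)),
                    -noise_clip, noise_clip)
    a_next = np.clip(a_next + noise, -act_limit, act_limit)
    # Clipped double Q: take the minimum of the two target critics to curb the
    # Q-value overestimation (the DDPG failure mode noted in the abstract).
    q_next = np.minimum(q1_target(next_obs, goal, a_next),
                        q2_target(next_obs, goal, a_next))
    return reward + gamma * (1.0 - done) * q_next

def kfac_layer_update(grad_W, A, G, damping=1e-3):
    """K-FAC natural-gradient step for one layer.

    With the layer's Fisher block approximated as A (x) G, where
    A = E[a a^T] is the second moment of the layer's inputs and
    G = E[g g^T] that of the pre-activation gradients, the natural
    gradient F^{-1} vec(grad_W) reduces to G^{-1} grad_W A^{-1},
    which needs only two small matrix inverses per layer.
    """
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    return G_inv @ grad_W @ A_inv  # grad_W has shape (out_dim, in_dim)
```

The damping term added before inverting the Kronecker factors is the usual Tikhonov-style regularization found in K-FAC implementations; it keeps the small per-layer inverses well conditioned.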
Related papers
- Near-Optimal Solutions of Constrained Learning Problems [85.48853063302764]
In machine learning systems, the need to curtail their behavior has become increasingly apparent.
This is evidenced by recent advancements towards developing models that satisfy dual robustness variables.
Our results show that rich parametrizations effectively mitigate non-dimensional, finite learning problems.
arXiv Detail & Related papers (2024-03-18T14:55:45Z)
- Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving properties of the Q-network during training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
- Planning for Sample Efficient Imitation Learning [52.44953015011569]
Current imitation algorithms struggle to achieve high performance and high in-environment sample efficiency simultaneously.
We propose EfficientImitate, a planning-based imitation learning method that can achieve high in-environment sample efficiency and performance simultaneously.
Experimental results show that EI achieves state-of-the-art results in performance and sample efficiency.
arXiv Detail & Related papers (2022-10-18T05:19:26Z)
- M$^2$DQN: A Robust Method for Accelerating Deep Q-learning Network [6.689964384669018]
We propose a framework which uses the Max-Mean loss in Deep Q-Network (M$^2$DQN).
Instead of sampling one batch of experiences in the training step, we sample several batches from the experience replay and update the parameters such that the maximum TD-error of these batches is minimized (a rough sketch of this objective appears after this list).
We verify the effectiveness of this framework with one of the most widely used techniques, Double DQN (DDQN), in several gym games.
arXiv Detail & Related papers (2022-09-16T09:20:35Z)
- Simultaneous Double Q-learning with Conservative Advantage Learning for Actor-Critic Methods [133.85604983925282]
We propose Simultaneous Double Q-learning with Conservative Advantage Learning (SDQ-CAL).
Our algorithm realizes less biased value estimation and achieves state-of-the-art performance in a range of continuous control benchmark tasks.
arXiv Detail & Related papers (2022-05-08T09:17:16Z)
- Optimizing the Long-Term Behaviour of Deep Reinforcement Learning for Pushing and Grasping [0.0]
We investigate the capabilities of two systems to learn long-term rewards and policies.
Ewerton et al. attain their best performance using an agent which only takes the most immediate action under consideration.
We show that this approach enables the models to accurately predict long-term action sequences when trained with large discount factors.
arXiv Detail & Related papers (2022-04-07T15:02:44Z)
- Can Q-learning solve Multi Armed Bantids? [0.0]
We show that current reinforcement learning algorithms are not capable of solving Multi-Armed-Bandit problems.
This stems from variance differences between the policies, which cause two problems.
We propose the Adaptive Symmetric Reward Noising (ASRN) method, which equalizes the reward variance across different policies.
arXiv Detail & Related papers (2021-10-21T07:08:30Z)
- An Improved Algorithm of Robot Path Planning in Complex Environment Based on Double DQN [4.161177874372099]
This paper proposes an improved Double DQN (DDQN) to solve the robot path planning problem by reference to A* and Rapidly-Exploring Random Tree (RRT).
The simulation experimental results validate the efficiency of the improved DDQN.
arXiv Detail & Related papers (2021-07-23T14:03:04Z)
- DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning [95.60782037764928]
We show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas from the propensity estimation literature can be used to importance-sample transitions from the replay buffer and update the policy to prevent deterioration of performance.
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
- DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to this optimal distribution and uses it to re-weight the transitions used for training.
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
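As a rough illustration of the batch-selection idea described in the M$^2$DQN entry above, the sketch below forms a max-mean TD loss: it computes the mean TD-error of several replay batches and returns the largest one, so that minimizing the returned value drives down the worst batch. The network, target network, and batch format are hypothetical placeholders, not that paper's implementation.

```python
# Hedged sketch of a max-mean TD objective over several replay batches,
# following the M^2DQN summary above. All names and shapes are illustrative.
import torch

def max_mean_td_loss(q_net, q_target, batches, gamma=0.99):
    """Return the largest per-batch mean squared TD-error among `batches`.

    Each batch is assumed to be a tuple (obs, act, rew, next_obs, done) of
    tensors, with `act` holding integer action indices.
    """
    batch_losses = []
    for obs, act, rew, next_obs, done in batches:
        with torch.no_grad():
            # Plain max-target for brevity; a Double-DQN target could be
            # substituted here, as the summary pairs the framework with DDQN.
            target = rew + gamma * (1 - done) * q_target(next_obs).max(dim=1).values
        q_sa = q_net(obs).gather(1, act.unsqueeze(1)).squeeze(1)
        batch_losses.append(((q_sa - target) ** 2).mean())
    # Minimizing this value reduces the maximum TD-error across the batches.
    return torch.stack(batch_losses).max()
```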