Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples with
On-Policy Experience
- URL: http://arxiv.org/abs/2109.11767v1
- Date: Fri, 24 Sep 2021 06:46:28 GMT
- Title: Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples with
On-Policy Experience
- Authors: Chayan Banerjee, Zhiyong Chen, Nasimul Noman
- Abstract summary: Soft Actor-Critic (SAC) is an off-policy actor-critic reinforcement learning algorithm.
SAC trains a policy by maximizing the trade-off between expected return and entropy.
It has achieved state-of-the-art performance on a range of continuous-control benchmark tasks.
- Score: 9.06635747612495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Soft Actor-Critic (SAC) is an off-policy actor-critic reinforcement learning
algorithm, essentially based on entropy regularization. SAC trains a policy by
maximizing the trade-off between expected return and entropy (randomness in the
policy). It has achieved state-of-the-art performance on a range of
continuous-control benchmark tasks, outperforming prior on-policy and
off-policy methods. SAC works in an off-policy fashion: data are sampled
uniformly from past experiences stored in a replay buffer and used to update
the parameters of the policy and value-function networks. We propose
modifications that boost the performance of SAC and make it more
sample-efficient. In our improved SAC, we first introduce a new
prioritization scheme for selecting better samples from the experience replay
buffer. Second, we train the policy and value-function networks on a mixture
of the prioritized off-policy data and the latest on-policy data. We compare
our approach with vanilla SAC and recent SAC variants and show that it
outperforms these baselines and is more stable and sample-efficient on a
number of continuous-control tasks in MuJoCo environments.
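As a rough illustration of the approach described above (SAC maximizes E[sum_t r(s_t, a_t) + alpha * H(pi(.|s_t))], and the proposed changes only affect how training data are drawn), the sketch below mixes prioritized off-policy samples with the most recent on-policy transitions when forming each training batch. The mixing fraction, the priority exponent, and all class and method names are illustrative assumptions, not the authors' exact scheme.

```python
import numpy as np

class MixedPrioritizedBuffer:
    """Replay buffer sketch: each batch combines prioritized off-policy
    samples with the latest on-policy transitions (assumed mixing rule,
    not the paper's exact prioritization scheme)."""

    def __init__(self, capacity=1_000_000, on_policy_fraction=0.25, alpha=0.6):
        self.capacity = capacity
        self.on_policy_fraction = on_policy_fraction  # assumed mixing ratio
        self.alpha = alpha                            # priority exponent
        self.storage, self.priorities = [], []

    def add(self, transition, priority=1.0):
        # New transitions enter with a default priority; a real scheme would
        # refresh priorities from some per-sample quality measure.
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)
            self.priorities.pop(0)
        self.storage.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        # On-policy part: the most recently added transitions.
        n_on = min(int(batch_size * self.on_policy_fraction), len(self.storage))
        n_off = batch_size - n_on
        on_idx = np.arange(len(self.storage) - n_on, len(self.storage))
        # Off-policy part: older transitions drawn with probability
        # proportional to priority**alpha (assumes the buffer already holds
        # more than one batch of data).
        n_old = len(self.storage) - n_on
        p = np.asarray(self.priorities[:n_old], dtype=np.float64) ** self.alpha
        p /= p.sum()
        off_idx = np.random.choice(n_old, size=n_off, p=p)
        return [self.storage[i] for i in np.concatenate([off_idx, on_idx])]
```

The sampled batch would then feed the usual SAC policy and Q-function updates; only the sampling distribution changes.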
Related papers
- Soft Actor-Critic with Beta Policy via Implicit Reparameterization Gradients [0.0]
Soft actor-critic (SAC) mitigates poor sample efficiency by combining policy optimization and off-policy learning.
It is limited to distributions whose gradients can be computed through the reparameterization trick.
We extend this technique to train SAC with the beta policy on simulated robot locomotion environments.
Experimental results show that the beta policy is a viable alternative, as it outperforms the normal policy and is on par with the squashed normal policy.
arXiv Detail & Related papers (2024-09-08T04:30:51Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- RLSAC: Reinforcement Learning enhanced Sample Consensus for End-to-End Robust Estimation [74.47709320443998]
We propose RLSAC, a novel Reinforcement Learning enhanced SAmple Consensus framework for end-to-end robust estimation.
RLSAC employs a graph neural network that uses both data and memory features to guide the exploration directions for sampling the next minimum set.
Our experimental results demonstrate that RLSAC can learn from features to gradually explore a better hypothesis.
arXiv Detail & Related papers (2023-08-10T03:14:19Z)
- You May Not Need Ratio Clipping in PPO [117.03368180633463]
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data.
Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples.
We show in this paper that such ratio clipping may not be a good option as it can fail to effectively bound the ratios.
We show that ESPO, the proposed clipping-free alternative that instead stops the surrogate-optimization epochs early, can be easily scaled up to distributed training with many workers, delivering strong performance as well (see the illustrative sketch after this list).
arXiv Detail & Related papers (2022-01-31T20:26:56Z)
- Soft Actor-Critic with Cross-Entropy Policy Optimization [0.45687771576879593]
We propose Soft Actor-Critic with Cross-Entropy Policy Optimization (SAC-CEPO).
SAC-CEPO uses Cross-Entropy Method (CEM) to optimize the policy network of SAC.
We show that SAC-CEPO achieves competitive performance against the original SAC.
arXiv Detail & Related papers (2021-12-21T11:38:12Z)
- Replay For Safety [51.11953997546418]
In experience replay, past transitions are stored in a memory buffer and re-used during learning.
We show that using an appropriate biased sampling scheme can allow us to achieve a safe policy.
arXiv Detail & Related papers (2021-12-08T11:10:57Z)
- Off-Policy Correction for Deep Deterministic Policy Gradient Algorithms via Batch Prioritized Experience Replay [0.0]
We develop a novel algorithm, Batch Prioritizing Experience Replay via KL Divergence, which prioritizes batches of transitions.
We combine our algorithm with Deep Deterministic Policy Gradient and Twin Delayed Deep Deterministic Policy Gradient and evaluate it on various continuous control tasks.
arXiv Detail & Related papers (2021-11-02T19:51:59Z)
- Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to the Bellman optimality and evaluation back-ups, taking a more conservative update, can yield much stronger guarantees.
arXiv Detail & Related papers (2020-07-16T09:25:54Z)
- DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning [95.60782037764928]
First, we show that a simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas from the propensity-estimation literature can be used to importance-sample transitions from the replay buffer and to update the policy in a way that prevents performance deterioration.
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
- Band-limited Soft Actor Critic Model [15.11069042369131]
Soft Actor Critic (SAC) algorithms show remarkable performance in complex simulated environments.
We take this idea one step further by artificially bandlimiting the target critic spatial resolution.
We derive the closed form solution in the linear case and show that bandlimiting reduces the interdependency between the low frequency components of the state-action value approximation.
arXiv Detail & Related papers (2020-06-19T22:52:43Z)
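For the "You May Not Need Ratio Clipping in PPO" entry above, the sketch below contrasts the standard clipped surrogate with a simple early-stopping check on the probability ratios. The stopping statistic and threshold are assumptions chosen for illustration, not the paper's exact ESPO criterion.

```python
import numpy as np

def ppo_clipped_surrogate(ratio, advantage, eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s) evaluated on the sampled actions.
    # Standard PPO objective: mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()

def stop_optimization_early(ratio, max_mean_deviation=0.25):
    # Clipping-free alternative: end the mini-batch optimization epochs once
    # the sampled ratios have drifted too far from 1 (illustrative criterion).
    return np.mean(np.abs(ratio - 1.0)) > max_mean_deviation
```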
This list is automatically generated from the titles and abstracts of the papers on this site.