Soft Actor-Critic with Cross-Entropy Policy Optimization
- URL: http://arxiv.org/abs/2112.11115v1
- Date: Tue, 21 Dec 2021 11:38:12 GMT
- Title: Soft Actor-Critic with Cross-Entropy Policy Optimization
- Authors: Zhenyang Shi, Surya P.N. Singh
- Abstract summary: We propose Soft Actor-Critic with Cross-Entropy Policy Optimization (SAC-CEPO)
SAC-CEPO uses Cross-Entropy Method (CEM) to optimize the policy network of SAC.
We show that SAC-CEPO achieves competitive performance against the original SAC.
- Score: 0.45687771576879593
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Soft Actor-Critic (SAC) is one of the state-of-the-art off-policy
reinforcement learning (RL) algorithms within the maximum-entropy RL framework.
SAC has been demonstrated to perform very well on a range of continuous control
tasks with good stability and robustness. SAC learns a stochastic
Gaussian policy that can maximize a trade-off between total expected reward and
the policy entropy. To update the policy, SAC minimizes the KL divergence
between the current policy density and the soft value function density.
The reparameterization trick is then used to obtain an approximate gradient of
this divergence. In this paper, we propose Soft Actor-Critic with Cross-Entropy
Policy Optimization (SAC-CEPO), which uses Cross-Entropy Method (CEM) to
optimize the policy network of SAC. The initial idea is to use CEM to
iteratively sample the distribution closest to the soft value function density
and to use the resultant distribution as a target for updating the policy
network. To reduce the computational complexity, we also introduce a decoupled
policy structure that splits the Gaussian policy into one policy that learns
the mean and another that learns the deviation, such that only the mean policy
is trained by CEM. We show that this decoupled policy structure converges to an
optimal policy, and we demonstrate experimentally that SAC-CEPO achieves
competitive performance against the original SAC.
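The sketch below is a minimal illustration of the update described above, assuming a toy soft Q function and illustrative CEM hyperparameters (population size, elite fraction, iteration count) that are not taken from the paper: CEM iteratively refits a Gaussian search distribution around elite candidate action means, and the resulting mean is then used as a regression target for a decoupled mean-policy network, while the deviation policy would keep its usual SAC gradient update.

```python
# Minimal sketch (not the authors' code): CEM searches over candidate action means
# for a single state, and the resulting mean becomes a regression target for the
# decoupled mean-policy network. The soft Q function and all hyperparameters below
# are illustrative assumptions.
import numpy as np

def toy_soft_q(state, actions):
    """Toy stand-in for the learned soft Q(s, a): concave quadratic in the action."""
    target = np.tanh(state.sum())                    # pretend the optimum depends on the state
    return -np.sum((actions - target) ** 2, axis=-1)

def cem_mean_target(state, act_dim, q_fn, iters=10, pop=64, elite_frac=0.125):
    """Run CEM toward the mode of the soft value density at `state`; return the mean."""
    mu, sigma = np.zeros(act_dim), np.ones(act_dim)  # Gaussian search distribution
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        candidates = mu + sigma * np.random.randn(pop, act_dim)
        scores = q_fn(state, candidates)             # higher soft Q = closer to the density mode
        elite = candidates[np.argsort(scores)[-n_elite:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mu                                        # target for the mean-policy network

state = np.random.randn(4)
target_mean = cem_mean_target(state, act_dim=2, q_fn=toy_soft_q)
predicted_mean = np.zeros(2)                         # stand-in for mean_policy(state)
mse_loss = np.mean((predicted_mean - target_mean) ** 2)
print("CEM target:", target_mean, "mean-policy loss:", mse_loss)
```

In practice the CEM scores would come from the learned soft Q-network and the regression target would drive a gradient step on the mean-policy parameters; here a plain squared error on a single state stands in for that step.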
Related papers
- Diffusion Policy Policy Optimization [37.04382170999901]
Diffusion Policy Policy Optimization (DPPO) is an algorithmic framework for fine-tuning diffusion-based policies.
DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks.
We show that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization.
arXiv Detail & Related papers (2024-09-01T02:47:50Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Stepwise Alignment for Constrained Language Model Policy Optimization [12.986006070964772]
Safety and trustworthiness are indispensable requirements for real-world applications of AI systems using large language models (LLMs).
This paper formulates human value alignment as an optimization problem of the language model policy to maximize reward under a safety constraint.
One key idea behind the proposed SACPO method, supported by theory, is that the optimal policy incorporating reward and safety can be obtained directly from a reward-aligned policy.
arXiv Detail & Related papers (2024-04-17T03:44:58Z) - Adversarially Trained Weighted Actor-Critic for Safe Offline Reinforcement Learning [9.94248417157713]
We propose WSAC, a novel algorithm for Safe Offline Reinforcement Learning (RL) under functional approximation.
WSAC is designed as a two-player Stackelberg game to optimize a refined objective function.
arXiv Detail & Related papers (2024-01-01T01:44:58Z) - Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states.
The algorithm is able to provide more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z) - Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples with
On-Policy Experience [9.06635747612495]
Soft Actor-Critic (SAC) is an off-policy actor-critic reinforcement learning algorithm.
SAC trains a policy by maximizing the trade-off between expected return and entropy.
It has achieved state-of-the-art performance on a range of continuous-control benchmark tasks.
arXiv Detail & Related papers (2021-09-24T06:46:28Z) - Cautious Policy Programming: Exploiting KL Regularization in Monotonic
Policy Improvement for Reinforcement Learning [11.82492300303637]
We propose a novel value-based reinforcement learning (RL) algorithm that can ensure monotonic policy improvement during learning.
We demonstrate that the proposed algorithm can trade off performance and stability in both didactic classic control problems and challenging high-dimensional Atari games.
arXiv Detail & Related papers (2021-07-13T01:03:10Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Implicit Distributional Reinforcement Learning [61.166030238490634]
Implicit distributional actor-critic (IDAC) is built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z) - Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z) - Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method, called Kalman Optimization for Value Approximation (KOVA)
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.