A Strategy-Oriented Bayesian Soft Actor-Critic Model
- URL: http://arxiv.org/abs/2303.04193v1
- Date: Tue, 7 Mar 2023 19:31:25 GMT
- Title: A Strategy-Oriented Bayesian Soft Actor-Critic Model
- Authors: Qin Yang, Ramviyas Parasuraman
- Abstract summary: This paper proposes a novel hierarchical strategy decomposition approach based on the Bayesian chain rule.
We integrate this approach into the state-of-the-art DRL method -- soft actor-critic (SAC) and build the corresponding Bayesian soft actor-critic (BSAC) model.
We compare the proposed BSAC method with the SAC and other state-of-the-art approaches such as TD3, DDPG, and PPO on the standard continuous control benchmarks.
- Score: 1.52292571922932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adopting reasonable strategies is challenging but crucial for an intelligent
agent with limited resources working in hazardous, unstructured, and dynamic
environments to improve the system's utility, decrease the overall cost, and
increase mission success probability. This paper proposes a novel hierarchical
strategy decomposition approach based on the Bayesian chain rule to separate an
intricate policy into several simple sub-policies and organize their
relationships as Bayesian strategy networks (BSN). We integrate this approach
into the state-of-the-art DRL method -- soft actor-critic (SAC) and build the
corresponding Bayesian soft actor-critic (BSAC) model by organizing several
sub-policies as a joint policy. We compare the proposed BSAC method with the
SAC and other state-of-the-art approaches such as TD3, DDPG, and PPO on the
standard continuous control benchmarks -- Hopper-v2, Walker2d-v2, and
Humanoid-v2 -- in MuJoCo with the OpenAI Gym environment. The results
demonstrate the promising potential of the BSAC method, which significantly
improves training efficiency.
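The core mechanism is the Bayesian chain-rule factorization of the joint policy into sub-policies, pi(a_1, ..., a_n | s) = pi_1(a_1 | s) * prod_i pi_i(a_i | s, parents(a_i)), with the dependency structure given by the BSN. The sketch below is a minimal illustration of that factorization, not the authors' code: the class names, the fully chained parent structure, and the Gaussian heads are assumptions, and SAC's tanh squashing, entropy temperature, and critic updates are omitted.

```python
# Minimal sketch (not the authors' code): a joint policy factored by the chain
# rule into sub-policies, each conditioning on the state and its parents' actions.
import torch
import torch.nn as nn
from torch.distributions import Normal


class SubPolicy(nn.Module):
    """Gaussian sub-policy pi_i(a_i | state, parent actions)."""

    def __init__(self, in_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, x):
        h = self.body(x)
        std = self.log_std(h).clamp(-20, 2).exp()
        return Normal(self.mu(h), std)


class ChainedPolicy(nn.Module):
    """Joint policy pi(a | s) = prod_i pi_i(a_i | s, a_1, ..., a_{i-1})."""

    def __init__(self, state_dim, act_dims):
        super().__init__()
        self.subs = nn.ModuleList()
        in_dim = state_dim
        for d in act_dims:  # simple chain: each sub-policy sees all earlier sub-actions
            self.subs.append(SubPolicy(in_dim, d))
            in_dim += d

    def sample(self, state):
        parts, log_prob, x = [], 0.0, state
        for sub in self.subs:
            dist = sub(x)
            a_i = dist.rsample()                              # reparameterized sample
            log_prob = log_prob + dist.log_prob(a_i).sum(-1)  # chain rule: log-probs add
            parts.append(a_i)
            x = torch.cat([x, a_i], dim=-1)
        return torch.cat(parts, dim=-1), log_prob


# Toy usage: an 11-dim observation controlled by two sub-policies (3 + 3 action dims).
policy = ChainedPolicy(state_dim=11, act_dims=[3, 3])
action, logp = policy.sample(torch.randn(4, 11))  # batch of 4 states
```

Because the sub-policy log-probabilities add up to the joint log-probability, a factored policy of this form can in principle plug into a SAC-style maximum-entropy objective.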
Related papers
- POTEC: Off-Policy Learning for Large Action Spaces via Two-Stage Policy Decomposition [40.851324484481275]
We study off-policy learning of contextual bandit policies in large discrete action spaces.
We propose a novel two-stage algorithm, called Policy Optimization via Two-Stage Policy Decomposition.
We show that POTEC provides substantial improvements in OPL effectiveness particularly in large and structured action spaces.
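As a generic sketch of what such a two-stage factorization looks like (the cluster variable c and the estimators used in each stage are illustrative assumptions, not taken from this summary):

```latex
% A first-stage policy picks a cluster (region) c of the large action space;
% a second-stage policy picks the final action a within that cluster.
\pi(a \mid x) \;=\; \sum_{c} \pi^{(1)}(c \mid x)\, \pi^{(2)}(a \mid x, c)
```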
arXiv Detail & Related papers (2024-02-09T03:01:13Z)
- Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states.
The algorithm is able to provide more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z)
- Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning [64.10794426777493]
Model-based reinforcement learning (RL) has demonstrated remarkable successes on a range of continuous control tasks.
Recent practices tend to distill optimized action sequences into an RL policy during the training phase.
We develop an approach to distill from model-based planning to the policy.
arXiv Detail & Related papers (2023-07-24T16:52:31Z)
- Diverse Policy Optimization for Structured Action Space [59.361076277997704]
We propose Diverse Policy Optimization (DPO) to model policies in a structured action space as energy-based models (EBMs).
A novel and powerful generative model, GFlowNet, is introduced as an efficient and diverse EBM-based policy sampler.
Experiments on ATSC and Battle benchmarks demonstrate that DPO can efficiently discover surprisingly diverse policies.
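For context, a generic energy-based policy over a discrete structured action space can be written as below; the energy E_theta and the explicit normalization are illustrative assumptions rather than DPO's exact objective.

```latex
% Energy-based policy: action probabilities are proportional to exp(-energy);
% a GFlowNet-style sampler is trained to draw actions from this distribution.
\pi_\theta(a \mid s) \;=\; \frac{\exp\!\big(-E_\theta(s, a)\big)}{\sum_{a'} \exp\!\big(-E_\theta(s, a')\big)}
```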
arXiv Detail & Related papers (2023-02-23T10:48:09Z)
- Bayesian Soft Actor-Critic: A Directed Acyclic Strategy Graph Based Deep Reinforcement Learning [1.8220718426493654]
This paper proposes a novel directed acyclic strategy graph decomposition approach based on Bayesian chaining.
We integrate this approach into the state-of-the-art DRL method -- soft actor-critic (SAC).
We build the corresponding Bayesian soft actor-critic (BSAC) model by organizing several sub-policies as a joint policy.
arXiv Detail & Related papers (2022-08-11T20:36:23Z)
- Soft Actor-Critic with Cross-Entropy Policy Optimization [0.45687771576879593]
We propose Soft Actor-Critic with Cross-Entropy Policy Optimization (SAC-CEPO).
SAC-CEPO uses Cross-Entropy Method (CEM) to optimize the policy network of SAC.
We show that SAC-CEPO achieves competitive performance against the original SAC.
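The Cross-Entropy Method itself is a standard derivative-free optimizer; the sketch below is a generic CEM loop, not the SAC-CEPO implementation (how candidates are scored against SAC's critic is not described in this summary).

```python
# Generic Cross-Entropy Method (CEM): sample candidates from a Gaussian,
# keep the top-scoring elite fraction, and refit the Gaussian to the elites.
import numpy as np

def cem(score_fn, dim, iters=50, pop=64, elite_frac=0.125, seed=0):
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        candidates = rng.normal(mean, std, size=(pop, dim))
        scores = np.array([score_fn(c) for c in candidates])
        elites = candidates[np.argsort(scores)[-n_elite:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean

# Toy usage: maximize -||x - 3||^2, whose optimum is x = 3 in every dimension.
best = cem(lambda x: -np.sum((x - 3.0) ** 2), dim=5)
```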
arXiv Detail & Related papers (2021-12-21T11:38:12Z)
- A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation [61.740187363451746]
Marginalized importance sampling (MIS) measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution.
We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy.
We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.
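For reference, the two quantities named above have standard definitions (the feature map phi and the paper's specific link between them are not given in this summary, so the following are only the textbook forms):

```latex
% MIS weight: ratio of discounted state-action occupancies under the target
% policy pi and the sampling distribution mu.  Successor representation:
% expected discounted sum of features along trajectories of pi from (s, a).
w_{\pi/\mu}(s, a) = \frac{d_{\pi}(s, a)}{d_{\mu}(s, a)},
\qquad
\psi_{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} \phi(s_t, a_t) \,\middle|\, s_0 = s,\, a_0 = a\right]
```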
arXiv Detail & Related papers (2021-06-12T20:21:38Z)
- Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety Constraints in Finite MDPs [71.47895794305883]
We study the problem of Safe Policy Improvement (SPI) under constraints in the offline Reinforcement Learning setting.
We present an SPI algorithm for this setting that takes into account the user's preferences for handling the trade-offs between different reward signals.
arXiv Detail & Related papers (2021-05-31T21:04:21Z)
- Semi-On-Policy Training for Sample Efficient Multi-Agent Policy Gradients [51.749831824106046]
We introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods.
We show that our methods perform as well or better than state-of-the-art value-based methods on a variety of SMAC tasks.
arXiv Detail & Related papers (2021-04-27T19:37:01Z)
- Decomposed Soft Actor-Critic Method for Cooperative Multi-Agent Reinforcement Learning [10.64928897082273]
Experimental results demonstrate that mSAC significantly outperforms the policy-based approach COMA.
In addition, mSAC achieves good results on tasks with large action spaces, such as 2c_vs_64zg and MMM2.
arXiv Detail & Related papers (2021-04-14T07:02:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.