Bayesian Soft Actor-Critic: A Directed Acyclic Strategy Graph Based Deep
Reinforcement Learning
- URL: http://arxiv.org/abs/2208.06033v2
- Date: Mon, 4 Dec 2023 15:35:55 GMT
- Title: Bayesian Soft Actor-Critic: A Directed Acyclic Strategy Graph Based Deep
Reinforcement Learning
- Authors: Qin Yang, Ramviyas Parasuraman
- Abstract summary: This paper proposes a novel directed acyclic strategy graph decomposition approach based on Bayesian chaining.
We integrate this approach into the state-of-the-art DRL method -- soft actor-critic (SAC)
We build the corresponding Bayesian soft actor-critic (BSAC) model by organizing several sub-policies as a joint policy.
- Score: 1.8220718426493654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adopting reasonable strategies is challenging but crucial for an intelligent
agent with limited resources working in hazardous, unstructured, and dynamic
environments to improve the system's utility, decrease the overall cost, and
increase mission success probability. This paper proposes a novel directed
acyclic strategy graph decomposition approach based on Bayesian chaining to
separate an intricate policy into several simple sub-policies and organize
their relationships as Bayesian strategy networks (BSN). We integrate this
approach into the state-of-the-art DRL method -- soft actor-critic (SAC), and
build the corresponding Bayesian soft actor-critic (BSAC) model by organizing
several sub-policies as a joint policy. We compare our method against the
state-of-the-art deep reinforcement learning algorithms on the standard
continuous control benchmarks in the OpenAI Gym environment. The results
demonstrate the promising potential of the BSAC method, which significantly
improves training efficiency.
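The chained decomposition described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch example rather than the authors' released code: it assumes an action split into two sub-actions with standard Gaussian sub-policies, and shows how the joint log-probability used by SAC's entropy-regularized actor update factors through the Bayesian chain rule pi(a1, a2 | s) = pi_1(a1 | s) * pi_2(a2 | s, a1).

```python
# Sketch of a Bayesian-chained joint policy: the action is split into two
# sub-actions, and the second sub-policy is conditioned on the first, i.e.
#   pi(a1, a2 | s) = pi_1(a1 | s) * pi_2(a2 | s, a1).
# Names and dimensions here are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class GaussianSubPolicy(nn.Module):
    """Small MLP that outputs a diagonal Gaussian over one sub-action."""
    def __init__(self, in_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * act_dim))

    def forward(self, x):
        mu, log_std = self.net(x).chunk(2, dim=-1)
        std = log_std.clamp(-5, 2).exp()
        return torch.distributions.Normal(mu, std)

class ChainedJointPolicy(nn.Module):
    """Joint policy pi(a1, a2 | s) = pi_1(a1 | s) * pi_2(a2 | s, a1)."""
    def __init__(self, obs_dim: int, a1_dim: int, a2_dim: int):
        super().__init__()
        self.pi1 = GaussianSubPolicy(obs_dim, a1_dim)
        self.pi2 = GaussianSubPolicy(obs_dim + a1_dim, a2_dim)

    def sample(self, obs):
        d1 = self.pi1(obs)
        a1 = d1.rsample()                              # reparameterized sample
        d2 = self.pi2(torch.cat([obs, a1], dim=-1))
        a2 = d2.rsample()
        # Log-probability of the joint action is the sum of the chained terms.
        log_prob = d1.log_prob(a1).sum(-1) + d2.log_prob(a2).sum(-1)
        return torch.cat([a1, a2], dim=-1), log_prob

# Usage: the summed log-probability plugs into SAC's actor loss in place of a
# monolithic policy's log-probability.
policy = ChainedJointPolicy(obs_dim=8, a1_dim=2, a2_dim=2)
obs = torch.randn(4, 8)
action, logp = policy.sample(obs)   # shapes: (4, 4) and (4,)
```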
Related papers
- Evolutionary Strategy Guided Reinforcement Learning via MultiBuffer
Communication [0.0]
We introduce a new Evolutionary Reinforcement Learning model which combines a particular family of evolutionary algorithms called Evolutionary Strategies with the off-policy Deep Reinforcement Learning algorithm TD3.
The proposed algorithm is demonstrated to perform competitively with current Evolutionary Reinforcement Learning algorithms on MuJoCo control tasks.
arXiv Detail & Related papers (2023-06-20T13:41:57Z) - Context-Aware Bayesian Network Actor-Critic Methods for Cooperative
Multi-Agent Reinforcement Learning [7.784991832712813]
We introduce a Bayesian network to establish correlations between agents' action selections in their joint policy.
We develop practical algorithms to learn the context-aware Bayesian network policies.
Empirical results on a range of MARL benchmarks show the benefits of our approach.
arXiv Detail & Related papers (2023-06-02T21:22:27Z) - Strategy Synthesis in Markov Decision Processes Under Limited Sampling
Access [3.441021278275805]
In environments modeled by gray-box Markov decision processes (MDPs), the impact of the agents' actions is known in terms of successor states but not the transition probabilities involved.
In this paper, we devise a strategy synthesis algorithm for gray-box MDPs via reinforcement learning that utilizes interval MDPs as the internal model.
arXiv Detail & Related papers (2023-03-22T16:58:44Z) - A Strategy-Oriented Bayesian Soft Actor-Critic Model [1.52292571922932]
This paper proposes a novel hierarchical strategy decomposition approach based on the Bayesian chain rule.
We integrate this approach into the state-of-the-art DRL method -- soft actor-critic (SAC) and build the corresponding Bayesian soft actor-critic (BSAC) model.
We compare the proposed BSAC method with the SAC and other state-of-the-art approaches such as TD3, DDPG, and PPO on the standard continuous control benchmarks.
arXiv Detail & Related papers (2023-03-07T19:31:25Z) - Diverse Policy Optimization for Structured Action Space [59.361076277997704]
We propose Diverse Policy Optimization (DPO) to model the policies in structured action spaces as energy-based models (EBMs).
A novel and powerful generative model, GFlowNet, is introduced as an efficient and diverse EBM-based policy sampler.
Experiments on ATSC and Battle benchmarks demonstrate that DPO can efficiently discover surprisingly diverse policies.
arXiv Detail & Related papers (2023-02-23T10:48:09Z) - A Regularized Implicit Policy for Offline Reinforcement Learning [54.7427227775581]
Offline reinforcement learning enables learning from a fixed dataset without further interactions with the environment.
We propose a framework that supports learning a flexible yet well-regularized fully-implicit policy.
Experiments and ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.
arXiv Detail & Related papers (2022-02-19T20:22:04Z) - Soft Actor-Critic with Cross-Entropy Policy Optimization [0.45687771576879593]
We propose Soft Actor-Critic with Cross-Entropy Policy Optimization (SAC-CEPO)
SAC-CEPO uses Cross-Entropy Method (CEM) to optimize the policy network of SAC.
We show that SAC-CEPO achieves competitive performance against the original SAC.
arXiv Detail & Related papers (2021-12-21T11:38:12Z) - Off-policy Reinforcement Learning with Optimistic Exploration and
Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z) - A Deep Reinforcement Learning Approach to Marginalized Importance
Sampling with the Successor Representation [61.740187363451746]
Marginalized importance sampling (MIS) measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution.
We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy.
We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.
arXiv Detail & Related papers (2021-06-12T20:21:38Z) - Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety
Constraints in Finite MDPs [71.47895794305883]
We study the problem of Safe Policy Improvement (SPI) under constraints in the offline Reinforcement Learning setting.
We present an SPI approach for this RL setting that takes into account the preferences of the algorithm's user for handling the trade-offs between different reward signals.
arXiv Detail & Related papers (2021-05-31T21:04:21Z) - SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep
Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms.
SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration.
arXiv Detail & Related papers (2020-07-09T17:08:44Z)
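To make the two SUNRISE ingredients summarized above concrete, here is a minimal NumPy sketch of ensemble-weighted Bellman targets and UCB-based action selection. The sigmoid-based weighting, the temperature, and the UCB coefficient are illustrative assumptions rather than the paper's exact hyperparameters.

```python
# Sketch of SUNRISE's two ingredients on top of an ensemble of Q-estimates:
# (a) down-weight Bellman targets where the ensemble disagrees (high std),
# (b) pick exploratory actions by an upper confidence bound over the ensemble.
# The weighting function and coefficients below are illustrative choices.
import numpy as np

def weighted_bellman_weight(q_targets: np.ndarray, temperature: float = 10.0):
    """q_targets: (ensemble, batch) target Q-values for next state-actions.
    Returns per-sample weights in (0.5, 1.0) that shrink as ensemble std grows."""
    std = q_targets.std(axis=0)
    return 1.0 / (1.0 + np.exp(std * temperature)) + 0.5  # sigmoid(-std*T) + 0.5

def ucb_action(q_candidates: np.ndarray, lam: float = 1.0) -> int:
    """q_candidates: (ensemble, num_candidate_actions) Q-values for one state.
    Returns the candidate action index with the highest mean + lam * std."""
    mean, std = q_candidates.mean(axis=0), q_candidates.std(axis=0)
    return int(np.argmax(mean + lam * std))

# Toy usage with a 5-member ensemble, 3 TD targets, and 4 candidate actions.
rng = np.random.default_rng(0)
w = weighted_bellman_weight(rng.normal(size=(5, 3)))  # weights for TD targets
a = ucb_action(rng.normal(size=(5, 4)))               # exploratory action index
```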