PAC-Bayesian Soft Actor-Critic Learning
- URL: http://arxiv.org/abs/2301.12776v3
- Date: Mon, 10 Jun 2024 12:53:36 GMT
- Title: PAC-Bayesian Soft Actor-Critic Learning
- Authors: Bahareh Tasdighi, Abdullah Akgül, Manuel Haussmann, Kenny Kazimirzak Brink, Melih Kandemir
- Abstract summary: Actor-critic algorithms address the dual goals of reinforcement learning (RL), policy evaluation and improvement, via two separate function approximators.
We tackle this bottleneck by employing an existing Probably Approximately Correct (PAC) Bayesian bound for the first time as the critic training objective of the Soft Actor-Critic (SAC) algorithm.
- Score: 9.752336113724928
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Actor-critic algorithms address the dual goals of reinforcement learning (RL), policy evaluation and improvement, via two separate function approximators. The practicality of this approach comes at the expense of training instability, caused mainly by the destructive effect of the approximation errors of the critic on the actor. We tackle this bottleneck by employing an existing Probably Approximately Correct (PAC) Bayesian bound for the first time as the critic training objective of the Soft Actor-Critic (SAC) algorithm. We further demonstrate that online learning performance improves significantly when a stochastic actor explores multiple futures by critic-guided random search. We observe our resulting algorithm to compare favorably against the state-of-the-art SAC implementation on multiple classical control and locomotion tasks in terms of both sample efficiency and regret.
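The abstract points to two mechanisms: a PAC-Bayesian bound used as the critic's training objective, and a stochastic actor that explores multiple futures through critic-guided random search. The paper's exact bound and implementation are not reproduced here, so the sketch below is only an illustrative approximation under stated assumptions: a generic McAllester-style PAC-Bayes objective (empirical Bellman loss, assumed normalized to [0, 1], plus a KL complexity term) and a simple candidate-action search that scores actor samples with the critic. The interfaces `actor.sample(...)`, `critic(...)`, and `kl_posterior_prior` are hypothetical placeholders, not the authors' API.

```python
import math
import torch


def pac_bayes_critic_objective(bellman_losses, kl_posterior_prior, n, delta=0.05):
    """Generic McAllester-style PAC-Bayes objective (NOT the paper's exact bound):
    empirical risk plus a complexity term driven by KL(posterior || prior).
    Assumes per-sample Bellman losses are normalized to [0, 1] and that
    `kl_posterior_prior` is a scalar tensor."""
    empirical_risk = bellman_losses.mean()
    complexity = torch.sqrt(
        (kl_posterior_prior + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n)
    )
    return empirical_risk + complexity


def critic_guided_random_search(actor, critic, state, num_candidates=16):
    """Draw several candidate actions from the stochastic actor and keep the one
    the critic values most highly -- one simple way to 'explore multiple futures'."""
    # Replicate the (1-D) state so each candidate action is scored on the same state.
    states = state.unsqueeze(0).expand(num_candidates, -1)  # (K, state_dim)
    actions = actor.sample(states)                           # (K, action_dim), hypothetical API
    q_values = critic(states, actions).squeeze(-1)           # (K,)
    return actions[torch.argmax(q_values)]
```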
Related papers
- ACE: Off-Policy Actor-Critic with Causality-Aware Entropy Regularization [52.5587113539404]
We introduce a causality-aware entropy term that effectively identifies and prioritizes actions with high potential impacts for efficient exploration.
Our proposed algorithm, ACE: Off-policy Actor-critic with Causality-aware Entropy regularization, demonstrates a substantial performance advantage across 29 diverse continuous control tasks.
arXiv Detail & Related papers (2024-02-22T13:22:06Z)
- Actor Prioritized Experience Replay [0.0]
Prioritized Experience Replay (PER) allows agents to learn from transitions sampled with non-uniform probability proportional to their temporal-difference (TD) error.
We introduce a novel experience replay sampling framework for actor-critic methods, which also accounts for stability issues and recent findings behind the poor empirical performance of PER.
An extensive set of experiments verifies our theoretical claims and demonstrates that the introduced method significantly outperforms the competing approaches.
arXiv Detail & Related papers (2022-09-01T15:27:46Z)
- Simultaneous Double Q-learning with Conservative Advantage Learning for Actor-Critic Methods [133.85604983925282]
We propose Simultaneous Double Q-learning with Conservative Advantage Learning (SDQ-CAL)
Our algorithm realizes less biased value estimation and achieves state-of-the-art performance in a range of continuous control benchmark tasks.
arXiv Detail & Related papers (2022-05-08T09:17:16Z)
- Wasserstein Flow Meets Replicator Dynamics: A Mean-Field Analysis of Representation Learning in Actor-Critic [137.04558017227583]
Actor-critic (AC) algorithms, empowered by neural networks, have had significant empirical success in recent years.
We take a mean-field perspective on the evolution and convergence of feature-based neural AC.
We prove that neural AC finds the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2021-12-27T06:09:50Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
- Analysis of a Target-Based Actor-Critic Algorithm with Linear Function Approximation [2.1592777170316366]
Actor-critic methods integrating target networks have exhibited a stupendous empirical success in deep reinforcement learning.
We bridge the gap between this empirical success and its theoretical understanding by proposing the first theoretical analysis of an online target-based actor-critic with linear function approximation in the discounted reward setting.
arXiv Detail & Related papers (2021-06-14T14:59:05Z)
- Behavior-Guided Actor-Critic: Improving Exploration via Learning Policy Behavior Representation for Deep Reinforcement Learning [0.0]
We propose Behavior-Guided Actor-Critic (BAC) as an off-policy actor-critic deep RL algorithm.
BAC mathematically formulates the behavior of the policy through autoencoders.
Results show that BAC performs considerably better than several cutting-edge learning algorithms.
arXiv Detail & Related papers (2021-04-09T15:22:35Z)
- Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality [131.45028999325797]
We develop a doubly robust off-policy AC (DR-Off-PAC) for discounted MDP.
DR-Off-PAC adopts a single-timescale structure, in which both the actor and the critics are updated simultaneously with a constant step size.
We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy.
arXiv Detail & Related papers (2021-02-23T18:56:13Z)
- How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization [10.424426548124696]
We propose MAGE, a model-based actor-critic algorithm, grounded in the theory of policy gradients.
MAGE backpropagates through the learned dynamics to compute gradient targets in temporal difference learning.
We demonstrate the efficiency of the algorithm in comparison to model-free and model-based state-of-the-art baselines.
arXiv Detail & Related papers (2020-04-29T16:30:53Z)
- Online Meta-Critic Learning for Off-Policy Actor-Critic Methods [107.98781730288897]
Off-Policy Actor-Critic (Off-PAC) methods have proven successful in a variety of continuous control tasks.
We introduce a novel and flexible meta-critic that observes the learning process and meta-learns an additional loss for the actor.
arXiv Detail & Related papers (2020-03-11T14:39:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.