Exploring Pessimism and Optimism Dynamics in Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2406.03890v1
- Date: Thu, 6 Jun 2024 09:26:02 GMT
- Title: Exploring Pessimism and Optimism Dynamics in Deep Reinforcement Learning
- Authors: Bahareh Tasdighi, Nicklas Werge, Yi-Shan Wu, Melih Kandemir
- Abstract summary: Off-policy actor-critic algorithms have shown promise in deep reinforcement learning for continuous control tasks.
We introduce Utility Soft Actor-Critic (USAC), a novel framework that enables independent control over the degree of pessimism/optimism for both the actor and the critic.
USAC represents a significant step towards achieving balance within off-policy actor-critic algorithms.
- Score: 13.374594152438691
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy actor-critic algorithms have shown promise in deep reinforcement learning for continuous control tasks. Their success largely stems from leveraging pessimistic state-action value function updates, which effectively address function approximation errors and improve performance. However, such pessimism can lead to under-exploration, constraining the agent's ability to explore and refine its policies. Conversely, optimism can counteract under-exploration, but, if not properly balanced, it can lead to excessive risk-taking and poor convergence. Based on these insights, we introduce Utility Soft Actor-Critic (USAC), a novel framework within the actor-critic paradigm that enables independent control over the degree of pessimism/optimism for both the actor and the critic via interpretable parameters. USAC adapts its exploration strategy based on the uncertainty of the critics through a utility function that allows us to balance between pessimism and optimism separately. By going beyond binary choices of optimism and pessimism, USAC represents a significant step towards achieving balance within off-policy actor-critic algorithms. Our experiments across various continuous control problems show that the appropriate degree of pessimism or optimism depends on the nature of the task. Furthermore, we demonstrate that USAC can outperform state-of-the-art algorithms when its pessimism/optimism parameters are appropriately configured.
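To make the independent-control idea concrete, here is a minimal sketch in which pessimism/optimism is realized as an exponential utility over a small critic ensemble; the function name, the two-critic setup, and the specific utility choice are illustrative assumptions, not necessarily USAC's exact formulation.

```python
# Minimal sketch of utility-based pessimism/optimism control, assuming an
# exponential utility over a critic ensemble (an illustrative choice, not
# necessarily USAC's exact formulation). A single parameter tau interpolates
# between pessimism (tau < 0), neutrality (tau -> 0), and optimism (tau > 0);
# actor and critic can each use their own tau, giving independent control.
import numpy as np

def utility_value(q_values: np.ndarray, tau: float) -> np.ndarray:
    """Certainty equivalent (1/tau) * log E[exp(tau * Q)] over the ensemble.

    q_values: shape (n_critics, batch) of ensemble Q estimates.
    tau:      < 0 pessimistic, > 0 optimistic, ~0 recovers the ensemble mean.
    """
    if abs(tau) < 1e-8:
        return q_values.mean(axis=0)          # neutral limit
    m = (tau * q_values).max(axis=0)          # stabilize the log-sum-exp
    return (m + np.log(np.exp(tau * q_values - m).mean(axis=0))) / tau

# Independent degrees of pessimism/optimism for critic targets and the actor:
q_ensemble = np.array([[1.0, 2.0, 0.5],       # critic 1
                       [1.4, 1.1, 0.9]])      # critic 2
critic_target = utility_value(q_ensemble, tau=-5.0)  # pessimistic backup
actor_value = utility_value(q_ensemble, tau=2.0)     # mildly optimistic actor
```

In the tau -> -inf limit this recovers the familiar minimum over critics (as in TD3/SAC-style clipped double-Q learning), and in the tau -> +inf limit the optimistic maximum, so the binary pessimism/optimism choice becomes a continuum.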
Related papers
- DROP: Distributional and Regular Optimism and Pessimism for Reinforcement Learning [6.20048328543366]
This paper introduces a novel, theoretically grounded model of optimism and pessimism, derived from control as inference.
Although the base model alone showed poor learning performance, DROP achieved excellent performance across all tasks, with high generality.
arXiv Detail & Related papers (2024-10-22T23:14:09Z)
- Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic [48.94340387130627]
Critic-CoT is a framework that pushes LLMs toward System-2-like critic capability.
The framework works through a step-wise CoT reasoning paradigm and the automatic construction of distant-supervision data, without human annotation.
Experiments on GSM8K and MATH demonstrate that our enhanced model significantly boosts task-solving performance.
arXiv Detail & Related papers (2024-08-29T08:02:09Z)
- A Case for Validation Buffer in Pessimistic Actor-Critic [1.5022206231191775]
We show that the critic approximation error can be approximated via a fixed-point model similar to that of the Bellman value.
We retrieve the conditions under which the pessimistic critic is unbiased and propose the Validation Pessimism Learning (VPL) algorithm.
VPL uses a small validation buffer to adjust the level of pessimism throughout agent training, with the pessimism set such that the approximation error of the critic targets is minimized (a minimal sketch follows this entry).
arXiv Detail & Related papers (2024-03-01T22:24:11Z)
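As a companion to the VPL entry above, here is a minimal sketch under two assumptions that are ours, not the paper's: pessimism is a coefficient beta on the critic-ensemble spread, and beta is nudged so that the signed error of the targets on a held-out validation buffer is driven to zero.

```python
# Illustrative sketch of validation-buffer pessimism tuning in the spirit of
# VPL. Assumptions (not the paper's exact rule): pessimism is a coefficient
# beta on the critic-ensemble spread, and beta is adjusted so that the signed
# target error on a held-out validation buffer goes to zero.
import numpy as np

def pessimistic_target(q_ensemble: np.ndarray, beta: float) -> np.ndarray:
    """Target = ensemble mean minus beta times ensemble std."""
    return q_ensemble.mean(axis=0) - beta * q_ensemble.std(axis=0)

def update_beta(beta: float, q_ensemble: np.ndarray,
                returns: np.ndarray, lr: float = 1e-2) -> float:
    """If targets overestimate observed returns, increase pessimism;
    if they underestimate, decrease it."""
    bias = (pessimistic_target(q_ensemble, beta) - returns).mean()
    return beta + lr * bias   # signed error drives beta toward unbiasedness

# Toy usage: validation batch of 3 states, two-critic ensemble.
q_val = np.array([[1.0, 2.0, 0.5],
                  [1.6, 1.2, 0.7]])
returns = np.array([1.1, 1.4, 0.5])   # e.g. regression targets on held-out data
beta = 0.5
for _ in range(100):
    beta = update_beta(beta, q_val, returns)
```

With this sign convention, overestimation on the validation buffer increases beta (more pessimism) and underestimation decreases it, so beta settles where the targets are unbiased.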
- Bandit Social Learning: Exploration under Myopic Behavior [54.767961587919075]
We study social learning dynamics motivated by reviews on online platforms.
Agents collectively follow a simple multi-armed bandit protocol, but each agent acts myopically, without regard to exploration.
We derive stark learning failures for any such behavior, and provide matching positive results.
arXiv Detail & Related papers (2023-02-15T01:57:57Z)
- Optimizing Pessimism in Dynamic Treatment Regimes: A Bayesian Learning Approach [6.7826352751791985]
We propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting.
We integrate the pessimism principle with Thompson sampling and Bayesian machine learning for optimizing the degree of pessimism.
We develop a computational algorithm based on variational inference that is highly efficient and scalable.
arXiv Detail & Related papers (2022-10-26T02:14:10Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework (a minimal sketch follows this entry).
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
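For the optimistic-exploration entry above, the sketch below scores candidate actions with an approximate upper confidence bound over a critic ensemble; the two-critic ensemble and the bonus coefficient are illustrative assumptions rather than the paper's exact objective.

```python
# Illustrative sketch of an optimistic exploration objective: act with a
# policy that maximizes an approximate upper confidence bound (UCB) over a
# critic ensemble. The two-critic setup and the bonus coefficient are
# illustrative assumptions.
import numpy as np

def ucb_value(q_ensemble: np.ndarray, bonus: float = 1.0) -> np.ndarray:
    """Approximate UCB: ensemble mean plus a multiple of the ensemble std."""
    return q_ensemble.mean(axis=0) + bonus * q_ensemble.std(axis=0)

# Candidate actions scored by two critics; the exploration policy prefers
# the action with the highest optimistic value, not the highest mean.
q_candidates = np.array([[0.9, 1.0, 0.2],    # critic 1, three actions
                         [1.1, 0.2, 0.4]])   # critic 2
explore_action = int(np.argmax(ucb_value(q_candidates, bonus=2.0)))
greedy_action = int(np.argmax(q_candidates.mean(axis=0)))
```

Here the exploration policy would favor action 1, where the critics disagree most, while a greedy policy would pick action 0; the disagreement bonus is what drives directed exploration.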
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- On the Optimality of Batch Policy Optimization Algorithms [106.89498352537682]
Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment.
We show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral.
We introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.
arXiv Detail & Related papers (2021-04-06T05:23:20Z)
- Deep Reinforcement Learning with Dynamic Optimism [29.806071693039655]
We show that the optimal degree of optimism can vary both across tasks and over the course of learning.
Inspired by this insight, we introduce a novel deep actor-critic algorithm that switches between optimistic and pessimistic value learning online.
arXiv Detail & Related papers (2021-02-07T10:28:09Z)
- Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to the Bellman optimality and evaluation back-ups, taking a more conservative update, can yield much stronger guarantees.
arXiv Detail & Related papers (2020-07-16T09:25:54Z)
- Reparameterized Variational Divergence Minimization for Stable Imitation [57.06909373038396]
We study the extent to which variations in the choice of probabilistic divergence may yield more performant imitation learning from observation (ILO) algorithms.
We contribute a reparameterization trick for adversarial imitation learning to alleviate the challenges of the promising $f$-divergence minimization framework (a generic sketch of the trick follows this entry).
Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
arXiv Detail & Related papers (2020-06-18T19:04:09Z)
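The entry above hinges on a reparameterization trick; as a generic illustration (not the paper's specific construction for adversarial imitation), the following sketch shows the standard Gaussian reparameterization that makes sampled actions differentiable with respect to policy parameters.

```python
# Generic Gaussian reparameterization trick: sampling a = mu + sigma * eps
# with eps ~ N(0, I) makes the sample differentiable with respect to the
# policy parameters mu and log_sigma. This is an illustration of the general
# technique, not the paper's specific construction.
import torch

mu = torch.zeros(2, requires_grad=True)         # policy mean (learnable)
log_sigma = torch.zeros(2, requires_grad=True)  # log std-dev (learnable)

eps = torch.randn(2)                 # noise is sampled outside the graph
action = mu + log_sigma.exp() * eps  # differentiable w.r.t. mu, log_sigma

loss = (action ** 2).sum()           # stand-in for a downstream objective
loss.backward()                      # gradients flow to mu and log_sigma
print(mu.grad, log_sigma.grad)
```

Because the noise eps is sampled outside the computation graph, gradients of any downstream objective flow through the deterministic map into mu and log_sigma, which is what stabilizes gradient-based training of stochastic policies.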