Deep Reinforcement Learning with Dynamic Optimism
- URL: http://arxiv.org/abs/2102.03765v2
- Date: Tue, 9 Feb 2021 09:29:55 GMT
- Title: Deep Reinforcement Learning with Dynamic Optimism
- Authors: Ted Moskovitz, Jack Parker-Holder, Aldo Pacchiano, Michael Arbel
- Abstract summary: We show that the optimal degree of optimism can vary both across tasks and over the course of learning.
Inspired by this insight, we introduce a novel deep actor-critic algorithm to switch between optimistic and pessimistic value learning online.
- Score: 29.806071693039655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, deep off-policy actor-critic algorithms have become a
dominant approach to reinforcement learning for continuous control. This comes
after a series of breakthroughs to address function approximation errors, which
previously led to poor performance. These insights encourage the use of
pessimistic value updates. However, this discourages exploration and runs
counter to theoretical support for the efficacy of optimism in the face of
uncertainty. So which approach is best? In this work, we show that the optimal
degree of optimism can vary both across tasks and over the course of learning.
Inspired by this insight, we introduce a novel deep actor-critic algorithm,
Dynamic Optimistic and Pessimistic Estimation (DOPE), to switch between
optimistic and pessimistic value learning online by formulating the selection
as a multi-armed bandit problem. We show in a series of challenging continuous
control tasks that DOPE outperforms existing state-of-the-art methods, which
rely on a fixed degree of optimism. Since our changes are simple to implement,
we believe these insights can be extended to a number of off-policy algorithms.
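To make the idea concrete, here is a minimal, illustrative sketch of bandit-controlled optimism: a small set of candidate optimism levels is treated as bandit arms, each arm blends twin-critic estimates into a more or less optimistic bootstrap target, and episodic returns drive the arm selection. The candidate beta values, the mean/spread target formula, and the simplified EXP3-style bandit below are assumptions for illustration, not the paper's exact design.
```python
"""Illustrative sketch only: a bandit chooses the degree of optimism used when
forming critic targets, in the spirit of DOPE's online switching between
optimistic and pessimistic value learning.  The beta arm set, target formula,
and simplified EXP3-style bandit are assumptions, not the paper's exact design."""
import numpy as np


class SoftmaxBandit:
    """Simplified EXP3-style bandit over a small set of arms."""

    def __init__(self, n_arms, lr=0.1):
        self.log_weights = np.zeros(n_arms)
        self.lr = lr
        self.probs = np.full(n_arms, 1.0 / n_arms)

    def sample(self):
        # Softmax over the accumulated log-weights gives the arm distribution.
        w = self.log_weights - self.log_weights.max()
        self.probs = np.exp(w) / np.exp(w).sum()
        return np.random.choice(len(self.probs), p=self.probs)

    def update(self, arm, reward):
        # Importance-weight the feedback so rarely played arms are not penalised.
        self.log_weights[arm] += self.lr * reward / self.probs[arm]


def critic_target(q1, q2, beta):
    """Blend twin-critic estimates: beta < 0 leans pessimistic (toward the min),
    beta > 0 leans optimistic (toward the max), beta = 0 uses the mean."""
    mean = 0.5 * (q1 + q2)
    spread = 0.5 * abs(q1 - q2)
    return mean + beta * spread


if __name__ == "__main__":
    betas = [-1.0, 0.0, 1.0]              # hypothetical arm set of optimism levels
    bandit = SoftmaxBandit(len(betas))
    rng = np.random.default_rng(0)

    for episode in range(20):
        arm = bandit.sample()
        beta = betas[arm]

        # Stand-in for an episode of actor-critic training that bootstraps
        # from critic_target(q1, q2, beta) in its value updates.
        q1, q2 = rng.normal(size=2)
        _ = critic_target(q1, q2, beta)
        episode_return = rng.normal()     # placeholder learning signal

        # Feed back a squashed return so the preferred degree of optimism
        # can change over the course of learning.
        bandit.update(arm, float(np.tanh(episode_return)))
```
In a full agent, critic_target would take the place of the fixed pessimistic min-of-two-critics target used by methods such as TD3, and the bandit would be updated once per episode from the observed return.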
Related papers
- Exploring Pessimism and Optimism Dynamics in Deep Reinforcement Learning [13.374594152438691]
Off-policy actor-critic algorithms have shown promise in deep reinforcement learning for continuous control tasks.
We introduce Utility Soft Actor-Critic (USAC), a novel framework that enables independent control over the degree of pessimism/optimism for both the actor and the critic.
USAC represents a significant step towards achieving balance within off-policy actor-critic algorithms.
arXiv Detail & Related papers (2024-06-06T09:26:02Z) - Optimizing Pessimism in Dynamic Treatment Regimes: A Bayesian Learning Approach [6.7826352751791985]
We propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting.
We integrate the pessimism principle with Thompson sampling and Bayesian machine learning for optimizing the degree of pessimism.
We develop a computational algorithm based on variational inference that is highly efficient and scalable.
arXiv Detail & Related papers (2022-10-26T02:14:10Z) - Pessimistic Off-Policy Optimization for Learning to Rank [13.733459243449634]
Off-policy learning is a framework for optimizing policies without deploying them.
In recommender systems, this is especially challenging due to the imbalance in logged data.
We study pessimistic off-policy optimization for learning to rank.
arXiv Detail & Related papers (2022-06-06T12:58:28Z) - Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning [50.44564503645015]
We provide improved gap-dependent regret bounds for reinforcement learning in finite episodic Markov decision processes.
We prove tighter upper regret bounds for optimistic algorithms and accompany them with new information-theoretic lower bounds for a large class of MDPs.
arXiv Detail & Related papers (2021-07-02T20:36:05Z) - Emphatic Algorithms for Deep Reinforcement Learning [43.17171330951343]
Temporal difference learning algorithms can become unstable when combined with function approximation and off-policy sampling.
The emphatic temporal difference algorithm, ETD($\lambda$), ensures convergence in the linear case by appropriately weighting the TD($\lambda$) updates; a minimal linear-case sketch of this weighting is given after this list.
We show that naively adapting ETD($\lambda$) to popular deep reinforcement learning algorithms, which use forward-view multi-step returns, results in poor performance.
arXiv Detail & Related papers (2021-06-21T12:11:39Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - Robust Value Iteration for Continuous Control Tasks [99.00362538261972]
When transferring a control policy from simulation to a physical system, the policy needs to be robust to variations in the dynamics to perform well.
We present Robust Fitted Value Iteration, which uses dynamic programming to compute the optimal value function on the compact state domain.
We show that Robust Fitted Value Iteration is more robust than deep reinforcement learning algorithms and the non-robust version of the algorithm.
arXiv Detail & Related papers (2021-05-25T19:48:35Z) - On the Optimality of Batch Policy Optimization Algorithms [106.89498352537682]
Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment.
We show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral.
We introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.
arXiv Detail & Related papers (2021-04-06T05:23:20Z) - Inverse Reinforcement Learning from a Gradient-based Learner [41.8663538249537]
Inverse Reinforcement Learning addresses the problem of inferring an expert's reward function from demonstrations.
In this paper, we propose a new algorithm for this setting, in which the goal is to recover the reward function being optimized by an agent.
arXiv Detail & Related papers (2020-07-15T16:41:00Z) - Reparameterized Variational Divergence Minimization for Stable Imitation [57.06909373038396]
We study the extent to which variations in the choice of probabilistic divergence may yield more performant imitation learning from observation (ILO) algorithms.
We contribute a reparameterization trick for adversarial imitation learning to alleviate the challenges of the promising $f$-divergence minimization framework.
Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
arXiv Detail & Related papers (2020-06-18T19:04:09Z) - Optimizing for the Future in Non-Stationary MDPs [52.373873622008944]
We present a policy gradient algorithm that maximizes a forecast of future performance.
We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques.
arXiv Detail & Related papers (2020-05-17T03:41:19Z)
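As referenced in the emphatic-TD entry above, the following is a minimal linear-case sketch of ETD($\lambda$): a follow-on trace and an emphasis term re-weight the TD($\lambda$) updates under off-policy sampling. The toy chain environment, tabular features, and hyperparameters are illustrative assumptions, not taken from the cited paper.
```python
"""Illustrative linear-case sketch of ETD(lambda): a follow-on trace F and
emphasis M re-weight TD(lambda) updates under off-policy sampling.  The toy
chain MDP, tabular features, and step sizes are assumptions."""
import numpy as np


def run_etd(num_steps=5000, gamma=0.6, lam=0.5, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n_states = 5
    phi = np.eye(n_states)          # tabular features for a small chain MDP
    theta = np.zeros(n_states)      # weights of the linear value function
    e = np.zeros(n_states)          # eligibility trace
    F, rho_prev = 0.0, 1.0          # follow-on trace and previous step's rho
    s = 0

    for _ in range(num_steps):
        # Behaviour policy: uniform left/right; target policy: always right.
        a = rng.integers(2)                       # 0 = left, 1 = right
        rho = 2.0 if a == 1 else 0.0              # pi(a|s) / b(a|s)
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0

        # Emphatic weighting: follow-on trace and emphasis.
        interest = 1.0
        F = interest + gamma * rho_prev * F
        M = lam * interest + (1.0 - lam) * F

        # Standard TD error; the trace carries the emphasis and the ratio rho.
        delta = r + gamma * theta @ phi[s_next] - theta @ phi[s]
        e = rho * (gamma * lam * e + M * phi[s])
        theta = theta + alpha * delta * e

        rho_prev, s = rho, s_next

    return theta


if __name__ == "__main__":
    print(run_etd())   # estimated state values under the always-right target policy
```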
This list is automatically generated from the titles and abstracts of the papers in this site.