Towards Tractable Optimism in Model-Based Reinforcement Learning
- URL: http://arxiv.org/abs/2006.11911v2
- Date: Fri, 3 Dec 2021 21:16:42 GMT
- Title: Towards Tractable Optimism in Model-Based Reinforcement Learning
- Authors: Aldo Pacchiano and Philip J. Ball and Jack Parker-Holder and Krzysztof
Choromanski and Stephen Roberts
- Abstract summary: To be successful, an optimistic RL algorithm must over-estimate the true value function (optimism) but not by so much that it is inaccurate (estimation error).
We re-interpret these scalable optimistic model-based algorithms as solving a tractable noise augmented MDP.
We show that if this error is reduced, optimistic model-based RL algorithms can match state-of-the-art performance in continuous control problems.
- Score: 37.51073590932658
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The principle of optimism in the face of uncertainty is prevalent throughout
sequential decision making problems such as multi-armed bandits and
reinforcement learning (RL). To be successful, an optimistic RL algorithm must
over-estimate the true value function (optimism) but not by so much that it is
inaccurate (estimation error). In the tabular setting, many state-of-the-art
methods produce the required optimism through approaches which are intractable
when scaling to deep RL. We re-interpret these scalable optimistic model-based
algorithms as solving a tractable noise augmented MDP. This formulation
achieves a competitive regret bound: $\tilde{\mathcal{O}}(
|\mathcal{S}|H\sqrt{|\mathcal{A}| T } )$ when augmenting using Gaussian noise,
where $T$ is the total number of environment steps. We also explore how this
trade-off changes in the deep RL setting, where we show empirically that
estimation error is significantly more troublesome. However, we also show that
if this error is reduced, optimistic model-based RL algorithms can match
state-of-the-art performance in continuous control problems.
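To make the noise-augmented view concrete, here is a minimal sketch assuming a tabular model: empirical rewards are perturbed with Gaussian noise whose scale shrinks with visit counts, and the agent then plans in the perturbed MDP. The function and parameter names (`plan_noise_augmented`, `noise_scale`) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of optimism via a noise-augmented tabular MDP
# (illustrative only; not the paper's exact algorithm).
import numpy as np

def plan_noise_augmented(P_hat, R_hat, counts, H, noise_scale=1.0, rng=None):
    """P_hat: (S, A, S) estimated transitions, R_hat: (S, A) estimated rewards,
    counts: (S, A) visit counts, H: episode horizon."""
    rng = np.random.default_rng() if rng is None else rng
    # Gaussian perturbation with std shrinking as 1/sqrt(visit count),
    # so rarely visited state-action pairs receive more optimism.
    R_tilde = R_hat + rng.normal(0.0, noise_scale / np.sqrt(np.maximum(counts, 1)))
    S, _ = R_hat.shape
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):          # finite-horizon value iteration
        Q = R_tilde + P_hat @ V           # (S, A) action values
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi                             # greedy policy in the perturbed MDP
```

Planning in the perturbed MDP is no harder than ordinary value iteration, which is what makes this route tractable compared with optimizing over explicit confidence sets.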
Related papers
- Accelerated zero-order SGD under high-order smoothness and overparameterized regime [79.85163929026146]
We present a novel gradient-free algorithm to solve convex optimization problems.
Such problems are encountered in medicine, physics, and machine learning.
We provide convergence guarantees for the proposed algorithm under both types of noise.
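As a rough sketch of the gradient-free setting this summary refers to, the snippet below uses a two-point random-direction estimate of the gradient; the paper's own smoothing scheme, step sizes, and noise model may differ.

```python
# Hedged sketch of a two-point zero-order (gradient-free) descent step;
# names and constants are illustrative.
import numpy as np

def zero_order_step(f, x, lr=0.1, mu=1e-3, rng=None):
    """One descent step on f using only function evaluations."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)                      # random unit direction
    # Dimension factor makes the sphere-smoothing estimator roughly unbiased.
    grad_est = x.size * (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return x - lr * grad_est

# Example: minimize a simple convex quadratic without gradients.
x = np.ones(5)
for _ in range(2000):
    x = zero_order_step(lambda z: float(np.sum(z ** 2)), x)
```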
arXiv Detail & Related papers (2024-11-21T10:26:17Z) - Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
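A hedged sketch of the core mechanism named here: take Langevin Monte Carlo steps on the Q-function parameters (a gradient step on a loss plus injected Gaussian noise), then act greedily with respect to the sampled Q, in the spirit of Thompson sampling. The loss, step size, and parameterization below are placeholders rather than the paper's construction.

```python
# One Langevin Monte Carlo step: theta <- theta - eta * grad + sqrt(2*eta) * noise.
import numpy as np

def lmc_update(theta, grad_loss, step=1e-3, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(theta.shape)
    return theta - step * grad_loss(theta) + np.sqrt(2 * step) * noise

# Example: approximate samples from N(0, I), whose negative log-density
# has gradient equal to theta itself.
theta = np.zeros(4)
for _ in range(1000):
    theta = lmc_update(theta, grad_loss=lambda t: t)
```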
arXiv Detail & Related papers (2023-05-29T17:11:28Z) - Model-Based Reinforcement Learning with Multinomial Logistic Function Approximation [10.159501412046508]
We study model-based reinforcement learning (RL) for episodic Markov decision processes (MDPs).
We establish a provably efficient RL algorithm for the MDP whose state transition is given by a multinomial logistic model.
To the best of our knowledge, this is the first model-based RL algorithm with multinomial logistic function approximation with provable guarantees.
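For concreteness, a multinomial logistic (softmax) transition model of the kind named above can be written as P(s' | s, a) proportional to exp(phi(s, a, s')^T theta); the feature map and parameterization below are illustrative assumptions.

```python
# Hedged sketch of a multinomial logistic (softmax) transition model.
import numpy as np

def transition_probs(phi_next, theta):
    """phi_next: (num_next_states, d) features for each candidate next state
    of a fixed (s, a); theta: (d,) parameters. Returns P(. | s, a)."""
    logits = phi_next @ theta
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()
```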
arXiv Detail & Related papers (2022-12-27T16:25:09Z) - Delayed Geometric Discounts: An Alternative Criterion for Reinforcement
Learning [1.52292571922932]
Reinforcement learning (RL) provides a theoretical framework for learning optimal behaviors.
In practice, RL algorithms rely on geometric discounts to evaluate this optimality.
In this paper, we tackle these issues by generalizing the discounted problem formulation with a family of delayed objective functions.
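For reference, the geometric discount mentioned here is shown below alongside one plausible "delayed" variant in which discounting only begins after a fixed delay; the latter is purely illustrative, since the summary does not spell out the paper's family of delayed objectives.

```python
# Geometric discounting plus an illustrative delayed variant (an assumption,
# not necessarily the paper's definition).

def geometric_return(rewards, gamma=0.99):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def delayed_geometric_return(rewards, gamma=0.99, delay=5):
    # Rewards inside the delay window are left undiscounted.
    return sum(gamma ** max(t - delay, 0) * r for t, r in enumerate(rewards))
```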
arXiv Detail & Related papers (2022-09-26T07:49:38Z) - Human-in-the-loop: Provably Efficient Preference-based Reinforcement
Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
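A common way to model the trajectory-preference feedback described here is a Bradley-Terry style model, sketched below; the paper's general function approximation setting is broader than this returns-based illustration.

```python
# Hedged sketch: probability that trajectory tau1 is preferred over tau2
# under a Bradley-Terry model on estimated returns.
import numpy as np

def preference_prob(return_tau1, return_tau2):
    return 1.0 / (1.0 + np.exp(-(return_tau1 - return_tau2)))
```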
arXiv Detail & Related papers (2022-05-23T09:03:24Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - A Generalised Inverse Reinforcement Learning Framework [24.316047317028147]
The goal of inverse reinforcement learning (IRL) is to estimate the unknown cost function of some MDP based on observed trajectories.
We introduce an alternative training loss that puts more weight on future states, which yields a reformulation of the (maximum entropy) IRL problem.
The algorithms we devised exhibit enhanced performance (and similar tractability) compared to off-the-shelf ones in multiple OpenAI gym environments.
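For orientation, the classic (linear-reward) maximum-entropy IRL update that this reformulation builds on moves the cost/reward parameters toward matching the expert's feature expectations; the future-state reweighting described in the summary is not reproduced in this sketch.

```python
# Hedged sketch of a MaxEnt IRL gradient step with linear reward features.
def maxent_irl_step(theta, expert_feat, policy_feat, lr=0.1):
    """theta, expert_feat, policy_feat: numpy arrays of reward parameters and
    empirical feature expectations under the expert and current policy."""
    return theta + lr * (expert_feat - policy_feat)
```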
arXiv Detail & Related papers (2021-05-25T10:30:45Z) - Online Model Selection for Reinforcement Learning with Function
Approximation [50.008542459050155]
We present a meta-algorithm that adapts to the optimal complexity with $\tilde{O}(L^{5/6} T^{2/3})$ regret.
We also show that the meta-algorithm automatically admits significantly improved instance-dependent regret bounds.
arXiv Detail & Related papers (2020-11-19T10:00:54Z) - Efficient Model-Based Reinforcement Learning through Optimistic Policy
Search and Planning [93.1435980666675]
We show how optimistic exploration can be easily combined with state-of-the-art reinforcement learning algorithms.
Our experiments demonstrate that optimistic exploration significantly speeds up learning when there are penalties on actions.
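One simple way to bolt optimism onto an off-the-shelf RL algorithm, sketched below, is to add an exploration bonus proportional to the dynamics model's epistemic uncertainty; the paper's construction (optimistic planning over plausible dynamics) is richer than this, so treat the snippet as an assumption-laden illustration.

```python
# Hedged sketch: uncertainty-weighted optimistic reward bonus.
import numpy as np

def optimistic_reward(reward, model_std, beta=1.0):
    """reward, model_std: per-transition reward and epistemic uncertainty
    (e.g. std across an ensemble of dynamics models); beta: optimism weight."""
    return np.asarray(reward) + beta * np.asarray(model_std)
```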
arXiv Detail & Related papers (2020-06-15T18:37:38Z)