Towards Tractable Optimism in Model-Based Reinforcement Learning
- URL: http://arxiv.org/abs/2006.11911v2
- Date: Fri, 3 Dec 2021 21:16:42 GMT
- Title: Towards Tractable Optimism in Model-Based Reinforcement Learning
- Authors: Aldo Pacchiano and Philip J. Ball and Jack Parker-Holder and Krzysztof
Choromanski and Stephen Roberts
- Abstract summary: To be successful, an optimistic RL algorithm must over-estimate the true value function (optimism) but not by so much that it is inaccurate (estimation error).
We re-interpret these scalable optimistic model-based algorithms as solving a tractable noise augmented MDP.
We show that if this error is reduced, optimistic model-based RL algorithms can match state-of-the-art performance in continuous control problems.
- Score: 37.51073590932658
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The principle of optimism in the face of uncertainty is prevalent throughout
sequential decision making problems such as multi-armed bandits and
reinforcement learning (RL). To be successful, an optimistic RL algorithm must
over-estimate the true value function (optimism) but not by so much that it is
inaccurate (estimation error). In the tabular setting, many state-of-the-art
methods produce the required optimism through approaches which are intractable
when scaling to deep RL. We re-interpret these scalable optimistic model-based
algorithms as solving a tractable noise augmented MDP. This formulation
achieves a competitive regret bound: $\tilde{\mathcal{O}}(
|\mathcal{S}|H\sqrt{|\mathcal{A}| T } )$ when augmenting using Gaussian noise,
where $T$ is the total number of environment steps. We also explore how this
trade-off changes in the deep RL setting, where we show empirically that
estimation error is significantly more troublesome. However, we also show that
if this error is reduced, optimistic model-based RL algorithms can match
state-of-the-art performance in continuous control problems.
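To make the noise-augmented view concrete, here is a minimal sketch assuming a tabular model: empirical rewards are perturbed with Gaussian noise whose scale shrinks with visit counts, and the agent then plans in the perturbed MDP. The function and parameter names (`plan_noise_augmented`, `noise_scale`) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of optimism via a noise-augmented tabular MDP
# (illustrative only; not the paper's exact algorithm).
import numpy as np

def plan_noise_augmented(P_hat, R_hat, counts, H, noise_scale=1.0, rng=None):
    """P_hat: (S, A, S) estimated transitions, R_hat: (S, A) estimated rewards,
    counts: (S, A) visit counts, H: episode horizon."""
    rng = np.random.default_rng() if rng is None else rng
    # Gaussian perturbation with std shrinking as 1/sqrt(visit count),
    # so rarely visited state-action pairs receive more optimism.
    R_tilde = R_hat + rng.normal(0.0, noise_scale / np.sqrt(np.maximum(counts, 1)))
    S, _ = R_hat.shape
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):          # finite-horizon value iteration
        Q = R_tilde + P_hat @ V           # (S, A) action values
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi                             # greedy policy in the perturbed MDP
```

Planning in the perturbed MDP is no harder than ordinary value iteration, which is what makes this route tractable compared with optimizing over explicit confidence sets.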
Related papers
- Accelerated zero-order SGD under high-order smoothness and overparameterized regime [79.85163929026146]
We present a novel gradient-free algorithm to solve convex optimization problems.
Such problems are encountered in medicine, physics, and machine learning.
We provide convergence guarantees for the proposed algorithm under both types of noise.
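As a rough sketch of the gradient-free setting this summary refers to, the snippet below uses a two-point random-direction estimate of the gradient; the paper's own smoothing scheme, step sizes, and noise model may differ.

```python
# Hedged sketch of a two-point zero-order (gradient-free) descent step;
# names and constants are illustrative.
import numpy as np

def zero_order_step(f, x, lr=0.1, mu=1e-3, rng=None):
    """One descent step on f using only function evaluations."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)                      # random unit direction
    # Dimension factor makes the sphere-smoothing estimator roughly unbiased.
    grad_est = x.size * (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return x - lr * grad_est

# Example: minimize a simple convex quadratic without gradients.
x = np.ones(5)
for _ in range(2000):
    x = zero_order_step(lambda z: float(np.sum(z ** 2)), x)
```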
arXiv Detail & Related papers (2024-11-21T10:26:17Z) - Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
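A hedged sketch of the core mechanism named here: take Langevin Monte Carlo steps on the Q-function parameters (a gradient step on a loss plus injected Gaussian noise), then act greedily with respect to the sampled Q, in the spirit of Thompson sampling. The loss, step size, and parameterization below are placeholders rather than the paper's construction.

```python
# One Langevin Monte Carlo step: theta <- theta - eta * grad + sqrt(2*eta) * noise.
import numpy as np

def lmc_update(theta, grad_loss, step=1e-3, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(theta.shape)
    return theta - step * grad_loss(theta) + np.sqrt(2 * step) * noise

# Example: approximate samples from N(0, I), whose negative log-density
# has gradient equal to theta itself.
theta = np.zeros(4)
for _ in range(1000):
    theta = lmc_update(theta, grad_loss=lambda t: t)
```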
arXiv Detail & Related papers (2023-05-29T17:11:28Z) - Model-Based Reinforcement Learning with Multinomial Logistic Function Approximation [10.159501412046508]
We study model-based reinforcement learning (RL) for episodic Markov decision processes (MDPs).
We establish a provably efficient RL algorithm for the MDP whose state transition is given by a multinomial logistic model.
To the best of our knowledge, this is the first model-based RL algorithm with multinomial logistic function approximation with provable guarantees.
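For concreteness, a multinomial logistic (softmax) transition model of the kind named above can be written as P(s' | s, a) proportional to exp(phi(s, a, s')^T theta); the feature map and parameterization below are illustrative assumptions.

```python
# Hedged sketch of a multinomial logistic (softmax) transition model.
import numpy as np

def transition_probs(phi_next, theta):
    """phi_next: (num_next_states, d) features for each candidate next state
    of a fixed (s, a); theta: (d,) parameters. Returns P(. | s, a)."""
    logits = phi_next @ theta
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()
```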
arXiv Detail & Related papers (2022-12-27T16:25:09Z) - Delayed Geometric Discounts: An Alternative Criterion for Reinforcement
Learning [1.52292571922932]
Reinforcement learning (RL) provides a theoretical framework for learning optimal behaviors.
In practice, RL algorithms rely on geometric discounts to evaluate this optimality.
In this paper, we tackle these issues by generalizing the discounted problem formulation with a family of delayed objective functions.
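For reference, the geometric discount mentioned here is shown below alongside one plausible "delayed" variant in which discounting only begins after a fixed delay; the latter is purely illustrative, since the summary does not spell out the paper's family of delayed objectives.

```python
# Geometric discounting plus an illustrative delayed variant (an assumption,
# not necessarily the paper's definition).

def geometric_return(rewards, gamma=0.99):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def delayed_geometric_return(rewards, gamma=0.99, delay=5):
    # Rewards inside the delay window are left undiscounted.
    return sum(gamma ** max(t - delay, 0) * r for t, r in enumerate(rewards))
```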
arXiv Detail & Related papers (2022-09-26T07:49:38Z) - Human-in-the-loop: Provably Efficient Preference-based Reinforcement
Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
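A common way to model the trajectory-preference feedback described here is a Bradley-Terry style model, sketched below; the paper's general function approximation setting is broader than this returns-based illustration.

```python
# Hedged sketch: probability that trajectory tau1 is preferred over tau2
# under a Bradley-Terry model on estimated returns.
import numpy as np

def preference_prob(return_tau1, return_tau2):
    return 1.0 / (1.0 + np.exp(-(return_tau1 - return_tau2)))
```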
arXiv Detail & Related papers (2022-05-23T09:03:24Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - A Generalised Inverse Reinforcement Learning Framework [24.316047317028147]
The goal of inverse reinforcement learning (IRL) is to estimate the unknown cost function of some MDP based on observed trajectories.
We introduce an alternative training loss that puts more weight on future states, which yields a reformulation of the (maximum entropy) IRL problem.
The algorithms we devised exhibit enhanced performance (and similar tractability) compared to off-the-shelf ones in multiple OpenAI gym environments.
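For orientation, the classic (linear-reward) maximum-entropy IRL update that this reformulation builds on moves the cost/reward parameters toward matching the expert's feature expectations; the future-state reweighting described in the summary is not reproduced in this sketch.

```python
# Hedged sketch of a MaxEnt IRL gradient step with linear reward features.
def maxent_irl_step(theta, expert_feat, policy_feat, lr=0.1):
    """theta, expert_feat, policy_feat: numpy arrays of reward parameters and
    empirical feature expectations under the expert and current policy."""
    return theta + lr * (expert_feat - policy_feat)
```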
arXiv Detail & Related papers (2021-05-25T10:30:45Z) - Online Model Selection for Reinforcement Learning with Function
Approximation [50.008542459050155]
We present a meta-algorithm that adapts to the optimal complexity with $\tilde{O}(L^{5/6} T^{2/3})$ regret.
We also show that the meta-algorithm automatically admits significantly improved instance-dependent regret bounds.
arXiv Detail & Related papers (2020-11-19T10:00:54Z) - Efficient Model-Based Reinforcement Learning through Optimistic Policy
Search and Planning [93.1435980666675]
We show how optimistic exploration can be easily combined with state-of-the-art reinforcement learning algorithms.
Our experiments demonstrate that optimistic exploration significantly speeds up learning when there are penalties on actions.
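One simple way to bolt optimism onto an off-the-shelf RL algorithm, sketched below, is to add an exploration bonus proportional to the dynamics model's epistemic uncertainty; the paper's construction (optimistic planning over plausible dynamics) is richer than this, so treat the snippet as an assumption-laden illustration.

```python
# Hedged sketch: uncertainty-weighted optimistic reward bonus.
import numpy as np

def optimistic_reward(reward, model_std, beta=1.0):
    """reward, model_std: per-transition reward and epistemic uncertainty
    (e.g. std across an ensemble of dynamics models); beta: optimism weight."""
    return np.asarray(reward) + beta * np.asarray(model_std)
```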
arXiv Detail & Related papers (2020-06-15T18:37:38Z)