Guarantees for Epsilon-Greedy Reinforcement Learning with Function
Approximation
- URL: http://arxiv.org/abs/2206.09421v1
- Date: Sun, 19 Jun 2022 14:44:40 GMT
- Title: Guarantees for Epsilon-Greedy Reinforcement Learning with Function
Approximation
- Authors: Christoph Dann, Yishay Mansour, Mehryar Mohri, Ayush Sekhari, Karthik
Sridharan
- Abstract summary: Myopic exploration policies such as epsilon-greedy, softmax, or Gaussian noise fail to explore efficiently in some reinforcement learning tasks.
This paper presents a theoretical analysis of such policies and provides the first regret and sample-complexity bounds for reinforcement learning with myopic exploration.
- Score: 69.1524391595912
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Myopic exploration policies such as epsilon-greedy, softmax, or Gaussian
noise fail to explore efficiently in some reinforcement learning tasks and yet,
they perform well in many others. In fact, in practice, they are often selected
as the top choices, due to their simplicity. But, for what tasks do such
policies succeed? Can we give theoretical guarantees for their favorable
performance? These crucial questions have been scarcely investigated, despite
the prominent practical importance of these policies. This paper presents a
theoretical analysis of such policies and provides the first regret and
sample-complexity bounds for reinforcement learning with myopic exploration.
Our results apply to value-function-based algorithms in episodic MDPs with
bounded Bellman Eluder dimension. We propose a new complexity measure called
myopic exploration gap, denoted by alpha, that captures a structural property
of the MDP, the exploration policy and the given value function class. We show
that the sample-complexity of myopic exploration scales quadratically with the
inverse of this quantity, 1 / alpha^2. We further demonstrate through concrete
examples that myopic exploration gap is indeed favorable in several tasks where
myopic exploration succeeds, due to the corresponding dynamics and reward
structure.
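To make the myopic exploration policies discussed in the abstract concrete, here is a minimal, self-contained sketch (in Python, with placeholder Q-value estimates; this is an illustration of the general policies, not the paper's algorithm) of epsilon-greedy and softmax action selection on top of a learned value function.

```python
# Sketch of the myopic exploration policies named in the abstract:
# epsilon-greedy and softmax (Boltzmann) action selection over Q-estimates.
# The Q-values below are placeholders; in the paper's setting they would
# come from a value-function class fit on episodic data.
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values: np.ndarray, epsilon: float) -> int:
    """With probability epsilon pick a uniformly random action;
    otherwise act greedily with respect to the current Q-estimates."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_policy(q_values: np.ndarray, temperature: float) -> int:
    """Sample an action with probability proportional to exp(Q / temperature)."""
    logits = q_values / temperature
    logits -= logits.max()          # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

# Example: hypothetical Q-estimates for a 4-action state.
q = np.array([0.1, 0.5, 0.4, 0.2])
print(epsilon_greedy(q, epsilon=0.1))
print(softmax_policy(q, temperature=0.5))
```

Gaussian-noise exploration, the third policy mentioned, is the continuous-action analogue: add zero-mean Gaussian noise to the greedy action.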
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Careful at Estimation and Bold at Exploration [21.518406902400432]
Policy-based exploration is beneficial for continuous action space in deterministic policy reinforcement learning.
However, policy-based exploration has two prominent issues: aimless exploration and policy divergence.
We introduce a novel exploration strategy to mitigate these issues, separate from the policy gradient.
arXiv Detail & Related papers (2023-08-22T10:52:46Z)
- On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z)
- MURAL: Meta-Learning Uncertainty-Aware Rewards for Outcome-Driven Reinforcement Learning [65.52675802289775]
We show that an uncertainty-aware classifier can solve challenging reinforcement learning problems.
We propose a novel method for computing the normalized maximum likelihood (NML) distribution.
We show that the resulting algorithm has a number of intriguing connections to both count-based exploration methods and prior algorithms for learning reward functions.
arXiv Detail & Related papers (2021-07-15T08:19:57Z)
- Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how, under a more standard notion of low inherent Bellman error, typically employed in least-squares value-iteration-style algorithms, we can provide strong PAC guarantees on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near-optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
- Policy Gradient from Demonstration and Curiosity [9.69620214666782]
In this work, an integrated policy gradient algorithm was proposed to boost exploration and facilitate intrinsic reward learning.
The presented algorithm was evaluated on a range of simulated tasks with sparse extrinsic reward signals.
It was found that the agent could imitate the expert's behavior and meanwhile sustain high return.
arXiv Detail & Related papers (2020-04-22T07:57:39Z)
- Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2 A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2 A H^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)