Navigating to the Best Policy in Markov Decision Processes
- URL: http://arxiv.org/abs/2106.02847v1
- Date: Sat, 5 Jun 2021 09:16:28 GMT
- Title: Navigating to the Best Policy in Markov Decision Processes
- Authors: Aymen Al Marjani, Aurélien Garivier, Alexandre Proutiere
- Abstract summary: We investigate the active pure exploration problem in Markov Decision Processes.
The agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible.
- Score: 68.8204255655161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the classical active pure exploration problem in Markov
Decision Processes, where the agent sequentially selects actions and, from the
resulting system trajectory, aims at identifying the best policy as fast as
possible. We propose an information-theoretic lower bound on the average number
of steps required before a correct answer can be given with probability at
least $1-\delta$. This lower bound involves a non-convex optimization problem,
for which we propose a convex relaxation. We further provide an algorithm whose
sample complexity matches the relaxed lower bound up to a factor $2$. This
algorithm addresses general communicating MDPs; we propose a variant with
reduced exploration rate (and hence faster convergence) under an additional
ergodicity assumption. This work extends previous results relative to the
\emph{generative setting}~\cite{marjani2020adaptive}, where the agent could at
each step observe the random outcome of any (state, action) pair. In contrast,
we show here how to deal with the \emph{navigation constraints}. Our analysis
relies on an ergodic theorem for non-homogeneous Markov chains which we
consider of wide interest in the analysis of Markov Decision Processes.
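As a point of reference for the lower bound mentioned above, results of this kind in the best-policy identification literature are typically stated in a sup-inf form. The following is a sketch under standard assumptions rather than a formula quoted from the paper: $\tau_\delta$ denotes the stopping time of a $\delta$-correct algorithm, $\mathrm{Alt}(\mathcal{M})$ the set of MDPs whose optimal policy differs from that of $\mathcal{M}$, and $\Omega(\mathcal{M})$ the set of state-action visitation frequencies achievable under the navigation constraints.
$$
\mathbb{E}_{\mathcal{M}}[\tau_\delta] \;\ge\; T^*(\mathcal{M})\,\mathrm{kl}(\delta,1-\delta),
\qquad
\frac{1}{T^*(\mathcal{M})} \;=\; \sup_{\omega\in\Omega(\mathcal{M})}\;\inf_{\mathcal{M}'\in\mathrm{Alt}(\mathcal{M})}\;\sum_{s,a}\omega(s,a)\,
\mathrm{KL}\bigl(p_{\mathcal{M}}(\cdot\mid s,a)\,\big\|\,p_{\mathcal{M}'}(\cdot\mid s,a)\bigr),
$$
where $\mathrm{kl}(x,y)$ is the binary relative entropy. Restricting $\omega$ to frequencies reachable by navigation, rather than allowing arbitrary sampling as in the generative setting, is what makes the sup-inf problem non-convex and motivates the convex relaxation described in the abstract.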
Related papers
- Strictly Low Rank Constraint Optimization -- An Asymptotically
$\mathcal{O}(\frac{1}{t^2})$ Method [5.770309971945476]
We propose a class of non-convex and non-smooth problems with \textit{rank} regularization to promote sparsity in the optimal solution.
We show that our algorithms are able to achieve a convergence rate of $\mathcal{O}(\frac{1}{t^2})$, which is exactly the same as Nesterov's optimal convergence rate for first-order methods on smooth convex problems.
arXiv Detail & Related papers (2023-07-04T16:55:41Z) - First Order Methods with Markovian Noise: from Acceleration to Variational Inequalities [91.46841922915418]
We present a unified approach for the theoretical analysis of first-order gradient methods for stochastic optimization and variational inequalities.
Our approach covers both non-convex and strongly convex minimization problems.
We provide bounds that match the oracle complexity in the case of strongly convex optimization problems.
arXiv Detail & Related papers (2023-05-25T11:11:31Z) - Finite-Sample Analysis of Off-Policy Natural Actor-Critic Algorithm [4.932130498861987]
We provide finite-sample convergence guarantees for an off-policy variant of the natural actor-critic (NAC) algorithm based on Importance Sampling.
We show that the algorithm converges to a globally optimal policy with a sample complexity of $\mathcal{O}(\epsilon^{-3}\log^2(1/\epsilon))$ under an appropriate choice of stepsizes.
arXiv Detail & Related papers (2021-02-18T13:22:59Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Convergence of adaptive algorithms for weakly convex constrained
optimization [59.36386973876765]
We prove a $\tilde{\mathcal{O}}(t^{-1/4})$ rate of convergence for the norm of the gradient of the Moreau envelope.
Our analysis works with a mini-batch size of $1$, constant first and second order moment parameters, and possibly unbounded optimization domains.
arXiv Detail & Related papers (2020-06-11T17:43:19Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new \emph{upper confidence primal-dual} algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z) - Finite-Time Analysis of Round-Robin Kullback-Leibler Upper Confidence
Bounds for Optimal Adaptive Allocation with Multiple Plays and Markovian
Rewards [10.66048003460524]
We study an extension of the classic multi-armed bandit problem which involves multiple plays and Markovian rewards in the rested bandits setting.
In order to tackle this problem, we consider an adaptive allocation rule which at each stage combines the information from the sample means of all the arms with the Kullback-Leibler upper confidence bound of a single arm selected in a round-robin way (a hypothetical sketch of such a rule appears after this list).
arXiv Detail & Related papers (2020-01-30T08:09:01Z) - The Simulator: Understanding Adaptive Sampling in the
Moderate-Confidence Regime [52.38455827779212]
We propose a novel technique for analyzing adaptive sampling called the \emph{Simulator}.
We prove the first instance-based lower bounds for the top-k problem which incorporate the appropriate log-factors.
Our new analysis inspires a simple and near-optimal algorithm for best-arm and top-k identification, the first \emph{practical} algorithm of its kind for the latter problem.
arXiv Detail & Related papers (2017-02-16T23:42:02Z)
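Returning to the round-robin Kullback-Leibler upper confidence bound entry above, a minimal Python sketch of the kind of allocation rule it describes is given below. This is a hypothetical illustration, not the authors' algorithm: the Bernoulli reward model, the $\log t$ exploration budget, and the helper names (bernoulli_kl, kl_ucb_index, select_arms) are assumptions made purely for the example.

# Hypothetical sketch of a round-robin KL-UCB allocation rule with multiple plays.
# At each stage, every arm is scored by its sample mean, except one arm -- chosen
# in round-robin fashion -- which is scored by its KL upper confidence bound;
# the top-m arms are then played. Bernoulli rewards are assumed for illustration.
import math

def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean: float, pulls: int, t: int) -> float:
    """Largest q >= mean with pulls * KL(mean, q) <= log(t), found by bisection."""
    if pulls == 0:
        return 1.0
    budget = math.log(max(t, 2)) / pulls
    lo, hi = mean, 1.0
    for _ in range(50):  # bisection; KL(mean, q) is increasing in q above the mean
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def select_arms(means, pulls, t, m):
    """Play the m arms with the largest indices: one round-robin arm gets the
    KL-UCB index, all other arms keep their sample means as indices."""
    k = len(means)
    rr_arm = t % k  # the single arm examined optimistically at this stage
    indices = list(means)
    indices[rr_arm] = kl_ucb_index(means[rr_arm], pulls[rr_arm], t)
    return sorted(range(k), key=lambda a: indices[a], reverse=True)[:m]

In this sketch the bisection exploits the fact that the Bernoulli KL divergence is increasing in its second argument above the sample mean, so the returned index is the largest mean still consistent with the exploration budget.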