Optimism in Face of a Context: Regret Guarantees for Stochastic
Contextual MDP
- URL: http://arxiv.org/abs/2207.11126v1
- Date: Fri, 22 Jul 2022 15:00:15 GMT
- Title: Optimism in Face of a Context: Regret Guarantees for Stochastic
Contextual MDP
- Authors: Orin Levy and Yishay Mansour
- Abstract summary: We present regret minimization algorithms for stochastic contextual MDPs under a minimum reachability assumption.
Ours is the first optimistic approach applied to contextual MDPs with general function approximation.
- Score: 46.86114958340962
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present regret minimization algorithms for stochastic contextual MDPs
under a minimum reachability assumption, using access to an offline least-squares
regression oracle. We analyze three settings: one where the dynamics is known, one
where the dynamics is unknown but independent of the context, and the most
challenging setting where the dynamics is unknown and context-dependent. For the
latter, our algorithm obtains a regret bound of
$\tilde{O}\left( \max\{H, 1/p_{\min}\}\, H |S|^{3/2} \sqrt{|A| T \log(\max\{|\mathcal{F}|, |\mathcal{P}|\}/\delta)} \right)$,
with probability $1-\delta$, where $\mathcal{P}$ and $\mathcal{F}$ are finite and
realizable function classes used to approximate the dynamics and rewards
respectively, $p_{\min}$ is the minimum reachability parameter, $S$ is the set of
states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes.
To our knowledge, ours is the first optimistic approach applied to contextual MDPs
with general function approximation (i.e., without additional knowledge about the
function class, such as linearity). In addition, we present a lower bound of
$\Omega(\sqrt{T H |S| |A| \ln(|\mathcal{F}|/|S|)/\ln(|A|)})$ on the expected
regret, which holds even in the case of known dynamics.
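To make the recipe in the abstract concrete, the following is a minimal sketch of the general "optimism in the face of uncertainty with an offline least-squares regression oracle" loop, specialized to the simplest setting analyzed (known dynamics) with a small finite reward class. All environment sizes, the count-based bonus, and every name below are illustrative assumptions; this is a schematic sketch, not the paper's algorithm.

```python
import numpy as np

# Minimal, illustrative sketch of "optimism + offline least-squares regression
# oracle" for a stochastic contextual MDP with KNOWN dynamics and a finite,
# realizable reward class F.  All sizes, the count-based bonus, and the names
# below are assumptions for illustration -- this is not the paper's algorithm.

rng = np.random.default_rng(0)
n_contexts, n_states, n_actions, H = 3, 4, 2, 5

# Known, context-independent dynamics: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

# Finite reward class: each f maps (context, state, action) to a mean reward in [0, 1].
F = [rng.random((n_contexts, n_states, n_actions)) for _ in range(20)]
f_star = F[0]          # realizability: the true reward function lies in F
data = []              # offline dataset of (context, state, action, reward) tuples

def ls_oracle(data, F):
    """Offline least-squares regression oracle: ERM over the finite class F."""
    if not data:
        return F[0]
    errors = [sum((f[c, s, a] - r) ** 2 for c, s, a, r in data) for f in F]
    return F[int(np.argmin(errors))]

def optimistic_policy(f_hat, counts, c):
    """Finite-horizon value iteration on an optimistic (bonus-inflated) reward."""
    bonus = 1.0 / np.sqrt(np.maximum(counts[c], 1.0))    # illustrative bonus
    V = np.zeros(n_states)
    pi = np.zeros((H, n_states), dtype=int)
    for h in reversed(range(H)):
        Q = np.clip(f_hat[c] + bonus, 0.0, 1.0) + P @ V  # shape (states, actions)
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi

counts = np.zeros((n_contexts, n_states, n_actions))
for episode in range(200):
    c = int(rng.integers(n_contexts))          # context drawn i.i.d. each episode
    f_hat = ls_oracle(data, F)                 # regress rewards from past data
    pi = optimistic_policy(f_hat, counts, c)   # plan optimistically for this context
    s = 0
    for h in range(H):                         # roll out and log the trajectory
        a = int(pi[h, s])
        r = f_star[c, s, a] + 0.1 * rng.standard_normal()
        data.append((c, s, a, r))
        counts[c, s, a] += 1
        s = int(rng.choice(n_states, p=P[s, a]))
```

In the unknown-dynamics settings described in the abstract, one would additionally fit a dynamics model from the class $\mathcal{P}$ with the same kind of oracle and plan optimistically with respect to it; the sketch above omits that step.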
Related papers
- Online Newton Method for Bandit Convex Optimisation [28.66596225688161]
We introduce a computationally efficient algorithm for zeroth-order bandit convex optimisation.
We prove that in the adversarial setting its regret is at most $d^{3.5} \sqrt{n}\, \mathrm{polylog}(n, d)$ with high probability, where $d$ is the dimension and $n$ is the time horizon.
In the stochastic setting the bound improves to $M d^{2} \sqrt{n}\, \mathrm{polylog}(n, d)$, where $M \in [d^{-1/2}, d^{-1/4}]$ is
arXiv Detail & Related papers (2024-06-10T17:44:11Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d + \log(|\mathcal{S}|^2 |\mathcal{A}|)) \sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - Efficient Rate Optimal Regret for Adversarial Contextual MDPs Using
Online Function Approximation [47.18926328995424]
We present the OMG-CMDP! algorithm for regret minimization in adversarial Contextual MDPs.
Our algorithm is efficient (assuming efficient online regression oracles), simple, and robust to approximation errors.
arXiv Detail & Related papers (2023-03-02T18:27:00Z) - Eluder-based Regret for Stochastic Contextual MDPs [43.19667415823089]
We present the E-UC$^3$RL algorithm for regret minimization in Contextual Markov Decision Processes (CMDPs).
Our algorithm is efficient (assuming efficient offline regression oracles) and enjoys a regret guarantee of $\widetilde{O}\left(H^3 \sqrt{T |S| |A| d_{\mathrm{E}}(\mathcal{P})}\right)$.
arXiv Detail & Related papers (2022-11-27T20:38:47Z) - On Submodular Contextual Bandits [92.45432756301231]
We consider the problem of contextual bandits where actions are subsets of a ground set and mean rewards are modeled by an unknown monotone submodular function.
We show that our algorithm efficiently randomizes around local optima of estimated functions according to the Inverse Gap Weighting strategy; a short sketch of this weighting rule appears after this list.
arXiv Detail & Related papers (2021-12-03T21:42:33Z) - Provably Breaking the Quadratic Error Compounding Barrier in Imitation
Learning, Optimally [58.463668865380946]
We study the statistical limits of Imitation Learning in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$.
We establish an upper bound of $O(|\mathcal{S}| H^{3/2} / N)$ on the suboptimality using the Mimic-MD algorithm of Rajaraman et al. (2020).
We show that the minimax suboptimality grows as $\Omega(H^{3/2} / N)$ when $|\mathcal{S}| \geq 3$, while the unknown-transition setting suffers from a larger sharp rate.
arXiv Detail & Related papers (2021-02-25T15:50:19Z) - Toward the Fundamental Limits of Imitation Learning [29.87139380803829]
Imitation learning (IL) aims to mimic the behavior of an expert policy in a sequential decision-making problem given only demonstrations.
We first consider the setting where the learner is provided a dataset of $N$ expert trajectories ahead of time, and cannot interact with the MDP.
We show that the policy which mimics the expert whenever possible is in expectation $\lesssim \frac{|\mathcal{S}| H^2 \log(N)}{N}$ suboptimal compared to the value of the expert, even when the expert follows an arbitrary policy.
arXiv Detail & Related papers (2020-09-13T12:45:52Z) - $Q$-learning with Logarithmic Regret [60.24952657636464]
We prove that an optimistic variant of $Q$-learning enjoys a $\mathcal{O}\left(\frac{SA \cdot \mathrm{poly}(H)}{\Delta_{\min}} \log(SAT)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap.
arXiv Detail & Related papers (2020-06-16T13:01:33Z) - Taking a hint: How to leverage loss predictors in contextual bandits? [63.546913998407405]
We study learning in contextual bandits with the help of loss predictors.
We show that the optimal regret is $\mathcal{O}(\min\{\sqrt{T}, \sqrt{\mathcal{E}}\, T^{\frac{1}{3}}\})$ when $\mathcal{E}$ is known.
arXiv Detail & Related papers (2020-03-04T07:36:38Z)
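As referenced in the submodular contextual bandits entry above, here is a short, self-contained sketch of a standard form of the Inverse Gap Weighting rule. The smoothing constant (set to the number of actions) and the exploration parameter below are illustrative choices, not the paper's implementation.

```python
import numpy as np

# Illustrative sketch of Inverse Gap Weighting (IGW): play the greedy action
# with high probability and each other action with probability inversely
# proportional to its estimated reward gap.  Constants are illustrative only.

def inverse_gap_weighting(estimates: np.ndarray, gamma: float) -> np.ndarray:
    """Return an IGW action distribution for the given reward estimates."""
    K = len(estimates)
    best = int(np.argmax(estimates))
    gaps = estimates[best] - estimates      # estimated gaps, all >= 0
    p = 1.0 / (K + gamma * gaps)            # mass on non-greedy actions
    p[best] = 0.0
    p[best] = 1.0 - p.sum()                 # remaining mass on the greedy action
    return p

# Example: four actions; a larger gamma concentrates mass on the greedy action.
print(inverse_gap_weighting(np.array([0.9, 0.7, 0.5, 0.1]), gamma=10.0))
```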