Tractable Optimality in Episodic Latent MABs
- URL: http://arxiv.org/abs/2210.03528v1
- Date: Wed, 5 Oct 2022 22:53:46 GMT
- Title: Tractable Optimality in Episodic Latent MABs
- Authors: Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor
- Abstract summary: We consider a multi-armed bandit problem with $M$ latent contexts, where an agent interacts with the environment for an episode of $H$ time steps.
Depending on the length of the episode, the learner may not be able to accurately estimate the latent context.
We design a procedure that provably learns a near-optimal policy with $O(\texttt{poly}(A) + \texttt{poly}(M,H)^{\min(M,H)})$ interactions.
- Score: 75.17357040707347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider a multi-armed bandit problem with $M$ latent contexts, where an
agent interacts with the environment for an episode of $H$ time steps.
Depending on the length of the episode, the learner may not be able to
accurately estimate the latent context. The resulting partial observation of the
environment makes the learning task significantly more challenging. Without any
additional structural assumptions, existing techniques to tackle partially
observed settings imply the decision maker can learn a near-optimal policy with
$O(A)^H$ episodes, but do not promise more. In this work, we show that learning
with {\em polynomial} samples in $A$ is possible. We achieve this by using
techniques from experiment design. Then, through a method-of-moments approach,
we design a procedure that provably learns a near-optimal policy with
$O(\texttt{poly}(A) + \texttt{poly}(M,H)^{\min(M,H)})$ interactions. In
practice, we show that we can formulate the moment-matching via maximum
likelihood estimation. In our experiments, this significantly outperforms the
worst-case guarantees, as well as existing practical methods.
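
To make the moment-matching-as-MLE idea concrete, here is a minimal sketch, not the authors' procedure: it assumes a Bernoulli-reward latent MAB with $M$ contexts and $A$ arms, episodes of $H$ pulls collected under an exploration policy, and fits the mixture likelihood by EM (a standard way to attack such an MLE). All function and variable names are illustrative assumptions.

```python
# Sketch only: EM for a mixture of M Bernoulli bandits, as one concrete way to
# cast moment-matching as maximum-likelihood estimation. Not the paper's code.
import numpy as np

def em_latent_mab(actions, rewards, M, A, n_iters=200, seed=0):
    """actions, rewards: (n_episodes, H) arrays; each episode is generated by
    one latent context m, where arm a pays Bernoulli(mu[m, a]) rewards."""
    rng = np.random.default_rng(seed)
    n, _ = actions.shape
    pi = np.full(M, 1.0 / M)                    # mixing weights over contexts
    mu = rng.uniform(0.25, 0.75, size=(M, A))   # per-context mean rewards
    for _ in range(n_iters):
        # E-step: posterior over each episode's latent context.
        log_post = np.tile(np.log(pi), (n, 1))  # (n, M)
        for m in range(M):
            p = mu[m][actions]                  # (n, H) success probabilities
            log_post[:, m] += np.sum(
                rewards * np.log(p) + (1 - rewards) * np.log(1 - p), axis=1)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted re-estimates of pi and mu.
        pi = post.mean(axis=0)
        for m in range(M):
            for a in range(A):
                w = post[:, m][:, None] * (actions == a)  # (n, H) weights
                if w.sum() > 0:
                    mu[m, a] = np.clip((w * rewards).sum() / w.sum(),
                                       1e-3, 1 - 1e-3)
    return pi, mu
```

Given estimates $(\pi, \mu)$, a natural (again, illustrative) way to act is to update the context posterior online from the first few pulls of an episode and then play the arm with the highest posterior-weighted mean.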
Related papers
- Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization [11.11876897168701]
We propose an algorithm that achieves a regret bound of order $\tilde{\mathcal{O}}(\mathrm{poly}(H)\sqrt{SAT})$.
The proposed algorithm and analysis completely avoid the typical tool given by occupancy measures.
arXiv Detail & Related papers (2024-07-08T08:06:45Z)
- Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge [0.704590071265998]
We study the sample complexity of online Q-learning methods when some prior knowledge about the dynamics is available or can be learned efficiently.
We present an optimistic Q-learning algorithm that achieves $\tilde{\mathcal{O}}(\text{Poly}(H)\sqrt{SAT})$ regret under perfect knowledge of $f$.
arXiv Detail & Related papers (2023-12-19T19:53:58Z)
- Adversarial Online Multi-Task Reinforcement Learning [12.421997449847153]
We consider the adversarial online multi-task reinforcement learning setting.
In each of $K$ episodes the learner is given an unknown task taken from a finite set of $M$ unknown finite-horizon MDP models.
The learner's objective is to minimize its regret with respect to the optimal policy for each task.
arXiv Detail & Related papers (2023-01-11T02:18:26Z)
- Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic Shortest Path [106.37656068276902]
We study the sample complexity of learning an $\epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem.
We derive complexity bounds when the learner has access to a generative model.
We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_\star$.
arXiv Detail & Related papers (2022-10-10T18:34:32Z)
- Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z)
- Minimax Optimal Online Imitation Learning via Replay Estimation [47.83919594113314]
We introduce a technique of replay estimation to reduce empirical variance.
We show that our approach achieves the optimal $\widetilde{O}\left(\min(H^{3/2}/N, H/\sqrt{N})\right)$ dependency.
arXiv Detail & Related papers (2022-05-30T19:29:56Z)
- Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations [79.66404989555566]
We consider the more realistic setting of agnostic RL with rich observation spaces and a fixed class of policies $\Pi$ that may not contain any near-optimal policy.
We provide an algorithm for this setting whose error is bounded in terms of the rank $d$ of the underlying MDP.
arXiv Detail & Related papers (2021-06-22T03:20:40Z)
- Online Apprenticeship Learning [58.45089581278177]
In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function.
The goal is to find a policy that matches the expert's performance on some predefined set of cost functions.
We show that the OAL problem can be effectively solved by combining two mirror-descent-based no-regret algorithms; a sketch of this style of update appears after this list.
arXiv Detail & Related papers (2021-02-13T12:57:51Z)
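
As a reference point for the mirror-descent-based approach mentioned in the Online Apprenticeship Learning entry above, here is a minimal sketch of an entropic mirror descent (Hedge / multiplicative-weights) no-regret update over a finite candidate set. It omits the two-algorithm structure of the actual OAL method, and all names and the loss interface are illustrative assumptions.

```python
# Sketch only: mirror descent with the entropic regularizer (Hedge) over a
# finite set of K candidates; one building block behind no-regret approaches.
import numpy as np

def hedge_update(weights, losses, eta):
    """One mirror-descent step: exponentially down-weight each candidate in
    proportion to its observed loss, then renormalize to a distribution."""
    w = weights * np.exp(-eta * losses)
    return w / w.sum()

# Usage: maintain a distribution over K candidates, updating after each round.
K, T, eta = 5, 100, 0.1
rng = np.random.default_rng(0)
w = np.full(K, 1.0 / K)
for _ in range(T):
    losses = rng.uniform(size=K)  # stand-in for this round's per-candidate losses
    w = hedge_update(w, losses, eta)
```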
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.