Reward-Mixing MDPs with a Few Latent Contexts are Learnable
- URL: http://arxiv.org/abs/2210.02594v1
- Date: Wed, 5 Oct 2022 22:52:00 GMT
- Title: Reward-Mixing MDPs with a Few Latent Contexts are Learnable
- Authors: Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor
- Abstract summary: We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
- Score: 75.17357040707347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider episodic reinforcement learning in reward-mixing Markov decision
processes (RMMDPs): at the beginning of every episode nature randomly picks a
latent reward model among $M$ candidates and an agent interacts with the MDP
throughout the episode for $H$ time steps. Our goal is to learn a near-optimal
policy that nearly maximizes the $H$ time-step cumulative rewards in such a
model. Previous work established an upper bound for RMMDPs for $M=2$. In this
work, we resolve several questions that remained open for the RMMDP model. For an
arbitrary $M\ge2$, we provide a sample-efficient
algorithm--$\texttt{EM}^2$--that outputs an $\epsilon$-optimal policy using
$\tilde{O} \left(\epsilon^{-2} \cdot S^d A^d \cdot \texttt{poly}(H, Z)^d
\right)$ episodes, where $S, A$ are the number of states and actions
respectively, $H$ is the time-horizon, $Z$ is the support size of reward
distributions, and $d=\min(2M-1,H)$. Our technique is a higher-order extension
of the method-of-moments based approach; nevertheless, the design and analysis
of the $\texttt{EM}^2$ algorithm requires several new ideas beyond existing
techniques. We also provide a lower bound of $(SA)^{\Omega(\sqrt{M})} /
\epsilon^{2}$ for a general instance of RMMDP, supporting that super-polynomial
sample complexity in $M$ is necessary.
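For intuition, below is a minimal, illustrative sketch of the RMMDP interaction protocol described above: the transition kernel is shared across contexts, while nature samples one of the $M$ latent reward models at the start of each episode and the agent never observes which one. The class and function names (`RewardMixingMDP`, `sample_episode`) and the Bernoulli reward parameterization are assumptions made for illustration only; this is not the $\texttt{EM}^2$ algorithm from the paper.

```python
import numpy as np

class RewardMixingMDP:
    """Toy reward-mixing MDP: shared transitions, M latent reward models."""

    def __init__(self, S, A, H, M, rng=None):
        self.S, self.A, self.H, self.M = S, A, H, M
        self.rng = rng or np.random.default_rng(0)
        # Shared transition kernel: P[s, a] is a distribution over next states.
        self.P = self.rng.dirichlet(np.ones(S), size=(S, A))
        # M candidate reward models; R[m, s, a] is a Bernoulli mean reward.
        self.R = self.rng.uniform(size=(M, S, A))
        # Mixing weights with which nature picks the latent context.
        self.w = self.rng.dirichlet(np.ones(M))

    def sample_episode(self, policy):
        """Roll out one episode: nature draws a latent reward model once,
        then the agent interacts for H steps without ever observing it."""
        m = self.rng.choice(self.M, p=self.w)  # latent context, fixed per episode
        s, total = 0, 0.0
        for h in range(self.H):
            a = policy(h, s)
            total += self.rng.binomial(1, self.R[m, s, a])  # reward from model m
            s = self.rng.choice(self.S, p=self.P[s, a])     # transitions ignore m
        return total

# Example: Monte Carlo estimate of a uniformly random policy's value.
env = RewardMixingMDP(S=5, A=2, H=10, M=3)
random_policy = lambda h, s: env.rng.integers(env.A)
returns = [env.sample_episode(random_policy) for _ in range(1000)]
print("estimated value of random policy:", np.mean(returns))
```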
Related papers
- Finding good policies in average-reward Markov Decision Processes without prior knowledge [19.89784209009327]
We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs).
By relying on a diameter estimation procedure, we propose the first algorithm for $(\varepsilon,\delta)$-PAC policy identification.
arXiv Detail & Related papers (2024-05-27T12:24:14Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs [6.996002801232415]
We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model.
For weakly communicating MDPs, we establish the complexity bound $\widetilde{O}\!\left(SA\frac{H}{\varepsilon^2}\right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space.
arXiv Detail & Related papers (2024-03-18T04:52:11Z) - Near Sample-Optimal Reduction-based Policy Learning for Average Reward
MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP).
We prove an upper bound of $\widetilde{O}\!\left(H \varepsilon^{-3} \ln \frac{1}{\delta}\right)$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of the bias of any optimal policy, $\varepsilon$ is the accuracy, and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches.
We design a computationally efficient algorithm that achieves near-optimal regret of $\tilde{O}\!\left(\sqrt{SAH^3K\ln(1/\delta)}\right)$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - Reward-Free Model-Based Reinforcement Learning with Linear Function
Approximation [92.99933928528797]
We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs).
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy.
We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{O}\!\left(H^4 d(H + d)\epsilon^{-2}\right)$ episodes.
arXiv Detail & Related papers (2021-10-12T23:03:58Z) - Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample
Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we show an improved algorithm with a better sample complexity bound.
arXiv Detail & Related papers (2020-06-06T13:34:41Z)