Reinforcement Learning in Reward-Mixing MDPs
- URL: http://arxiv.org/abs/2110.03743v1
- Date: Thu, 7 Oct 2021 18:55:49 GMT
- Title: Reinforcement Learning in Reward-Mixing MDPs
- Authors: Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor
- Abstract summary: episodic reinforcement learning in a reward-mixing Markov decision process (MDP). The algorithm finds an $\epsilon$-optimal policy after exploring $\tilde{O}(\mathrm{poly}(H,\epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is the time horizon and $S, A$ are the number of states and actions respectively.
- Score: 74.41782017817808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning a near optimal policy in a partially observable system remains an
elusive challenge in contemporary reinforcement learning. In this work, we
consider episodic reinforcement learning in a reward-mixing Markov decision
process (MDP). There, a reward function is drawn from one of multiple possible
reward models at the beginning of every episode, but the identity of the chosen
reward model is not revealed to the agent. Hence, the latent state space, for
which the dynamics are Markovian, is not given to the agent. We study the
problem of learning a near optimal policy for two reward-mixing MDPs. Unlike
existing approaches that rely on strong assumptions on the dynamics, we make no
assumptions and study the problem in full generality. Indeed, with no further
assumptions, even for two switching reward-models, the problem requires several
new ideas beyond existing algorithmic and analysis techniques for efficient
exploration. We provide the first polynomial-time algorithm that finds an
$\epsilon$-optimal policy after exploring $\tilde{O}(\mathrm{poly}(H,\epsilon^{-1})
\cdot S^2 A^2)$ episodes, where $H$ is the time horizon and $S, A$ are the number
of states and actions respectively. This is the first efficient algorithm that
does not require any assumptions in partially observed environments where the
observation space is smaller than the latent state space.
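To make the setting concrete, here is a minimal, hypothetical sketch of a two-model reward-mixing MDP as described in the abstract: the transition dynamics are shared and Markovian, but at the start of each episode one of two reward functions is drawn and its identity is hidden from the agent. All class and variable names are illustrative, not from the paper.

```python
import random

class RewardMixingMDP:
    """Toy two-model reward-mixing MDP (illustrative sketch, not the paper's code).

    Dynamics are shared across episodes; the reward model is resampled per
    episode and its index is never included in the observation.
    """

    def __init__(self, n_states, n_actions, horizon, reward_models, seed=0):
        self.n_states = n_states
        self.n_actions = n_actions
        self.horizon = horizon
        self.reward_models = reward_models  # list of dicts: (state, action) -> reward
        self.rng = random.Random(seed)

    def reset(self):
        # The latent reward model is drawn per episode and NOT revealed.
        self._latent = self.rng.randrange(len(self.reward_models))
        self._t = 0
        self._state = 0
        return self._state

    def step(self, action):
        reward = self.reward_models[self._latent].get((self._state, action), 0.0)
        # Toy shared dynamics: deterministic cycle over states.
        self._state = (self._state + action + 1) % self.n_states
        self._t += 1
        done = self._t >= self.horizon
        # Observation excludes the latent model index.
        return self._state, reward, done

# Two reward models that disagree only on (state 0, action 0).
models = [{(0, 0): 1.0}, {(0, 0): 0.0}]
env = RewardMixingMDP(n_states=3, n_actions=2, horizon=4,
                      reward_models=models, seed=1)
env.reset()
total, done = 0.0, False
while not done:
    _, r, done = env.step(0)
    total += r
```

With the cyclic dynamics above, action 0 visits state 0 twice per 4-step episode, so the episode return is 2.0 under one latent model and 0.0 under the other; an agent seeing only states and rewards must explore to distinguish the two, which is the core difficulty the paper addresses.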
Related papers
- Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs [63.47351876442425]
We study episodic linear mixture MDPs with the unknown transition and adversarial rewards under full-information feedback.
We propose a novel algorithm that combines the benefits of two popular methods: occupancy-measure-based and policy-based.
Our algorithm enjoys an $\widetilde{\mathcal{O}}(d\sqrt{H^3 K} + \sqrt{HK(H + \bar{P}_K)})$ dynamic regret, where $d$ is the feature dimension.
arXiv Detail & Related papers (2024-11-05T13:55:52Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Stochastic Principal-Agent Problems: Efficient Computation and Learning [25.637633553882985]
A principal and an agent interact in an environment, each privy to observations about the state not available to the other.
The model encompasses extensive-form games (EFGs) and partially observable Markov decision processes (POMDPs) as special cases.
We show an efficient algorithm for an episodic reinforcement learning setting where transition probabilities are unknown.
arXiv Detail & Related papers (2023-06-06T16:20:44Z) - Minimax-Optimal Reward-Agnostic Exploration in Reinforcement Learning [17.239062061431646]
This paper studies reward-agnostic exploration in reinforcement learning (RL)
Consider a finite-horizon inhomogeneous decision process with $S$ states, $A$ actions, and a horizon length $H$.
Our algorithm is able to yield $\varepsilon$ accuracy for arbitrarily many reward functions.
arXiv Detail & Related papers (2023-04-14T17:46:49Z) - Improved Sample Complexity for Reward-free Reinforcement Learning under
Low-rank MDPs [43.53286390357673]
This paper focuses on reward-free reinforcement learning under low-rank MDP models.
We first provide the first known sample complexity lower bound for any algorithm under low-rank MDPs.
We then propose a novel model-based algorithm, coined RAFFLE, and show it can both find an $\epsilon$-optimal policy and achieve an $\epsilon$-accurate system identification.
arXiv Detail & Related papers (2023-03-20T04:39:39Z) - Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear Markov decision processes (MDPs)
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde{O}(d^2 \varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal
Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z) - Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2 A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2 A H^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.