Improved Exploration in Factored Average-Reward MDPs
- URL: http://arxiv.org/abs/2009.04575v3
- Date: Thu, 11 Mar 2021 13:01:24 GMT
- Title: Improved Exploration in Factored Average-Reward MDPs
- Authors: Mohammad Sadegh Talebi, Anders Jonsson, Odalric-Ambrym Maillard
- Abstract summary: We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP).
We introduce a novel regret minimization strategy inspired by the popular UCRL2 strategy, called DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual elements of the transition function.
We show that for a generic factorization structure, DBN-UCRL achieves a regret bound, whose leading term strictly improves over existing regret bounds in terms of the dependencies on the size of $\mathcal S_i$'s and the involved diameter-related terms.
- Score: 23.096751699592133
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider a regret minimization task under the average-reward criterion in
an unknown Factored Markov Decision Process (FMDP). More specifically, we
consider an FMDP where the state-action space $\mathcal X$ and the state-space
$\mathcal S$ admit the respective factored forms of $\mathcal X =
\otimes_{i=1}^n \mathcal X_i$ and $\mathcal S=\otimes_{i=1}^m \mathcal S_i$,
and the transition and reward functions are factored over $\mathcal X$ and
$\mathcal S$. Assuming known factorization structure, we introduce a novel
regret minimization strategy inspired by the popular UCRL2 strategy, called
DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual
elements of the transition function. We show that for a generic factorization
structure, DBN-UCRL achieves a regret bound, whose leading term strictly
improves over existing regret bounds in terms of the dependencies on the size
of $\mathcal S_i$'s and the involved diameter-related terms. We further show
that when the factorization structure corresponds to the Cartesian product of
some base MDPs, the regret of DBN-UCRL is upper bounded by the sum of the regrets of
the base MDPs. We demonstrate, through numerical experiments on standard
environments, that DBN-UCRL enjoys substantially improved regret empirically
over existing algorithms that have frequentist regret guarantees.
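To make the per-element confidence sets concrete, the sketch below shows one way Bernstein-type (empirical Bernstein) confidence widths could be computed for a single factor of a factored transition function, given visit counts for each value of that factor's parent scope. This is a minimal illustration under assumed names and constants (bernstein_confidence_widths, delta, the 7/3 low-count correction), not the exact construction used by DBN-UCRL.

```python
import numpy as np

def bernstein_confidence_widths(counts, delta=0.05):
    """Empirical-Bernstein-style confidence widths for one transition factor.

    counts: array of shape (num_parent_scope_values, S_i) holding visit counts
            of each (parent-scope value, next value of factor i) pair.
    Returns an array of the same shape: one confidence width per element.
    """
    # Total visits per parent-scope value (at least 1 to avoid division by zero).
    n = np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    p_hat = counts / n                      # empirical transition probabilities
    var_hat = p_hat * (1.0 - p_hat)         # per-element Bernoulli variance proxy
    log_term = np.log(2.0 / delta)
    # Variance-dependent main term plus a low-count correction term
    # (illustrative constants, not the paper's exact confidence bound).
    return (np.sqrt(2.0 * var_hat * log_term / n)
            + (7.0 / 3.0) * log_term / np.maximum(n - 1.0, 1.0))

# Example: one factor with 3 parent-scope values and |S_i| = 4 next values.
counts = np.array([[5., 1., 0., 2.],
                   [10., 3., 3., 4.],
                   [0., 0., 1., 0.]])
print(bernstein_confidence_widths(counts).shape)  # (3, 4)
```

In a UCRL2-style scheme, an optimistic planner would then restrict attention to transition models whose factors stay within these element-wise widths around the empirical estimates.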
Related papers
- The Limits of Transfer Reinforcement Learning with Latent Low-rank Structure [9.631640936820126]
Many reinforcement learning algorithms are too costly to use in practice due to the large sizes $S$ and $A$ of the problem's state and action spaces.
We consider the problem of transferring a latent low-rank representation when the source and target MDPs have transition kernels with latent low-rank structure.
Our algorithm learns latent representations in each source MDP and then exploits the linear structure to remove the dependence on $S$, $A$, or $SA$ in the target MDP regret bound.
arXiv Detail & Related papers (2024-10-28T23:12:08Z) - Provably Efficient CVaR RL in Low-rank MDPs [58.58570425202862]
We study risk-sensitive Reinforcement Learning (RL).
We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to balance the interplay among exploration, exploitation, and representation learning in CVaR RL.
We prove a sample complexity bound for learning an $\epsilon$-optimal CVaR, where $H$ is the length of each episode, $A$ is the capacity of the action space, and $d$ is the dimension of the representations.
arXiv Detail & Related papers (2023-11-20T17:44:40Z) - Learning Adversarial Low-rank Markov Decision Processes with Unknown
Transition and Full-information Feedback [30.23951525723659]
We study the low-rank MDPs with adversarially changed losses in the full-information feedback setting.
We propose a policy optimization-based algorithm POLO, and we prove that it attains the $\widetilde{O}\big(K^{5/6} A^{1/2} d \ln(1+M)/(1-\gamma)^2\big)$ regret guarantee.
arXiv Detail & Related papers (2023-11-14T03:12:43Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Causal Markov Decision Processes: Learning Good Interventions
Efficiently [24.58691421788476]
We introduce causal Markov Decision Processes (C-MDPs), a new formalism for sequential decision making.
Many contemporary and emerging application areas such as digital healthcare and digital marketing can benefit from modeling with C-MDPs.
arXiv Detail & Related papers (2021-02-15T16:48:54Z) - Value Function Approximations via Kernel Embeddings for No-Regret
Reinforcement Learning [10.828727066443909]
We propose an online model-based RL algorithm, namely CME-RL, that learns representations of transition distributions as embeddings in a reproducing kernel Hilbert space.
We demonstrate the efficiency of our algorithm by proving a frequentist (worst-case) regret bound of order $\tilde{O}\big(H\gamma_N\sqrt{N}\big)$, where $\tilde{O}(\cdot)$ hides only absolute constants and poly-logarithmic factors.
arXiv Detail & Related papers (2020-11-16T11:40:55Z) - On Function Approximation in Reinforcement Learning: Optimism in the
Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning.
In particular, we prove that the complexity of the function class $\mathcal F$ characterizes the complexity of the learning problem.
Our regret bounds are independent of the number of episodes.
arXiv Detail & Related papers (2020-11-09T18:32:22Z) - Efficient Reinforcement Learning in Factored MDPs with Application to
Constrained RL [25.119552984253882]
We study reinforcement learning in episodic, factored Markov decision processes (FMDPs).
We propose an algorithm called FMDP-BF, which leverages the factorization structure of the FMDP.
As an application, we study a new formulation of constrained RL, known as RL with knapsack constraints (RLwK).
arXiv Detail & Related papers (2020-08-31T02:20:41Z) - Towards Minimax Optimal Reinforcement Learning in Factored Markov
Decision Processes [53.72166325215299]
We study minimax-optimal reinforcement learning in episodic factored Markov decision processes (FMDPs).
We propose two algorithms: the first achieves minimax-optimal regret guarantees for a rich class of factored structures,
while the second enjoys better computational complexity with a slightly worse regret.
arXiv Detail & Related papers (2020-06-24T00:50:17Z) - Provably Efficient Reinforcement Learning for Discounted MDPs with
Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs).
We propose a novel algorithm that makes use of the feature mapping and obtains a $\tilde{O}\big(d\sqrt{T}/(1-\gamma)^2\big)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)