Learning Adversarial Low-rank Markov Decision Processes with Unknown
Transition and Full-information Feedback
- URL: http://arxiv.org/abs/2311.07876v1
- Date: Tue, 14 Nov 2023 03:12:43 GMT
- Title: Learning Adversarial Low-rank Markov Decision Processes with Unknown
Transition and Full-information Feedback
- Authors: Canzhe Zhao, Ruofeng Yang, Baoxiang Wang, Xuezhou Zhang, Shuai Li
- Abstract summary: We study the low-rank MDPs with adversarially changed losses in the full-information feedback setting.
We propose a policy optimization-based algorithm POLO, and we prove that it attains the $\widetilde{O}(K^{\frac{5}{6}}A^{\frac{1}{2}}d\ln(1+M)/(1-\gamma)^2)$ regret guarantee.
- Score: 30.23951525723659
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we study the low-rank MDPs with adversarially changed losses in
the full-information feedback setting. In particular, the unknown transition
probability kernel admits a low-rank matrix decomposition \citep{REPUCB22}, and
the loss functions may change adversarially but are revealed to the learner at
the end of each episode. We propose a policy optimization-based algorithm POLO,
and we prove that it attains the
$\widetilde{O}(K^{\frac{5}{6}}A^{\frac{1}{2}}d\ln(1+M)/(1-\gamma)^2)$ regret
guarantee, where $d$ is the rank of the transition kernel (and hence the dimension
of the unknown representations), $A$ is the cardinality of the action space,
$M$ is the cardinality of the model class, and $\gamma$ is the discount
factor. Notably, our algorithm is oracle-efficient and has a regret guarantee
with no dependence on the size of the potentially arbitrarily large state space.
Furthermore, we also prove an $\Omega(\frac{\gamma^2}{1-\gamma} \sqrt{d A K})$
regret lower bound for this problem, showing that low-rank MDPs are
statistically more difficult to learn than linear MDPs in the regret
minimization setting. To the best of our knowledge, we present the first
algorithm that interleaves representation learning, exploration, and
exploitation to achieve the sublinear regret guarantee for RL with nonlinear
function approximation and adversarial losses.
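For intuition about the structural assumption above, here is a minimal NumPy sketch (all names and dimensions are illustrative, not from the paper) of a rank-$d$ transition kernel $P(s' \mid s, a) = \langle \phi(s,a), \mu(s') \rangle$. In the low-rank MDP setting the representation $\phi$ is unknown and must be learned, whereas in linear MDPs it is given, which is consistent with the lower bound above showing that low-rank MDPs are statistically harder.

```python
import numpy as np

# Illustrative sizes only (not from the paper): number of states S,
# actions A, and rank d of the transition kernel.
S, A, d = 50, 4, 3
rng = np.random.default_rng(0)

# Hypothetical factors: phi(s, a) is the (unknown) representation and
# mu(s') the next-state feature, both taken nonnegative here for simplicity.
phi = rng.random((S, A, d))
mu = rng.random((S, d))

# P[s, a, s'] = <phi(s, a), mu(s')>, then normalize each (s, a) row into a
# probability distribution; rescaling rows preserves the rank-d structure.
P = np.einsum("sad,td->sat", phi, mu)
P /= P.sum(axis=2, keepdims=True)

assert np.allclose(P.sum(axis=2), 1.0)
# Flattened as an (S*A) x S matrix, the kernel has rank at most d,
# no matter how large S is.
print(np.linalg.matrix_rank(P.reshape(S * A, S)))  # prints a value <= d (here 3)
```

This low-rank structure is why the regret guarantee above scales with $d$, $A$, and $\ln(1+M)$ rather than with the size of the state space.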
Related papers
- Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs [63.47351876442425]
We study episodic linear mixture MDPs with the unknown transition and adversarial rewards under full-information feedback.
We propose a novel algorithm that combines the benefits of two popular methods: occupancy-measure-based and policy-based.
Our algorithm enjoys an $\widetilde{\mathcal{O}}(d\sqrt{H^3 K} + \sqrt{HK(H + \bar{P}_K)})$ dynamic regret, where $d$ is the feature dimension.
arXiv Detail & Related papers (2024-11-05T13:55:52Z) - The Limits of Transfer Reinforcement Learning with Latent Low-rank Structure [9.631640936820126]
Many reinforcement learning algorithms are too costly to use in practice due to the large sizes $S, A$ of the problem's state and action space.
We consider the problem of transferring a latent low-rank representation when the source and target MDPs have low-rank transition kernels.
Our algorithm learns latent representations in each source MDP and then exploits the linear structure to remove the dependence on $S$, $A$, or $SA$ in the target MDP regret bound.
arXiv Detail & Related papers (2024-10-28T23:12:08Z) - A Doubly Robust Approach to Sparse Reinforcement Learning [19.68978899041642]
We propose a new regret minimization algorithm for episodic sparse linear Markov decision processes (SMDPs).
The regret of the proposed algorithm is $\tilde{O}(\sigma_{\min}^{-1} s_{\star} H \sqrt{N})$, where $\sigma_{\min}$ denotes the restrictive minimum eigenvalue of the average Gram matrix of feature vectors.
arXiv Detail & Related papers (2023-10-23T18:52:17Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function
Approximation [92.3161051419884]
We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback.
We show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z) - Provably Efficient Reinforcement Learning for Discounted MDPs with
Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Markov decision processes (MDPs).
We propose a novel algorithm that makes use of the feature mapping and obtains a $\tilde{O}(d\sqrt{T}/(1-\gamma)^2)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)