A Provably Efficient Algorithm for Linear Markov Decision Process with
Low Switching Cost
- URL: http://arxiv.org/abs/2101.00494v1
- Date: Sat, 2 Jan 2021 18:41:27 GMT
- Title: A Provably Efficient Algorithm for Linear Markov Decision Process with
Low Switching Cost
- Authors: Minbo Gao, Tianle Xie, Simon S. Du, Lin F. Yang
- Abstract summary: We present the first algorithm for linear MDP with a low switching cost.
Our algorithm achieves an $\widetilde{O}\left(\sqrt{d^3H^4K}\right)$ regret bound with a near-optimal $O\left(dH\log K\right)$ global switching cost.
- Score: 53.968049198926444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many real-world applications, such as those in medical domains and
recommendation systems, can be formulated as large state space reinforcement
learning problems with only a small budget for policy changes, i.e., a low
switching cost. This paper focuses on the linear
Markov Decision Process (MDP) recently studied in [Yang et al 2019, Jin et al
2020], where linear function approximation is used for generalization over the
large state space. We present the first algorithm for linear MDP with a low
switching cost. Our algorithm achieves an
$\widetilde{O}\left(\sqrt{d^3H^4K}\right)$ regret bound with a near-optimal
$O\left(d H\log K\right)$ global switching cost where $d$ is the feature
dimension, $H$ is the planning horizon and $K$ is the number of episodes the
agent plays. Our regret bound matches that of the best existing polynomial-time
algorithm by [Jin et al 2020], and our switching cost is exponentially smaller than theirs.
When specialized to tabular MDP, our switching cost bound improves upon those in
[Bai et al 2019, Zhang et al 2020]. We complement our positive result with an
$\Omega\left(dH/\log d\right)$ global switching cost lower bound for any
no-regret algorithm.
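As a rough illustration of the low-switching idea behind results of this kind (a minimal sketch, not the paper's pseudocode), the snippet below shows the determinant-doubling rule that is commonly used to bound the global switching cost in linear MDPs: the agent replans only when the Gram matrix of observed features has at least doubled in determinant at some step $h$, which can happen at most $O(d\log K)$ times per step. All names (`should_switch`, `observe`, the toy interaction loop, and the dimensions) are illustrative assumptions.

```python
import numpy as np

# Sketch of a determinant-doubling switching rule for linear MDPs.
# In the real algorithm, "replan" would rerun an LSVI-UCB-style
# regression/planning step; here we only count how often it triggers.

d, H, lam = 4, 5, 1.0  # feature dimension, horizon, ridge parameter (illustrative)

gram = [lam * np.eye(d) for _ in range(H)]        # running Gram matrices Lambda_h
gram_at_last_switch = [g.copy() for g in gram]    # snapshots from the last replan
num_switches = 0

def should_switch() -> bool:
    """Replan iff det(Lambda_h) has at least doubled for some step h."""
    return any(
        np.linalg.det(gram[h]) > 2.0 * np.linalg.det(gram_at_last_switch[h])
        for h in range(H)
    )

def observe(h: int, phi: np.ndarray) -> None:
    """Rank-one update of the Gram matrix with the observed feature phi(s_h, a_h)."""
    gram[h] += np.outer(phi, phi)

# Toy interaction loop with random features standing in for real trajectories.
rng = np.random.default_rng(0)
for episode in range(200):
    for h in range(H):
        observe(h, rng.normal(size=d) / np.sqrt(d))
    if should_switch():
        num_switches += 1
        gram_at_last_switch = [g.copy() for g in gram]

print(f"episodes: 200, policy switches: {num_switches}")
```

Since each $\det(\Lambda_h)$ starts at $\lambda^d$ and is polynomially bounded in $K$, it can double only $O(d\log K)$ times, so summing over the $H$ steps gives the $O(dH\log K)$ global switching cost discussed above.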
Related papers
- Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit
Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Logarithmic Switching Cost in Reinforcement Learning beyond Linear MDPs [31.673857053336352]
We propose an algorithm that achieves near-optimal regret with a switching cost logarithmic in the number of episodes and linear in the time horizon $H$.
We also show that the "doubling trick" used in ELEANOR-LowSwitching can be further leveraged for generalized linear function approximation.
arXiv Detail & Related papers (2023-02-24T05:14:27Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost [31.04961854943877]
We propose a new algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^4S^2AT})$ while requiring a switching cost of $O(HSA \log\log T)$.
As a byproduct, our new algorithmic techniques allow us to derive a reward-free exploration algorithm with an optimal switching cost of $O(HSA)$.
arXiv Detail & Related papers (2022-02-13T19:01:06Z) - Learning Infinite-Horizon Average-Reward Markov Decision Processes with
Constraints [39.715977181666766]
We study regret for infinite-horizon average-reward Markov Decision Processes (MDPs) under cost constraints.
Our algorithm ensures $\widetilde{O}(\sqrt{T})$ regret and constant constraint violation for ergodic MDPs.
These are the first set of provable algorithms for weakly communicating MDPs with cost constraints.
arXiv Detail & Related papers (2022-01-31T23:52:34Z) - Improved No-Regret Algorithms for Stochastic Shortest Path with Linear
MDP [31.62899359543925]
We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP.
Our first algorithm is computationally efficient and achieves a regret bound of $\widetilde{O}\left(\sqrt{d^3B_\star^2T_\star K}\right)$.
Our second algorithm is computationally inefficient but achieves the first "horizon-free" regret bound $\widetilde{O}(d^{3.5}B_\star\sqrt{K})$ with no dependency on $T_\star$.
arXiv Detail & Related papers (2021-12-18T06:47:31Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov
Decision Processes [91.38793800392108]
We study reinforcement learning with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model.
We propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs.
To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
arXiv Detail & Related papers (2020-12-15T18:56:46Z)