Logarithmic Switching Cost in Reinforcement Learning beyond Linear MDPs
- URL: http://arxiv.org/abs/2302.12456v1
- Date: Fri, 24 Feb 2023 05:14:27 GMT
- Title: Logarithmic Switching Cost in Reinforcement Learning beyond Linear MDPs
- Authors: Dan Qiao, Ming Yin, Yu-Xiang Wang
- Abstract summary: We propose an algorithm that achieves the near-optimal regret with a switching cost logarithmic in the number of episodes and linear in the time-horizon $H$.
We also show the doubling trick'' used in ELEANOR-LowSwitching can be further leveraged for the generalized linear function approximation.
- Score: 31.673857053336352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many real-life reinforcement learning (RL) problems, deploying new
policies is costly. In those scenarios, algorithms must solve exploration
(which requires adaptivity) while switching the deployed policy sparsely (which
limits adaptivity). In this paper, we go beyond the existing state-of-the-art
on this problem that focused on linear Markov Decision Processes (MDPs) by
considering linear Bellman-complete MDPs with low inherent Bellman error. We
propose the ELEANOR-LowSwitching algorithm that achieves the near-optimal
regret with a switching cost logarithmic in the number of episodes and linear
in the time-horizon $H$ and feature dimension $d$. We also prove a lower bound
proportional to $dH$ among all algorithms with sublinear regret. In addition,
we show the ``doubling trick'' used in ELEANOR-LowSwitching can be further
leveraged for the generalized linear function approximation, under which we
design a sample-efficient algorithm with near-optimal switching cost.
Related papers
- Learning Infinite-Horizon Average-Reward Linear Mixture MDPs of Bounded Span [16.49229317664822]
This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear mixture Markov decision processes (MDPs)
Our algorithm for linear mixture MDPs achieves a nearly minimax optimal regret upper bound of $widetildemathcalO(dsqrtmathrmsp(v*)T)$ over $T$ time steps.
arXiv Detail & Related papers (2024-10-19T05:45:50Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with a environment and seeks to reach certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning
with General Function Approximation [66.26739783789387]
We propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for reinforcement learning.
MQL-UCB achieves minimax optimal regret of $tildeO(dsqrtHK)$ when $K$ is sufficiently large and near-optimal policy switching cost.
Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
arXiv Detail & Related papers (2023-11-26T08:31:57Z) - Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement
Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $tildeO(d sqrtsum_k = 1K sigma_k2 + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision
Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $tilde O(dsqrtH3K)$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z) - Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning
with Linear Function Approximation [16.871660060209674]
We study the problem of deployment efficient reinforcement learning (RL) with linear function approximation under the emphreward-free exploration setting.
We propose a new algorithm that collects at most $widetildeO(fracd2H5epsilon2)$ trajectories within $H$ deployments to identify $epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions.
arXiv Detail & Related papers (2022-10-03T03:48:26Z) - Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov
Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z) - A Provably Efficient Algorithm for Linear Markov Decision Process with
Low Switching Cost [53.968049198926444]
We present the first algorithm for linear MDP with a low switching cost.
Our algorithm achieves an $widetildeOleft(sqrtd3H4Kright)$ regret bound with a near-optimal $Oleft(d Hlog Kright)$ global switching cost.
arXiv Detail & Related papers (2021-01-02T18:41:27Z) - Nonstationary Reinforcement Learning with Linear Function Approximation [19.521419943509784]
We consider reinforcement learning in episodic Markov decision processes (MDPs) with linear function approximation under drifting environment.
We first develop an optimistic modification of least-squares value with periodic restart, and bound its dynamic regret when variation budgets are known.
We derive the first minimax dynamic regret lower bound for nonstationary linear MDPs and as a byproduct establish a minimax regret lower bound for linear MDPs unsolved by Jin et al.
arXiv Detail & Related papers (2020-10-08T20:07:44Z) - Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms and
Tighter Regret Bounds for the Non-Episodic Setting [24.90164851620799]
We study reinforcement learning in non-episodic factored Markov decision processes (FMDPs)
We propose two near-optimal and oracle-efficient algorithms for FMDPs.
Our oracle-efficient algorithms outperform previously proposed near-optimal algorithms on computer network administration simulations.
arXiv Detail & Related papers (2020-02-06T15:19:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.