Computationally Efficient Horizon-Free Reinforcement Learning for Linear
Mixture MDPs
- URL: http://arxiv.org/abs/2205.11507v1
- Date: Mon, 23 May 2022 17:59:18 GMT
- Title: Computationally Efficient Horizon-Free Reinforcement Learning for Linear
Mixture MDPs
- Authors: Dongruo Zhou and Quanquan Gu
- Abstract summary: We propose the first computationally efficient horizon-free algorithm for linear mixture MDPs.
Our algorithm adapts a weighted least squares estimator for the unknown transition dynamics.
This also improves upon the best-known algorithms in this setting when $\sigma_k^2$'s are known.
- Score: 111.75736569611159
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have shown that episodic reinforcement learning (RL) is not
more difficult than contextual bandits, even with a long planning horizon and
unknown state transitions. However, these results are limited to either tabular
Markov decision processes (MDPs) or computationally inefficient algorithms for
linear mixture MDPs. In this paper, we propose the first computationally
efficient horizon-free algorithm for linear mixture MDPs, which achieves the
optimal $\tilde O(d\sqrt{K} +d^2)$ regret up to logarithmic factors. Our
algorithm adapts a weighted least squares estimator for the unknown transition
dynamics, where the weight is both \emph{variance-aware} and
\emph{uncertainty-aware}. When applying our weighted least squares estimator to
heterogeneous linear bandits, we can obtain an $\tilde O(d\sqrt{\sum_{k=1}^K
\sigma_k^2} +d)$ regret in the first $K$ rounds, where $d$ is the dimension of
the context and $\sigma_k^2$ is the variance of the reward in the $k$-th round.
This also improves upon the best-known algorithms in this setting when
$\sigma_k^2$'s are known.
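To make the estimator concrete, below is a minimal numerical sketch of a variance- and uncertainty-aware weighted ridge regression in the spirit of the abstract. It is an illustration under assumed details, not the paper's algorithm: the exact weighting rule, the floor parameter alpha, the inflation parameter gamma, and the function name weighted_ridge_update are hypothetical choices; the paper's actual weights and confidence sets differ.

```python
import numpy as np

def weighted_ridge_update(Sigma, b, phi, y, sigma2, alpha=0.1, gamma=1.0):
    """One update of a variance- and uncertainty-aware weighted ridge regression.

    Illustrative only: the weight of the sample (phi, y) is 1 / bar_sigma^2,
    where bar_sigma^2 floors the reported variance sigma2 (via alpha) and is
    inflated when phi is poorly covered by the current design matrix Sigma
    (via its elliptical norm, scaled by gamma).
    """
    uncertainty = float(phi @ np.linalg.solve(Sigma, phi))  # ||phi||_{Sigma^{-1}}^2
    bar_sigma2 = max(sigma2, alpha ** 2, gamma ** 2 * uncertainty)
    w = 1.0 / bar_sigma2
    Sigma_new = Sigma + w * np.outer(phi, phi)
    b_new = b + w * y * phi
    theta_hat = np.linalg.solve(Sigma_new, b_new)  # weighted ridge estimate
    return Sigma_new, b_new, theta_hat

# Toy usage on heteroscedastic linear data.
d, lam = 4, 1.0
Sigma, b = lam * np.eye(d), np.zeros(d)
rng = np.random.default_rng(0)
theta_star = rng.normal(size=d)
for k in range(200):
    phi = rng.normal(size=d)
    sigma2 = 0.1 + 0.9 * rng.random()  # per-round noise variance sigma_k^2
    y = phi @ theta_star + rng.normal(scale=np.sqrt(sigma2))
    Sigma, b, theta_hat = weighted_ridge_update(Sigma, b, phi, y, sigma2)
print(np.linalg.norm(theta_hat - theta_star))  # should be small
```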
Related papers
- Upper and Lower Bounds for Distributionally Robust Off-Dynamics Reinforcement Learning [6.236688509180343]
We study off-dynamics Reinforcement Learning (RL), where the policy training and deployment environments are different.
We propose a novel algorithm, We-DRIVE-U, that enjoys an average suboptimality of $\widetilde{\mathcal{O}}\big(dH \cdot \min\{1/\rho, H/\sqrt{K}\}\big)$.
We also construct a novel hard instance and derive the first information-theoretic lower bound in this setting.
arXiv Detail & Related papers (2024-09-30T17:21:15Z) - Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation [1.8416014644193066]
We propose an algorithm for learning linear Markov decision processes (MDPs) and linear mixture MDPs under the Bellman optimality condition.
Our algorithm for linear MDPs achieves the best-known regret upper bound of $\widetilde{\mathcal{O}}(d^{3/2}\mathrm{sp}(v^*)\sqrt{T})$ over $T$ time steps.
For linear mixture MDPs, our algorithm attains a regret bound of $\widetilde{\mathcal{O}}(d\cdot\ldots)$
arXiv Detail & Related papers (2024-09-16T23:13:42Z) - Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit
Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z) - Best-of-Both-Worlds Algorithms for Linear Contextual Bandits [11.94312915280916]
We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits.
Our algorithms deliver near-optimal regret bounds in both the stochastic and adversarial regimes.
arXiv Detail & Related papers (2023-12-24T08:27:30Z) - An Oblivious Stochastic Composite Optimization Algorithm for Eigenvalue
Optimization Problems [76.2042837251496]
We introduce two oblivious mirror descent algorithms based on a complementary composite setting.
Remarkably, both algorithms work without prior knowledge of the Lipschitz constant or smoothness of the objective function.
We show how to extend our framework to scale and demonstrate the efficiency and robustness of our methods on large scale semidefinite programs.
arXiv Detail & Related papers (2023-06-30T08:34:29Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log(|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement
Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d\sqrt{\sum_{k=1}^K \sigma_k^2} + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z) - Variance-Aware Confidence Set: Variance-Dependent Bound for Linear
Bandits and Horizon-Free Bound for Linear Mixture MDP [76.94328400919836]
We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov decision processes (MDPs).
For linear bandits, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{1 + \sum_{i=1}^K \sigma_i^2})$ regret bound, where $d$ is the feature dimension.
For linear mixture MDPs, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{K})$ regret bound, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-29T18:57:52Z) - Efficient Learning in Non-Stationary Linear Markov Decision Processes [17.296084954104415]
We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov decision processes (MDPs).
We propose OPT-WLSVI, an optimistic model-free algorithm based on weighted least-squares value iteration, which uses exponential weights to smoothly forget data that are far in the past (a toy sketch of this forgetting scheme appears after this entry).
We show that our algorithm, when competing against the best policy at each time, achieves a regret that is upper bounded by $\widetilde{\mathcal{O}}(d^{5/4} H^2 \Delta^{1/4} K^{3/4})$, where $d$ is the feature dimension.
arXiv Detail & Related papers (2020-10-24T11:02:45Z)
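The exponential-forgetting idea described for OPT-WLSVI can be illustrated with a toy regression sketch. This is an assumed, simplified form (the function name exp_weighted_lsq, the discount parameter eta, and the regularizer lam are illustrative), not the authors' estimator: each past sample is geometrically down-weighted by its age before solving a ridge-regularized least-squares problem.

```python
import numpy as np

def exp_weighted_lsq(features, targets, eta=0.9, lam=1.0):
    """Exponentially weighted ridge regression (toy sketch of the forgetting idea).

    A sample observed s steps in the past receives weight eta**s, so data far
    in the past is smoothly discounted while recent data dominates the fit.
    """
    T, d = features.shape
    weights = eta ** np.arange(T - 1, -1, -1)  # newest sample gets weight 1
    Sigma, b = lam * np.eye(d), np.zeros(d)
    for w, phi, y in zip(weights, features, targets):
        Sigma += w * np.outer(phi, phi)
        b += w * y * phi
    return np.linalg.solve(Sigma, b)

# Toy usage on a non-stationary target: the estimate tracks the recent parameter.
rng = np.random.default_rng(1)
d, T = 3, 200
features = rng.normal(size=(T, d))
theta_old, theta_new = rng.normal(size=d), rng.normal(size=d)
targets = np.concatenate([features[:100] @ theta_old,
                          features[100:] @ theta_new]) + 0.05 * rng.normal(size=T)
print(np.linalg.norm(exp_weighted_lsq(features, targets) - theta_new))  # small
```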