Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit
Feedback and Unknown Transition
- URL: http://arxiv.org/abs/2403.04568v1
- Date: Thu, 7 Mar 2024 15:03:50 GMT
- Title: Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit
Feedback and Unknown Transition
- Authors: Long-Fei Li, Peng Zhao, Zhi-Hua Zhou
- Abstract summary: We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
- Score: 71.33787410075577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study reinforcement learning with linear function approximation, unknown
transition, and adversarial losses in the bandit feedback setting.
Specifically, we focus on linear mixture MDPs whose transition kernel is a
linear mixture model. We propose a new algorithm that attains an
$\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability,
where $d$ is the dimension of feature mappings, $S$ is the size of state space,
$A$ is the size of action space, $H$ is the episode length and $K$ is the
number of episodes. Our result strictly improves the previous best-known
$\widetilde{O}(dS^2 \sqrt{K} + \sqrt{HSAK})$ result in Zhao et al. (2023a)
since $H \leq S$ holds by the layered MDP structure. Our advancements are
primarily attributed to (i) a new least-squares estimator for the transition
parameter that leverages the visit information of all states, as opposed to
only one state in prior work, and (ii) a new self-normalized concentration
inequality, tailored specifically to handle non-independent noise, that was
originally proposed in the dynamic assortment area and is applied here for the
first time in reinforcement learning to handle correlations between different states.
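To make contribution (i) concrete, here is a minimal, hypothetical Python sketch of a least-squares estimator that uses the visit information of all states. It assumes the standard linear mixture model $P(s' \mid s, a) = \langle \phi(s' \mid s, a), \theta^* \rangle$; the function and variable names are illustrative, and the paper's actual estimator and concentration analysis are more involved.

```python
import numpy as np

def estimate_theta(history, phi, states, d, lam=1.0):
    """Hypothetical ridge-regression estimate of the transition parameter
    theta* of a linear mixture MDP, where P(s' | s, a) = <phi(s', s, a), theta*>.

    history: observed (s, a, s_next) triples from past episodes.
    phi:     feature map phi(s_next, s, a) -> shape-(d,) array.
    states:  iterable of all states; every state contributes a regression
             target, rather than only one designated state as in prior work.
    """
    A = lam * np.eye(d)  # regularized Gram matrix
    b = np.zeros(d)
    for (s, a, s_next) in history:
        for sp in states:
            x = phi(sp, s, a)
            # The indicator 1{s' = s_next} is an unbiased sample of P(sp | s, a).
            y = 1.0 if sp == s_next else 0.0
            A += np.outer(x, x)
            b += y * x
    return np.linalg.solve(A, b)
```

Note that, within a single observed transition, the indicator targets for different candidate next states are correlated rather than independent, which is precisely the difficulty the self-normalized concentration bound in (ii) is designed to address.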
Related papers
- Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs [63.47351876442425]
We study episodic linear mixture MDPs with unknown transitions and adversarial rewards under full-information feedback.
We propose a novel algorithm that combines the benefits of two popular approaches: occupancy-measure-based and policy-based methods.
Our algorithm enjoys an $\widetilde{\mathcal{O}}(d\sqrt{H^3K} + \sqrt{HK(H + \bar{P}_K)})$ dynamic regret, where $d$ is the feature dimension (a standard definition of dynamic regret is recalled below).
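For context, dynamic regret is measured against an arbitrary sequence of comparator policies; a standard formulation (stated here as an assumption, since the summary above does not define it) is

$\text{D-Regret}(K) = \sum_{k=1}^{K} \big( V_k^{\pi_k^{c}}(s_1) - V_k^{\pi_k}(s_1) \big)$

where $\pi_1^{c}, \ldots, \pi_K^{c}$ is any comparator sequence, $\pi_k$ is the learner's policy, and $\bar{P}_K$ measures the total variation (path length) of the comparators; with a fixed comparator, $\bar{P}_K = 0$ and static regret is recovered.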
arXiv Detail & Related papers (2024-11-05T13:55:52Z)
- Demystifying Linear MDPs and Novel Dynamics Aggregation Framework [8.087699764574788]
In linear MDPs, the feature dimension $d$ is lower bounded by $S/U$ in order to aptly represent transition probabilities.
We propose a novel structural aggregation framework based on dynamics, termed "dynamics aggregation".
Our proposed algorithm exhibits statistical efficiency, achieving a regret of $\tilde{O}(d_{\psi}^{3/2} H^{3/2} \sqrt{T})$, where $d_{\psi}$ represents the feature dimension of aggregated subMDPs.
arXiv Detail & Related papers (2024-10-31T16:21:41Z)
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the iteration cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
- Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation [25.60689712525918]
We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear.
We propose a novel, computationally efficient algorithm, LSVI-UCB$^{+}$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound, where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps.
arXiv Detail & Related papers (2022-06-23T06:04:21Z)
- Computationally Efficient Horizon-Free Reinforcement Learning for Linear Mixture MDPs [111.75736569611159]
We propose the first computationally efficient horizon-free algorithm for linear mixture MDPs.
Our algorithm adapts a weighted least-squares estimator for the unknown transition dynamics; a generic form of such an estimator is sketched below.
This also improves upon the best-known algorithms in this setting when the variances $\sigma_k^2$ are known.
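A variance-weighted ridge-regression estimator typically takes the following form (a sketch under assumed notation, not necessarily this paper's exact construction):

$\hat{\theta}_k = \Big( \lambda I + \sum_{j<k} \bar{\sigma}_j^{-2} x_j x_j^{\top} \Big)^{-1} \sum_{j<k} \bar{\sigma}_j^{-2} x_j y_j$

where $x_j$ is the feature vector of the $j$-th observed transition, $y_j$ is its regression target, and $\bar{\sigma}_j^2$ is an (estimated) upper bound on the variance of $y_j$; down-weighting high-variance samples is the key mechanism behind horizon-free bounds.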
arXiv Detail & Related papers (2022-05-23T17:59:18Z)
- Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP [76.94328400919836]
We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov decision processes (MDPs).
For linear bandits, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{1 + \sum_{i=1}^{K}\sigma_i^2})$ regret bound, where $d$ is the feature dimension.
For linear mixture MDPs, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{K})$ regret bound.
arXiv Detail & Related papers (2021-01-29T18:57:52Z)
- Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes [91.38793800392108]
We study reinforcement learning with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model.
We propose a new, computationally efficient algorithm with linear function approximation, named UCRL-VTR$^{+}$, for the aforementioned linear mixture MDPs.
To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
arXiv Detail & Related papers (2020-12-15T18:56:46Z)
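As the name suggests, UCRL-VTR-style methods fit the transition parameter by value-targeted regression; a schematic form (an assumption based on the algorithm's name, not spelled out in the summary above) is

$\hat{\theta}_k = \operatorname{argmin}_{\theta} \sum_{j<k} \big( \langle \theta, \phi_{V_j}(s_j, a_j) \rangle - V_j(s'_j) \big)^2 + \lambda \|\theta\|_2^2$

where $\phi_{V}(s,a) = \sum_{s'} \phi(s' \mid s, a) V(s')$ aggregates the features with the current value estimate $V$, so that $\langle \theta^*, \phi_V(s,a) \rangle = \mathbb{E}[V(s') \mid s, a]$; the $^{+}$ variant additionally weights the regression by estimated variances to sharpen the bound.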