Online Sparse Reinforcement Learning
- URL: http://arxiv.org/abs/2011.04018v4
- Date: Wed, 10 Feb 2021 15:55:09 GMT
- Title: Online Sparse Reinforcement Learning
- Authors: Botao Hao, Tor Lattimore, Csaba Szepesv\'ari, Mengdi Wang
- Abstract summary: We investigate the hardness of online reinforcement learning in fixed-horizon, sparse linear Markov decision processes (MDPs).
We show that linear regret is generally unavoidable in this case, even if there exists a policy that collects well-conditioned data.
This shows that in the large-action setting, the difficulty of learning can be attributed to the difficulty of finding a good exploratory policy.
- Score: 60.44832065993122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the hardness of online reinforcement learning in fixed
horizon, sparse linear Markov decision process (MDP), with a special focus on
the high-dimensional regime where the ambient dimension is larger than the
number of episodes. Our contribution is two-fold. First, we provide a lower
bound showing that linear regret is generally unavoidable in this case, even if
there exists a policy that collects well-conditioned data. The lower bound
construction uses an MDP with a fixed number of states while the number of
actions scales with the ambient dimension. Note that when the horizon is fixed
to one, the case of linear stochastic bandits, the linear regret can be
avoided. Second, we show that if the learner has oracle access to a policy that
collects well-conditioned data then a variant of Lasso fitted Q-iteration
enjoys a nearly dimension-free regret of $\tilde{O}( s^{2/3} N^{2/3})$ where
$N$ is the number of episodes and $s$ is the sparsity level. This shows that in
the large-action setting, the difficulty of learning can be attributed to the
difficulty of finding a good exploratory policy.
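The abstract names Lasso fitted Q-iteration but does not spell it out; as a rough sketch only, one backward pass of Lasso-regularized fitted Q-iteration over a batch collected by an exploratory policy could look like the code below. The data layout, feature map, and regularization weight `lam` are assumptions made for illustration, not details from the paper.

```python
# Illustrative sketch of Lasso fitted Q-iteration (not the paper's exact algorithm).
# Assumes a batch of transitions per stage h, collected by an exploratory policy:
#   Phi:      (n, d)      features phi(s_h, a_h)
#   r:        (n,)        observed rewards
#   Phi_next: (n, A, d)   features phi(s_{h+1}, a') for every action a'
import numpy as np
from sklearn.linear_model import Lasso

def lasso_fqi(data, H, d, lam=0.1):
    """Backward pass estimating sparse weights theta_h with Q_h(s, a) ~ <phi(s, a), theta_h>."""
    theta = [np.zeros(d) for _ in range(H + 1)]   # theta[H] = 0 at the terminal stage
    for h in reversed(range(H)):
        Phi, r, Phi_next = data[h]
        q_next = Phi_next @ theta[h + 1]          # (n, A): estimated next-stage Q-values
        y = r + q_next.max(axis=1)                # regression targets r + max_a' Q_{h+1}
        model = Lasso(alpha=lam, fit_intercept=False).fit(Phi, y)
        theta[h] = model.coef_                    # sparse estimate of theta_h
    return theta                                  # act greedily: argmax_a <phi(s, a), theta[h]>
```

The exploratory-policy oracle matters here because the Lasso regression at each stage can only recover a sparse `theta_h` reliably when the collected feature covariance is well conditioned, which is exactly the assumption under which the abstract's $\tilde{O}(s^{2/3} N^{2/3})$ bound is stated.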
Related papers
- Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs [63.47351876442425]
We study episodic linear mixture MDPs with unknown transitions and adversarial rewards under full-information feedback.
We propose a novel algorithm that combines the benefits of two popular methods: occupancy-measure-based and policy-based.
Our algorithm enjoys an $\widetilde{\mathcal{O}}(d \sqrt{H^3 K} + \sqrt{HK(H + \bar{P}_K)})$ dynamic regret, where $d$ is the feature dimension.
arXiv Detail & Related papers (2024-11-05T13:55:52Z)
- First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach [57.570201404222935]
We show that it is possible to obtain regret scaling as $\mathcal{O}(\sqrt{V_1^\star K})$ in reinforcement learning with large state spaces.
We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result.
arXiv Detail & Related papers (2021-12-07T00:29:57Z)
- Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs [12.450760567361531]
In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees.
We present novel analyses that improve their regret bounds significantly.
Our analysis critically relies on a novel elliptical potential 'count' lemma.
arXiv Detail & Related papers (2021-11-05T06:47:27Z)
- High-Dimensional Sparse Linear Bandits [67.9378546011416]
We derive a novel $\Omega(n^{2/3})$ dimension-free minimax regret lower bound for sparse linear bandits in the data-poor regime.
We also prove a dimension-free $O(\sqrt{n})$ regret upper bound under an additional assumption on the magnitude of the signal for relevant features.
arXiv Detail & Related papers (2020-11-08T16:48:11Z)
- Efficient Learning in Non-Stationary Linear Markov Decision Processes [17.296084954104415]
We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov Decision Processes (MDPs).
We propose OPT-WLSVI, an optimistic model-free algorithm based on weighted least squares value iteration, which uses exponential weights to smoothly forget data that are far in the past.
We show that our algorithm, when competing against the best policy at each time, achieves a regret that is upper bounded by $\widetilde{\mathcal{O}}(d^{5/4} H^2 \Delta^{1/4} K^{3/4})$, where $d$ is the feature dimension, $H$ the horizon, $K$ the number of episodes, and $\Delta$ a measure of the non-stationarity.
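To make the forgetting mechanism above concrete, here is a minimal, hedged sketch of an exponentially weighted least squares estimate; the discount factor `gamma` and ridge term `reg` are illustrative assumptions, and this is not the OPT-WLSVI algorithm itself.

```python
# Sketch of an exponentially weighted least squares estimate (illustrative only).
# Older observations receive geometrically smaller weights, so the estimate
# smoothly "forgets" data that are far in the past.
import numpy as np

def exp_weighted_lstsq(Phi, y, gamma=0.99, reg=1.0):
    """Phi: (t, d) features in chronological order, y: (t,) regression targets.
    Sample i gets weight gamma**(t-1-i), so the most recent sample has weight 1."""
    t, d = Phi.shape
    w = gamma ** np.arange(t - 1, -1, -1)             # (t,) exponential weights
    A = (Phi * w[:, None]).T @ Phi + reg * np.eye(d)  # weighted Gram matrix + ridge term
    b = (Phi * w[:, None]).T @ y
    return np.linalg.solve(A, b)                      # weighted ridge estimate
```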
arXiv Detail & Related papers (2020-10-24T11:02:45Z)
- Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection [71.9765117768556]
We consider the contextual bandit problem under the high dimensional linear model.
This setting finds essential applications such as personalized recommendation, online advertisement, and personalized medicine.
We propose doubly growing epochs and estimate the parameter using the best subset selection method.
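As a rough illustration of the best subset selection step (practical only for small ambient dimension), a brute-force version might look like the sketch below; the sparsity budget `s` and the per-support least squares fit are assumptions made for illustration.

```python
# Brute-force best subset selection for linear regression (illustrative only;
# it enumerates every support of size <= s, so it scales badly with d).
import numpy as np
from itertools import combinations

def best_subset(Phi, y, s):
    """Return the coefficient vector supported on at most s coordinates
    that minimizes the residual sum of squares."""
    n, d = Phi.shape
    best_theta, best_rss = np.zeros(d), np.inf
    for k in range(1, s + 1):
        for support in combinations(range(d), k):
            idx = list(support)
            coef, resid, *_ = np.linalg.lstsq(Phi[:, idx], y, rcond=None)
            rss = resid[0] if resid.size else float(np.sum((Phi[:, idx] @ coef - y) ** 2))
            if rss < best_rss:
                best_theta = np.zeros(d)
                best_theta[idx] = coef
                best_rss = rss
    return best_theta
```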
arXiv Detail & Related papers (2020-09-04T04:10:39Z)
- Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning? [108.94173231481355]
Learning to plan for long horizons is a central challenge in episodic reinforcement learning problems.
We show that long horizon RL is no more difficult than short horizon RL, at least in a minimax sense.
arXiv Detail & Related papers (2020-05-01T17:56:38Z)