Zooming for Efficient Model-Free Reinforcement Learning in Metric Spaces
- URL: http://arxiv.org/abs/2003.04069v1
- Date: Mon, 9 Mar 2020 12:32:02 GMT
- Title: Zooming for Efficient Model-Free Reinforcement Learning in Metric Spaces
- Authors: Ahmed Touati, Adrien Ali Taiga, Marc G. Bellemare
- Abstract summary: We consider episodic reinforcement learning with a continuous state-action space which is assumed to be equipped with a natural metric.
We propose ZoomRL, an online algorithm that leverages ideas from continuous bandits to learn an adaptive discretization of the joint space.
We show that ZoomRL achieves a worst-case regret $\tilde{O}(H^{\frac{5}{2}} K^{\frac{d+1}{d+2}})$ where $H$ is the planning horizon, $K$ is the number of episodes and $d$ is the covering dimension of the space.
- Score: 26.297887542066505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the wealth of research into provably efficient reinforcement learning
algorithms, most works focus on tabular representation and thus struggle to
handle exponentially or infinitely large state-action spaces. In this paper, we
consider episodic reinforcement learning with a continuous state-action space
which is assumed to be equipped with a natural metric that characterizes the
proximity between different states and actions. We propose ZoomRL, an online
algorithm that leverages ideas from continuous bandits to learn an adaptive
discretization of the joint space by zooming in on more promising and frequently
visited regions while carefully balancing the exploration-exploitation
trade-off. We show that ZoomRL achieves a worst-case regret
$\tilde{O}(H^{\frac{5}{2}} K^{\frac{d+1}{d+2}})$ where $H$ is the planning
horizon, $K$ is the number of episodes and $d$ is the covering dimension of the
space with respect to the metric. Moreover, our algorithm enjoys improved
metric-dependent guarantees that reflect the geometry of the underlying space.
Finally, we show that our algorithm is robust to small misspecification errors.
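Below is a minimal, illustrative Python sketch of the zooming idea described in the abstract: maintain an adaptive cover of the joint state-action metric space with balls, keep an optimistic Q-value per ball, and split (zoom into) balls that are visited often. The class names, splitting threshold, bonus constants, and toy environment are assumptions made for illustration; this is not the paper's ZoomRL pseudocode.

```python
# Sketch of adaptive discretization ("zooming") for episodic RL in a metric space.
# Assumption-heavy illustration, not the ZoomRL algorithm from the paper.
import math
import random

class Ball:
    def __init__(self, s_center, a_center, radius, h, H):
        self.s, self.a, self.r = s_center, a_center, radius
        self.n = 0                     # number of visits
        self.q = float(H - h)          # optimistic initial Q-value
        self.children = []

    def contains_state(self, state):
        return abs(state - self.s) <= self.r

class ZoomingAgent:
    def __init__(self, H, lipschitz=1.0, bonus_scale=1.0):
        self.H, self.L, self.c = H, lipschitz, bonus_scale
        # one root ball per step h, covering S x A = [0, 1] x [0, 1]
        self.roots = [Ball(0.5, 0.5, 0.5, h, H) for h in range(H)]

    def _leaves(self, ball, state, out):
        # collect active (leaf) balls whose state slice contains `state`
        if ball.contains_state(state):
            if ball.children:
                for child in ball.children:
                    self._leaves(child, state, out)
            else:
                out.append(ball)

    def select(self, h, state):
        out = []
        self._leaves(self.roots[h], state, out)
        best = max(out, key=lambda b: b.q)
        return best, best.a            # play the chosen ball's center action

    def value(self, h, state):
        if h >= self.H:
            return 0.0
        out = []
        self._leaves(self.roots[h], state, out)
        return min(self.H - h, max(b.q for b in out))

    def update(self, ball, h, reward, next_value):
        ball.n += 1
        alpha = (self.H + 1) / (self.H + ball.n)   # Q-learning style step size
        bonus = self.c * self.H / math.sqrt(ball.n) + self.L * ball.r
        ball.q = (1 - alpha) * ball.q + alpha * (reward + next_value + bonus)
        # zoom rule (assumption): split once visits exceed ~(1 / radius)^2
        if not ball.children and ball.n >= (1.0 / ball.r) ** 2:
            half = ball.r / 2.0
            for ds in (-half, half):
                for da in (-half, half):
                    ball.children.append(Ball(ball.s + ds, ball.a + da, half, h, self.H))

# Toy environment on [0, 1]: reward is high when the action matches the state.
def env_step(state, action):
    reward = max(0.0, 1.0 - abs(action - state))
    next_state = min(1.0, max(0.0, state + random.uniform(-0.1, 0.1)))
    return reward, next_state

if __name__ == "__main__":
    H, K = 3, 500
    agent = ZoomingAgent(H)
    for _ in range(K):
        state = random.random()
        for h in range(H):
            ball, action = agent.select(h, state)
            reward, next_state = env_step(state, action)
            agent.update(ball, h, reward, agent.value(h + 1, next_state))
            state = next_state
```

The exploration bonus in this sketch combines a count-based term with a Lipschitz term proportional to the ball's radius, which is what lets the discretization adapt to the metric geometry rather than to a fixed grid.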
Related papers
- Provably Adaptive Average Reward Reinforcement Learning for Metric Spaces [2.2984209387877628]
We develop ZoRL, an algorithm that adaptively discretizes the state-action space and zooms into its promising regions.
ZoRL outperforms other state-of-the-art algorithms in experiments.
arXiv Detail & Related papers (2024-10-25T18:14:42Z)
- Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z)
- Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with General Utilities [12.104551746465932]
We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while satisfying their own safety constraints.
Our algorithm converges to a first-order stationary point (FOSP) at the rate of $\mathcal{O}\left(T^{-2/3}\right)$.
In the sample-based setting, we demonstrate that, with high probability, our algorithm requires $\widetilde{\mathcal{O}}\left(\epsilon^{-3.5}\right)$ samples to achieve an $\epsilon$-FOSP.
arXiv Detail & Related papers (2023-05-27T20:08:35Z)
- Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z)
- k-Means Maximum Entropy Exploration [55.81894038654918]
Exploration in continuous spaces with sparse rewards is an open problem in reinforcement learning.
We introduce an artificial curiosity algorithm based on lower bounding an approximation to the entropy of the state visitation distribution.
We show that our approach is both computationally efficient and competitive on benchmarks for exploration in high-dimensional, continuous spaces.
arXiv Detail & Related papers (2022-05-31T09:05:58Z)
- Exact Reduction of Huge Action Spaces in General Reinforcement Learning [28.19950790106291]
We show how action-binarization in the non-MDP case can significantly improve Extreme State Aggregation (ESA) bounds.
We provide an upper bound on the number of states of this binarized ESA that is logarithmic in the original action-space size, a double-exponential improvement.
arXiv Detail & Related papers (2020-12-18T12:45:03Z)
- Value Function Approximations via Kernel Embeddings for No-Regret Reinforcement Learning [10.828727066443909]
We propose an online model-based RL algorithm, CME-RL, that learns representations of transition distributions as embeddings in a reproducing kernel Hilbert space.
We demonstrate the efficiency of our algorithm by proving a frequentist (worst-case) regret bound of order $\tilde{O}\big(H\gamma_N\sqrt{N}\big)$, where $\tilde{O}(\cdot)$ hides only absolute constants and poly-logarithmic factors.
arXiv Detail & Related papers (2020-11-16T11:40:55Z)
- Stochastic Linear Bandits with Protected Subspace [51.43660657268171]
We study a variant of the linear bandit problem wherein we optimize a linear objective function but rewards are accrued only to an unknown subspace.
In particular, at each round, the learner must choose whether to query the objective or the protected subspace alongside choosing an action.
Our algorithm, derived from the OFUL principle, uses some of the queries to get an estimate of the protected space.
arXiv Detail & Related papers (2020-11-02T14:59:39Z)
- Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs).
We propose a novel algorithm that makes use of the feature mapping and obtains a $\tilde{O}(d\sqrt{T}/(1-\gamma)^2)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z)
- Kernel-Based Reinforcement Learning: A Finite-Time Analysis [53.47210316424326]
We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards.
We empirically validate our approach in continuous MDPs with sparse rewards.
arXiv Detail & Related papers (2020-04-12T12:23:46Z)
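As a complement to the Kernel-UCBVI entry above, here is a small, hedged sketch of the kernel-smoothing idea it describes: estimate the reward at a query state-action pair by kernel-weighted averaging of past observations, and add an optimism bonus that shrinks with the effective number of nearby samples. The Gaussian kernel, bandwidth, bonus constant, and toy data are assumptions for illustration, not the Kernel-UCBVI algorithm itself.

```python
# Optimistic kernel (Nadaraya-Watson) reward estimate -- an illustrative sketch.
import math

def gaussian_kernel(dist, bandwidth=0.1):
    return math.exp(-0.5 * (dist / bandwidth) ** 2)

def optimistic_reward(query, history, bonus_scale=1.0, bandwidth=0.1):
    """history: list of ((s, a), r) tuples; query: an (s, a) pair."""
    weight_sum, weighted_rewards = 0.0, 0.0
    for (s, a), r in history:
        w = gaussian_kernel(math.hypot(query[0] - s, query[1] - a), bandwidth)
        weight_sum += w
        weighted_rewards += w * r
    # kernel-weighted average plus a bonus that shrinks with the effective sample count
    estimate = weighted_rewards / weight_sum if weight_sum > 0 else 0.0
    bonus = bonus_scale / math.sqrt(1.0 + weight_sum)
    return min(1.0, estimate + bonus)   # rewards assumed bounded in [0, 1]

# A query near past data gets a tight estimate; a far-away query stays optimistic.
history = [((0.50, 0.50), 0.8), ((0.52, 0.48), 0.7)]
print(optimistic_reward((0.50, 0.50), history))
print(optimistic_reward((0.90, 0.10), history))
```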
This list is automatically generated from the titles and abstracts of the papers listed on this site.