Zooming for Efficient Model-Free Reinforcement Learning in Metric Spaces
- URL: http://arxiv.org/abs/2003.04069v1
- Date: Mon, 9 Mar 2020 12:32:02 GMT
- Title: Zooming for Efficient Model-Free Reinforcement Learning in Metric Spaces
- Authors: Ahmed Touati, Adrien Ali Taiga, Marc G. Bellemare
- Abstract summary: We consider episodic reinforcement learning with a continuous state-action space which is assumed to be equipped with a natural metric.
We propose ZoomRL, an online algorithm that leverages ideas from continuous bandits to learn an adaptive discretization of the joint space.
We show that ZoomRL achieves a worst-case regret $\tilde{O}(H^{\frac{5}{2}} K^{\frac{d+1}{d+2}})$ where $H$ is the planning horizon, $K$ is the number of episodes and $d$ is the covering dimension of the space.
- Score: 26.297887542066505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the wealth of research into provably efficient reinforcement learning
algorithms, most works focus on tabular representation and thus struggle to
handle exponentially or infinitely large state-action spaces. In this paper, we
consider episodic reinforcement learning with a continuous state-action space
which is assumed to be equipped with a natural metric that characterizes the
proximity between different states and actions. We propose ZoomRL, an online
algorithm that leverages ideas from continuous bandits to learn an adaptive
discretization of the joint space by zooming in on more promising and frequently
visited regions while carefully balancing the exploration-exploitation
trade-off. We show that ZoomRL achieves a worst-case regret
$\tilde{O}(H^{\frac{5}{2}} K^{\frac{d+1}{d+2}})$ where $H$ is the planning
horizon, $K$ is the number of episodes and $d$ is the covering dimension of the
space with respect to the metric. Moreover, our algorithm enjoys improved
metric-dependent guarantees that reflect the geometry of the underlying space.
Finally, we show that our algorithm is robust to small misspecification errors.
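Below is a minimal, illustrative Python sketch of the zooming idea described in the abstract: maintain an adaptive cover of the joint state-action metric space with balls, keep an optimistic Q-value per ball, and split (zoom into) balls that are visited often. The class names, splitting threshold, bonus constants, and toy environment are assumptions made for illustration; this is not the paper's ZoomRL pseudocode.

```python
# Sketch of adaptive discretization ("zooming") for episodic RL in a metric space.
# Assumption-heavy illustration, not the ZoomRL algorithm from the paper.
import math
import random

class Ball:
    def __init__(self, s_center, a_center, radius, h, H):
        self.s, self.a, self.r = s_center, a_center, radius
        self.n = 0                     # number of visits
        self.q = float(H - h)          # optimistic initial Q-value
        self.children = []

    def contains_state(self, state):
        return abs(state - self.s) <= self.r

class ZoomingAgent:
    def __init__(self, H, lipschitz=1.0, bonus_scale=1.0):
        self.H, self.L, self.c = H, lipschitz, bonus_scale
        # one root ball per step h, covering S x A = [0, 1] x [0, 1]
        self.roots = [Ball(0.5, 0.5, 0.5, h, H) for h in range(H)]

    def _leaves(self, ball, state, out):
        # collect active (leaf) balls whose state slice contains `state`
        if ball.contains_state(state):
            if ball.children:
                for child in ball.children:
                    self._leaves(child, state, out)
            else:
                out.append(ball)

    def select(self, h, state):
        out = []
        self._leaves(self.roots[h], state, out)
        best = max(out, key=lambda b: b.q)
        return best, best.a            # play the chosen ball's center action

    def value(self, h, state):
        if h >= self.H:
            return 0.0
        out = []
        self._leaves(self.roots[h], state, out)
        return min(self.H - h, max(b.q for b in out))

    def update(self, ball, h, reward, next_value):
        ball.n += 1
        alpha = (self.H + 1) / (self.H + ball.n)   # Q-learning style step size
        bonus = self.c * self.H / math.sqrt(ball.n) + self.L * ball.r
        ball.q = (1 - alpha) * ball.q + alpha * (reward + next_value + bonus)
        # zoom rule (assumption): split once visits exceed ~(1 / radius)^2
        if not ball.children and ball.n >= (1.0 / ball.r) ** 2:
            half = ball.r / 2.0
            for ds in (-half, half):
                for da in (-half, half):
                    ball.children.append(Ball(ball.s + ds, ball.a + da, half, h, self.H))

# Toy environment on [0, 1]: reward is high when the action matches the state.
def env_step(state, action):
    reward = max(0.0, 1.0 - abs(action - state))
    next_state = min(1.0, max(0.0, state + random.uniform(-0.1, 0.1)))
    return reward, next_state

if __name__ == "__main__":
    H, K = 3, 500
    agent = ZoomingAgent(H)
    for _ in range(K):
        state = random.random()
        for h in range(H):
            ball, action = agent.select(h, state)
            reward, next_state = env_step(state, action)
            agent.update(ball, h, reward, agent.value(h + 1, next_state))
            state = next_state
```

The exploration bonus in this sketch combines a count-based term with a Lipschitz term proportional to the ball's radius, which is what lets the discretization adapt to the metric geometry rather than to a fixed grid.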
Related papers
- Provably Adaptive Average Reward Reinforcement Learning for Metric Spaces [2.2984209387877628]
We develop ZoRL, an algorithm that adaptively discretizes the state-action space and zooms into its promising regions.
ZoRL outperforms other state-of-the-art algorithms in experiments.
arXiv Detail & Related papers (2024-10-25T18:14:42Z)
- Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z)
- Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with General Utilities [12.104551746465932]
We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while satisfying their own safety constraints.
Our algorithm converges to a first-order stationary point (FOSP) at the rate of $\mathcal{O}\left(T^{-2/3}\right)$.
In the sample-based setting, we demonstrate that, with high probability, our algorithm requires $\widetilde{\mathcal{O}}\left(\epsilon^{-3.5}\right)$ samples to achieve an $\epsilon$-FOSP.
arXiv Detail & Related papers (2023-05-27T20:08:35Z)
- Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z)
- k-Means Maximum Entropy Exploration [55.81894038654918]
Exploration in continuous spaces with sparse rewards is an open problem in reinforcement learning.
We introduce an artificial curiosity algorithm based on lower bounding an approximation to the entropy of the state visitation distribution.
We show that our approach is both computationally efficient and competitive on benchmarks for exploration in high-dimensional, continuous spaces.
arXiv Detail & Related papers (2022-05-31T09:05:58Z)
- Exact Reduction of Huge Action Spaces in General Reinforcement Learning [28.19950790106291]
We show how action-binarization in the non-MDP case can significantly improve Extreme State Aggregation (ESA) bounds.
We provide an upper bound on the number of states of this binarized ESA that is logarithmic in the original action-space size, a double-exponential improvement.
arXiv Detail & Related papers (2020-12-18T12:45:03Z)
- Value Function Approximations via Kernel Embeddings for No-Regret Reinforcement Learning [10.828727066443909]
We propose an online model-based RL algorithm, CME-RL, that learns representations of transition distributions as embeddings in a reproducing kernel Hilbert space.
We demonstrate the efficiency of our algorithm by proving a frequentist (worst-case) regret bound of order $\tilde{O}\big(H\gamma_N\sqrt{N}\big)$, where $\tilde{O}(\cdot)$ hides only absolute constants and poly-logarithmic factors.
arXiv Detail & Related papers (2020-11-16T11:40:55Z)
- Stochastic Linear Bandits with Protected Subspace [51.43660657268171]
We study a variant of the linear bandit problem wherein we optimize a linear objective function but rewards are accrued only to an unknown subspace.
In particular, at each round, the learner must choose whether to query the objective or the protected subspace alongside choosing an action.
Our algorithm, derived from the OFUL principle, uses some of the queries to get an estimate of the protected space.
arXiv Detail & Related papers (2020-11-02T14:59:39Z)
- Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs).
We propose a novel algorithm that makes use of the feature mapping and obtains a $\tilde{O}(d\sqrt{T}/(1-\gamma)^2)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z)
- Kernel-Based Reinforcement Learning: A Finite-Time Analysis [53.47210316424326]
We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards.
We empirically validate our approach in continuous MDPs with sparse rewards.
arXiv Detail & Related papers (2020-04-12T12:23:46Z)
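As a complement to the Kernel-UCBVI entry above, here is a small, hedged sketch of the kernel-smoothing idea it describes: estimate the reward at a query state-action pair by kernel-weighted averaging of past observations, and add an optimism bonus that shrinks with the effective number of nearby samples. The Gaussian kernel, bandwidth, bonus constant, and toy data are assumptions for illustration, not the Kernel-UCBVI algorithm itself.

```python
# Optimistic kernel (Nadaraya-Watson) reward estimate -- an illustrative sketch.
import math

def gaussian_kernel(dist, bandwidth=0.1):
    return math.exp(-0.5 * (dist / bandwidth) ** 2)

def optimistic_reward(query, history, bonus_scale=1.0, bandwidth=0.1):
    """history: list of ((s, a), r) tuples; query: an (s, a) pair."""
    weight_sum, weighted_rewards = 0.0, 0.0
    for (s, a), r in history:
        w = gaussian_kernel(math.hypot(query[0] - s, query[1] - a), bandwidth)
        weight_sum += w
        weighted_rewards += w * r
    # kernel-weighted average plus a bonus that shrinks with the effective sample count
    estimate = weighted_rewards / weight_sum if weight_sum > 0 else 0.0
    bonus = bonus_scale / math.sqrt(1.0 + weight_sum)
    return min(1.0, estimate + bonus)   # rewards assumed bounded in [0, 1]

# A query near past data gets a tight estimate; a far-away query stays optimistic.
history = [((0.50, 0.50), 0.8), ((0.52, 0.48), 0.7)]
print(optimistic_reward((0.50, 0.50), history))
print(optimistic_reward((0.90, 0.10), history))
```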
This list is automatically generated from the titles and abstracts of the papers listed on this site.