A General Framework for Sequential Decision-Making under Adaptivity
Constraints
- URL: http://arxiv.org/abs/2306.14468v3
- Date: Wed, 6 Dec 2023 08:59:24 GMT
- Title: A General Framework for Sequential Decision-Making under Adaptivity
Constraints
- Authors: Nuoya Xiong, Zhaoran Wang, Zhuoran Yang
- Abstract summary: We study general sequential decision-making under two adaptivity constraints: rare policy switch and batch learning.
For the rare policy switch constraint, we provide an algorithm that achieves a $\widetilde{\mathcal{O}}(\log K)$ switching cost with a $\widetilde{\mathcal{O}}(\sqrt{K})$ regret.
For the batch learning constraint, we provide an algorithm that achieves a $\widetilde{\mathcal{O}}(\sqrt{K}+K/B)$ regret with the number of batches $B$.
- Score: 112.79808969568253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We take the first step in studying general sequential decision-making under
two adaptivity constraints: rare policy switch and batch learning. First, we
provide a general class called the Eluder Condition class, which includes a
wide range of reinforcement learning classes. Then, for the rare policy switch
constraint, we provide a generic algorithm to achieve a
$\widetilde{\mathcal{O}}(\log K) $ switching cost with a
$\widetilde{\mathcal{O}}(\sqrt{K})$ regret on the EC class. For the batch
learning constraint, we provide an algorithm that achieves a
$\widetilde{\mathcal{O}}(\sqrt{K}+K/B)$ regret with the number of batches $B.$
This paper is the first work to consider rare policy switch and batch learning
under general function classes, covering nearly all the models studied in
previous works, such as tabular MDP (Bai et al. 2019; Zhang et al. 2020),
linear MDP (Wang et al. 2021; Gao et al. 2021), low eluder dimension MDP (Kong
et al. 2021; Gao et al. 2021), generalized linear function approximation (Qiao
et al. 2023), and also some new classes such as the low $D_\Delta$-type Bellman
eluder dimension problem, linear mixture MDP, kernelized nonlinear regulator
and undercomplete partially observed Markov decision process (POMDP).
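To make the two adaptivity constraints concrete, the following is a minimal schematic sketch. It is only illustrative: the `information`, `fit_policy`, and `collect_episode` callables are hypothetical placeholders, and the doubling-style switching trigger mirrors prior low-switching-cost work (e.g., Bai et al. 2019; Gao et al. 2021) rather than the paper's actual algorithm for the Eluder Condition class.

```python
# Schematic sketch of the two adaptivity constraints (illustrative only;
# the callables are hypothetical placeholders, not the paper's EC-class algorithm).
from typing import Callable, Tuple


def rare_policy_switch(K: int,
                       information: Callable[[list], float],
                       fit_policy: Callable[[list], object],
                       collect_episode: Callable[[object], tuple]) -> Tuple[object, int]:
    """Lazy updates: re-fit the policy only when an information measure
    (e.g. a covariance-matrix determinant in the linear case) has doubled
    since the last switch; a doubling trigger can fire only O(log K) times."""
    data: list = []
    policy = fit_policy(data)
    info_at_switch = max(information(data), 1e-12)
    switches = 0
    for _ in range(K):
        data.append(collect_episode(policy))
        if information(data) >= 2.0 * info_at_switch:
            policy = fit_policy(data)          # policy switch
            info_at_switch = information(data)
            switches += 1
    return policy, switches


def batch_learning(K: int, B: int,
                   fit_policy: Callable[[list], object],
                   collect_episode: Callable[[object], tuple]) -> object:
    """Batch constraint: the K episodes are split into B batches fixed in
    advance, and the policy may only be updated at batch boundaries."""
    data: list = []
    policy = fit_policy(data)
    batch_size = max(1, K // B)
    for k in range(K):
        data.append(collect_episode(policy))
        if (k + 1) % batch_size == 0:          # batch boundary: allowed update
            policy = fit_policy(data)
    return policy
```

In this reading, the rare-switch trigger keeps the regret at $\widetilde{\mathcal{O}}(\sqrt{K})$ while switching only logarithmically often, and the batch schedule incurs the extra $K/B$ regret term because new data cannot influence the policy until the next batch boundary.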
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit
Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z) - Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement
Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d \sqrt{\sum_{k=1}^{K} \sigma_k^2} + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - Learning Infinite-Horizon Average-Reward Markov Decision Processes with
Constraints [39.715977181666766]
We study regret for infinite-horizon average-reward Markov Decision Processes (MDPs) under cost constraints.
Our algorithm ensures $\widetilde{O}(\sqrt{T})$ regret and constant constraint violation for ergodic MDPs.
These are the first set of provable algorithms for weakly communicating MDPs with cost constraints.
arXiv Detail & Related papers (2022-01-31T23:52:34Z) - A Model Selection Approach for Corruption Robust Reinforcement Learning [33.39130388569606]
We develop a model selection approach to tackle reinforcement learning with adversarial corruption in both transition and reward.
Our algorithm achieves a regret bound of $\widetilde{\mathcal{O}}(\min\{\frac{1}{\Delta}, \sqrt{T}\}+C)$ where $T$ is the number of episodes, $C$ is the total amount of corruption, and $\Delta$ is the reward gap between the best and the second-best policy.
arXiv Detail & Related papers (2021-10-07T15:59:01Z) - A Provably Efficient Algorithm for Linear Markov Decision Process with
Low Switching Cost [53.968049198926444]
We present the first algorithm for linear MDP with a low switching cost.
Our algorithm achieves an $\widetilde{O}\left(\sqrt{d^3H^4K}\right)$ regret bound with a near-optimal $O\left(dH\log K\right)$ global switching cost.
arXiv Detail & Related papers (2021-01-02T18:41:27Z) - Learning Infinite-horizon Average-reward MDPs with Linear Function
Approximation [44.374427255708135]
We develop new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation.
Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal $\widetilde{O}(\sqrt{T})$ regret.
Next, taking inspiration from adversarial linear bandits, we develop yet another efficient algorithm with $\widetilde{O}(\sqrt{T})$ regret.
arXiv Detail & Related papers (2020-07-23T08:23:44Z)