Fugu-MT Paper Translation (Abstract): Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency

Paper overview: Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency

arxiv url: http://arxiv.org/abs/2302.10371v1
Date: Tue, 21 Feb 2023 00:17:24 GMT
Status: Translation complete
System update date: 2023-02-22 16:54:15.580546
Title: Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency
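Title (reference translation): Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency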
Authors: Heyang Zhao and Jiafan He and Dongruo Zhou and Tong Zhang and Quanquan Gu
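Abstract summary: This paper proposes the first computationally efficient algorithm for linear bandits with heteroscedastic noise. The algorithm adapts to the unknown variance of the noise and achieves $\tilde{O}(d \sqrt{\sum_{k = 1}^K \sigma_k^2} + d)$ regret. The paper also proposes a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
Reference score (independently computed attention level): 90.40062452292091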
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, several studies (Zhou et al., 2021a; Zhang et al., 2021b; Kim et al., 2021; Zhou and Gu, 2022) have provided variance-dependent regret bounds for linear contextual bandits, which interpolate between the regret of the worst-case regime and that of the deterministic reward regime. However, these algorithms are either computationally intractable or unable to handle unknown variance of the noise. In this paper, we present a novel solution to this open problem by proposing the first computationally efficient algorithm for linear bandits with heteroscedastic noise. Our algorithm is adaptive to the unknown variance of the noise and achieves an $\tilde{O}(d \sqrt{\sum_{k = 1}^K \sigma_k^2} + d)$ regret, where $\sigma_k^2$ is the variance of the noise at round $k$, $d$ is the dimension of the contexts, and $K$ is the total number of rounds. Our results are based on an adaptive variance-aware confidence set, enabled by a new Freedman-type concentration inequality for self-normalized martingales, and on a multi-layer structure that stratifies the context vectors into different layers with different uniform upper bounds on the uncertainty. Furthermore, our approach can be extended to linear mixture Markov decision processes (MDPs) in reinforcement learning. We propose a variance-adaptive algorithm for linear mixture MDPs, which achieves a problem-dependent horizon-free regret bound that can gracefully reduce to a nearly constant regret for deterministic MDPs. Unlike existing nearly minimax optimal algorithms for linear mixture MDPs, our algorithm does not require explicit variance estimation of the transitional probabilities or the use of high-order moment estimators to attain horizon-free regret. We believe the techniques developed in this paper can have independent value for general online decision-making problems.
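Abstract (reference translation): In recent years, several studies (Zhou et al., 2021a; Zhang et al., 2021b; Kim et al., 2021; Zhou and Gu, 2022) have provided variance-dependent regret bounds for linear contextual bandits, interpolating between the regret of the worst-case regime and that of the deterministic reward regime. However, these algorithms are either computationally intractable or unable to handle unknown noise variance. In this paper, we present a new solution to this problem by proposing the first computationally efficient algorithm for linear bandits with heteroscedastic noise. The algorithm adapts to the unknown noise variance and achieves $\tilde{O}(d \sqrt{\sum_{k = 1}^K \sigma_k^2} + d)$ regret, where $\sigma_k^2$ is the noise variance at round $k$, $d$ is the dimension of the contexts, and $K$ is the total number of rounds. This work is based on an adaptive variance-aware confidence set, enabled by a new Freedman-type concentration inequality for self-normalized martingales, and on a multi-layer structure that stratifies the context vectors into layers with different uniform upper bounds on the uncertainty. Furthermore, the approach can be extended to linear mixture Markov decision processes (MDPs) in reinforcement learning. We propose a variance-adaptive algorithm for linear mixture MDPs. Unlike existing nearly minimax optimal algorithms for linear mixture MDPs, our algorithm attains horizon-free regret without requiring explicit variance estimation of the transition probabilities or high-order moment estimators. We believe the techniques developed in this paper can have independent value for general online decision-making problems.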
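To make the interpolation claim in the abstract concrete, the short Python sketch below evaluates only the leading-order expression $d \sqrt{\sum_{k=1}^K \sigma_k^2} + d$ for two noise profiles. It is an illustration under stated assumptions, not the paper's algorithm: the function name `variance_dependent_bound` is hypothetical, the constants and logarithmic factors hidden by $\tilde{O}$ are ignored, and unit variance in every round is assumed as a stand-in for the worst-case regime.

```python
import math


def variance_dependent_bound(d, variances):
    """Evaluate d * sqrt(sum_k sigma_k^2) + d.

    Hypothetical helper: constants and log factors hidden by the
    tilde-O notation are ignored, so only the scaling is meaningful.
    """
    return d * math.sqrt(sum(variances)) + d


d, K = 10, 10_000

# Assumed worst-case stand-in: sigma_k^2 = 1 in every round,
# so the expression scales like d * sqrt(K).
worst_case = variance_dependent_bound(d, [1.0] * K)

# Deterministic rewards: sigma_k^2 = 0 in every round,
# so only the additive d term remains.
deterministic = variance_dependent_bound(d, [0.0] * K)

print(worst_case, deterministic)
```

With unit variances the expression grows like $d\sqrt{K}$, while with zero variances it collapses to the additive $d$ term, which is the sense in which the bound interpolates between the two regimes.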

Related papers