Fugu-MT Paper Translation (Abstract): Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning

Paper overview: Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning

arxiv url: http://arxiv.org/abs/2605.15692v1
Date: Fri, 15 May 2026 07:28:42 GMT
Status: Translation complete
Last updated in system: 2026-05-18 21:22:26.208647
Title: Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning
Title (reference translation): Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning
Authors: Zijun Chen, Zihan Zhang
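Abstract summary: We study episodic reinforcement learning with fixed reward and transition functions and episode-dependent admissible action sets. Performance is measured by the cumulative regret against the episode-wise optimal value, $\sum_{k=1}^K [V^{*,M^k} - V^{π^k,M^k}]$. We show that the MVP algorithm extends naturally to this framework and enjoys strong theoretical guarantees.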
Reference score (attention measure computed independently by the site): 17.131069269126776
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study episodic reinforcement learning with fixed reward and transition functions, but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value, $\sum_{k=1}^K [V^{*,M^k} - V^{π^k,M^k}]$, where $M^k$ represents the action context in the $k$-th episode. We show that the MVP algorithm naturally extends to this framework and enjoys strong theoretical guarantees. In particular, we establish a minimax regret bound of $\widetilde{O}(\sqrt{SAH^3K\log L})$ for adversarial contexts, where $L$ denotes the number of possible contexts. This result implies a regret bound of $\widetilde{O}(\sqrt{SAH^3K})$ for stochastic contexts. We further translate the stochastic regret guarantee into a sample complexity bound of $\widetilde{O}(SAH^3/ε^2)$ for a fixed context distribution. In addition, we derive a gap-dependent regret bound of \[ \widetilde O\left( \inf_{p\in [0,1)} \left( \frac{1}{Δ_{\min}^{p}} + pKΔ_{\min}^{p} \right)\log K \cdot \mathrm{poly}(S,A,H) \right), \] where $Δ_{\min}^{p}$ is the global $p$-trimmed positive-gap floor over suboptimal $(h,s,a)$ triples. This bound can substantially improve upon the minimax rate when the relevant suboptimality gaps are large.
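One standard way to read the step from the stochastic regret bound to the sample complexity bound is an online-to-batch conversion. The sketch below is an illustration under assumptions, not necessarily the paper's argument: $C$ absorbs the constant and logarithmic factors hidden by $\widetilde{O}$, $\mathcal{D}$ is the fixed context distribution, and $\hat{π}$ is the rule obtained by picking one of the $K$ episode rules uniformly at random. None of these three symbols appear in the abstract, and the guarantee shown is in expectation only.

```latex
\begin{align*}
\mathbb{E}\Big[\sum_{k=1}^{K}\big(V^{*,M^k} - V^{\pi^k,M^k}\big)\Big]
  &\le C\sqrt{SAH^3K}, \\
\mathbb{E}_{M\sim\mathcal{D}}\big[V^{*,M} - V^{\hat{\pi},M}\big]
  = \frac{1}{K}\,\mathbb{E}\Big[\sum_{k=1}^{K}\big(V^{*,M^k} - V^{\pi^k,M^k}\big)\Big]
  &\le C\sqrt{\frac{SAH^3}{K}}, \\
C\sqrt{\frac{SAH^3}{K}} \le \varepsilon
  \quad&\Longleftrightarrow\quad
  K \ge \frac{C^2\,SAH^3}{\varepsilon^2}.
\end{align*}
```

The middle equality relies on the contexts being drawn independently from $\mathcal{D}$ in every episode, so that $M^k$ is independent of the data the learner has seen before episode $k$.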
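Abstract (reference translation): We study episodic reinforcement learning with fixed reward and transition functions, but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by the cumulative regret against the episode-wise optimal value, $\sum_{k=1}^K [V^{*,M^k} - V^{π^k,M^k}]$. We show that the MVP algorithm extends naturally to this framework and enjoys strong theoretical guarantees. In particular, we establish a minimax regret bound of $\widetilde{O}(\sqrt{SAH^3K\log L})$ for adversarial contexts. This result implies a regret bound of $\widetilde{O}(\sqrt{SAH^3K})$ for stochastic contexts. We further translate the stochastic regret guarantee into a sample complexity bound of $\widetilde{O}(SAH^3/ε^2)$ for a fixed context distribution. In addition, we derive a gap-dependent regret bound of \[ \widetilde O\left( \inf_{p\in [0,1)} \left( \frac{1}{Δ_{\min}^{p}} + pKΔ_{\min}^{p} \right)\log K \cdot \mathrm{poly}(S,A,H) \right), \] where $Δ_{\min}^{p}$ is the global $p$-trimmed positive-gap floor over suboptimal $(h,s,a)$ triples. This bound can improve substantially on the minimax rate when the relevant suboptimality gaps are large.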
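The regret notion in the abstract can be made concrete on a toy tabular problem. The following is a minimal sketch, not the MVP algorithm: it only evaluates $\sum_{k} [V^{*,M^k} - V^{π^k,M^k}]$ for given policies. The array layout (`P[s, a, s']`, `R[s, a]`), the encoding of each context $M^k$ as a boolean mask over state-action pairs, and all function names are assumptions made for illustration.

```python
import numpy as np

def optimal_value(P, R, mask, H):
    """Backward induction restricted to the admissible actions in `mask`.

    P: (S, A, S) transition probabilities, R: (S, A) rewards,
    mask: (S, A) booleans, True where the action is admissible.
    Returns the optimal H-step value for every start state.
    """
    assert mask.any(axis=1).all(), "every state needs an admissible action"
    V = np.zeros(R.shape[0])
    for _ in range(H):
        Q = R + P @ V                      # (S, A) one-step lookahead
        Q = np.where(mask, Q, -np.inf)     # rule out inadmissible actions
        V = Q.max(axis=1)
    return V

def policy_value(P, R, policy, H):
    """H-step value of a deterministic policy, policy[h, s] = action index."""
    S = R.shape[0]
    V = np.zeros(S)
    for h in reversed(range(H)):
        Q = R + P @ V
        V = Q[np.arange(S), policy[h]]
    return V

def cumulative_regret(P, R, masks, policies, H, s0=0):
    """Sum over episodes of V^{*,M^k}(s0) - V^{pi^k,M^k}(s0)."""
    return float(sum(
        optimal_value(P, R, m, H)[s0] - policy_value(P, R, pi, H)[s0]
        for m, pi in zip(masks, policies)
    ))

# Toy instance: two absorbing states; in state 0, action 1 pays reward 1.
H = 2
P = np.zeros((2, 2, 2))
P[0, :, 0] = 1.0
P[1, :, 1] = 1.0
R = np.array([[0.0, 1.0], [0.0, 0.0]])

full = np.ones((2, 2), dtype=bool)                # episode 1: both actions allowed
restricted = np.array([[True, False], [True, True]])  # episode 2: action 1 removed in state 0
always_zero = np.zeros((H, 2), dtype=int)         # policy that always plays action 0

total = cumulative_regret(P, R, [full, restricted], [always_zero, always_zero], H)
print(total)  # 2.0
```

Playing action 0 costs 2 in the first episode, where action 1 is available, and costs nothing in the second, where it is not. This is why the benchmark is the episode-wise optimum rather than a single fixed comparator.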
