Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary
Environments
- URL: http://arxiv.org/abs/2205.13044v1
- Date: Wed, 25 May 2022 20:29:01 GMT
- Title: Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary
Environments
- Authors: Liyu Chen and Haipeng Luo
- Abstract summary: We initiate the study of dynamic regret minimization for goal-oriented reinforcement learning.
The different roles of $\Delta_c$ and $\Delta_P$ in this lower bound inspire us to design algorithms that estimate costs and transitions separately.
- Score: 40.027926921772355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We initiate the study of dynamic regret minimization for goal-oriented
reinforcement learning modeled by a non-stationary stochastic shortest path
problem with changing cost and transition functions. We start by establishing a
lower bound $\Omega((B_{\star} SAT_{\star}(\Delta_c +
B_{\star}^2\Delta_P))^{1/3}K^{2/3})$, where $B_{\star}$ is the maximum expected
cost of the optimal policy of any episode starting from any state, $T_{\star}$
is the maximum hitting time of the optimal policy of any episode starting from
the initial state, $SA$ is the number of state-action pairs, $\Delta_c$ and
$\Delta_P$ are the amount of changes of the cost and transition functions
respectively, and $K$ is the number of episodes. The different roles of
$\Delta_c$ and $\Delta_P$ in this lower bound inspire us to design algorithms
that estimate costs and transitions separately. Specifically, assuming the
knowledge of $\Delta_c$ and $\Delta_P$, we develop a simple but sub-optimal
algorithm and another more involved minimax optimal algorithm (up to
logarithmic terms). These algorithms combine the ideas of finite-horizon
approximation [Chen et al., 2022a], special Bernstein-style bonuses of the MVP
algorithm [Zhang et al., 2020], adaptive confidence widening [Wei and Luo,
2021], as well as some new techniques such as properly penalizing long-horizon
policies. Finally, when $\Delta_c$ and $\Delta_P$ are unknown, we develop a
variant of the MASTER algorithm [Wei and Luo, 2021] and integrate the
aforementioned ideas into it to achieve $\widetilde{O}(\min\{B_{\star}
S\sqrt{ALK},
(B_{\star}^2S^2AT_{\star}(\Delta_c+B_{\star}\Delta_P))^{1/3}K^{2/3}\})$ regret,
where $L$ is the unknown number of changes of the environment.
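As a purely illustrative sketch (not taken from the paper), the final regret guarantee is the minimum of two terms: a $\sqrt{K}$-type term scaling with the number of changes $L$, and a $K^{2/3}$-type term scaling with the variation budgets $\Delta_c$ and $\Delta_P$. The hypothetical parameter values below are assumptions chosen only to show which term dominates; logarithmic factors are omitted.

```python
import math

def regret_bound(B, S, A, T, K, L, delta_c, delta_P):
    """Evaluate the two terms inside the min of the stated bound
    O~(min{ B*S*sqrt(A*L*K), (B^2 S^2 A T (Dc + B*Dp))^(1/3) K^(2/3) })
    and return their minimum (log factors dropped)."""
    term_sqrt_k = B * S * math.sqrt(A * L * K)
    term_k_23 = (B**2 * S**2 * A * T * (delta_c + B * delta_P)) ** (1 / 3) * K ** (2 / 3)
    return min(term_sqrt_k, term_k_23)

# Hypothetical instance: with few changes (L = 3), the sqrt(K) term wins.
print(regret_bound(B=2, S=10, A=4, T=20, K=10000, L=3, delta_c=1.0, delta_P=0.5))
```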
Related papers
- Regret-Optimal Federated Transfer Learning for Kernel Regression with Applications in American Option Pricing [8.723136784230906]
We propose an optimal iterative scheme for federated transfer learning, where a central planner has access to datasets.
Our objective is to minimize the cumulative deviation of the generated parameters $\{\theta_i(t)\}_{t=0}^{T}$ across all $T$ iterations.
By leveraging symmetries within the regret-optimal algorithm, we develop a nearly regret-optimal algorithm that runs with $\mathcal{O}(Np^2)$ fewer elementary operations.
arXiv Detail & Related papers (2023-09-08T19:17:03Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches.
We design a computationally efficient algorithm to achieve a near-optimal regret of $\widetilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\widetilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs)
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Improved No-Regret Algorithms for Stochastic Shortest Path with Linear
MDP [31.62899359543925]
We introduce two new no-regret algorithms for the shortest path (SSP) problem with a linear MDP.
Our first algorithm is computationally efficient and achieves a regret bound $\widetilde{O}(\sqrt{d^3 B_{\star}^2 T_{\star} K})$.
Our second algorithm is computationally inefficient but achieves the first "horizon-free" regret bound $\widetilde{O}(d^{3.5} B_{\star}\sqrt{K})$ with no dependency on $T_{\star}$.
arXiv Detail & Related papers (2021-12-18T06:47:31Z) - Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde{O}(B_{\star}\sqrt{|S||A|K})$, where $B_{\star}$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
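To illustrate the reduction idea mentioned above, here is a minimal sketch (an assumption-laden toy, not the paper's actual algorithm): an SSP instance can be approximated by $H$-step backward induction, charging a terminal penalty if the goal has not been reached by the horizon. All names and parameter values are hypothetical.

```python
def finite_horizon_ssp_values(P, c, goal, H, penalty):
    """H-step backward induction on an SSP instance.
    P[s][a][t] -- transition probabilities, c[s][a] -- costs.
    States that miss the goal by step H are charged `penalty`."""
    S = len(c)
    V = [penalty] * S          # terminal cost at the horizon
    V[goal] = 0.0
    for _ in range(H):
        new_V = []
        for s in range(S):
            if s == goal:
                new_V.append(0.0)   # goal is absorbing and cost-free
                continue
            new_V.append(min(
                c[s][a] + sum(P[s][a][t] * V[t] for t in range(S))
                for a in range(len(c[s]))
            ))
        V = new_V
    return V

# Toy 3-state chain: one action moves one step toward goal state 2 at cost 1.
P = [[[0, 1, 0]], [[0, 0, 1]], [[0, 0, 1]]]
c = [[1.0], [1.0], [0.0]]
print(finite_horizon_ssp_values(P, c, goal=2, H=5, penalty=10.0))
```

With a horizon longer than the hitting time, the truncated values match the true SSP costs-to-go (here 2, 1, 0 for the three states).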
arXiv Detail & Related papers (2021-03-24T10:11:49Z) - Finding the Stochastic Shortest Path with Low Regret: The Adversarial
Cost and Unknown Transition Case [29.99619764839178]
We make significant progress toward the shortest path problem with adversarial costs and unknown transition.
Specifically, we develop algorithms that achieve $\widetilde{O}(\sqrt{S^3 A^2 D T_{\star} K})$ regret for the full-information setting.
We are also the first to consider the most challenging combination: bandit feedback with adversarial costs and unknown transition.
arXiv Detail & Related papers (2021-02-10T06:33:04Z) - Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions.
We give a new efficient algorithm, Staged Sampling + Truncated Planning, which interacts with the environment at most $O\left(\frac{S^2 A}{\epsilon^2}\,\mathrm{polylog}\left(\frac{SAH}{\epsilon^2}\right)\right)$ episodes.
arXiv Detail & Related papers (2020-10-12T17:51:19Z) - Near-optimal Regret Bounds for Stochastic Shortest Path [63.029132134792555]
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost.
We show that any learning algorithm must have at least $\Omega(B_{\star}\sqrt{|S||A|K})$ regret in the worst case.
arXiv Detail & Related papers (2020-02-23T09:10:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.