Model-Free Non-Stationary RL: Near-Optimal Regret and Applications in
Multi-Agent RL and Inventory Control
- URL: http://arxiv.org/abs/2010.03161v4
- Date: Sat, 20 Aug 2022 02:54:31 GMT
- Title: Model-Free Non-Stationary RL: Near-Optimal Regret and Applications in
Multi-Agent RL and Inventory Control
- Authors: Weichao Mao, Kaiqing Zhang, Ruihao Zhu, David Simchi-Levi, Tamer
Başar
- Abstract summary: We propose Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), the first model-free algorithm for non-stationary RL.
We show that our algorithms are \emph{nearly optimal} by establishing an information-theoretical lower bound of $\Omega(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H^{\frac{2}{3}} T^{\frac{2}{3}})$, the first lower bound in non-stationary RL.
- Score: 28.80743320843154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider model-free reinforcement learning (RL) in non-stationary Markov
decision processes. Both the reward functions and the state transition
functions are allowed to vary arbitrarily over time as long as their cumulative
variations do not exceed certain variation budgets. We propose Restarted
Q-Learning with Upper Confidence Bounds (RestartQ-UCB), the first model-free
algorithm for non-stationary RL, and show that it outperforms existing
solutions in terms of dynamic regret. Specifically, RestartQ-UCB with
Freedman-type bonus terms achieves a dynamic regret bound of
$\widetilde{O}(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H
T^{\frac{2}{3}})$, where $S$ and $A$ are the numbers of states and actions,
respectively, $\Delta>0$ is the variation budget, $H$ is the number of time
steps per episode, and $T$ is the total number of time steps. We further
present a parameter-free algorithm named Double-Restart Q-UCB that does not
require prior knowledge of the variation budget. We show that our algorithms
are \emph{nearly optimal} by establishing an information-theoretical lower
bound of $\Omega(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}}
H^{\frac{2}{3}} T^{\frac{2}{3}})$, the first lower bound in non-stationary RL.
Numerical experiments validate the advantages of RestartQ-UCB in terms of both
cumulative rewards and computational efficiency. We demonstrate the power of
our results in examples of multi-agent RL and inventory control across related
products.
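For intuition, the sketch below shows the overall structure the abstract describes: optimistic Q-learning that is periodically restarted so that stale estimates cannot accumulate when rewards and transitions drift. It is a minimal illustration only, written with the simpler Hoeffding-type bonuses rather than the Freedman-type bonuses behind the sharper bound; the environment interface, restart schedule, and bonus constant are placeholder assumptions, not the paper's exact tuning.

```python
import numpy as np

def restartq_ucb_sketch(env, S, A, H, K, num_epochs, c_bonus=1.0, delta=0.01):
    """Minimal sketch of restarted optimistic Q-learning (Hoeffding-type bonuses).

    `env` is a hypothetical episodic environment exposing reset() -> state and
    step(action) -> (next_state, reward), with integer states in [0, S) and
    actions in [0, A); the restart schedule below is a placeholder, not the
    tuning used to prove the dynamic regret bound.
    """
    iota = np.log(S * A * K * H / delta)                 # log factor inside the bonus
    episodes_per_epoch = max(1, int(np.ceil(K / num_epochs)))
    total_return = 0.0

    for k in range(K):
        if k % episodes_per_epoch == 0:                  # restart: discard all estimates
            Q = np.full((H, S, A), float(H))             # optimistic initialization
            V = np.zeros((H + 1, S))
            V[:H] = H
            N = np.zeros((H, S, A), dtype=int)

        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))                  # act greedily w.r.t. optimistic Q
            s_next, r = env.step(a)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)                    # standard Q-learning step size
            bonus = c_bonus * np.sqrt(H**3 * iota / t)   # Hoeffding-type exploration bonus
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + V[h + 1, s_next] + bonus)
            V[h, s] = min(float(H), Q[h, s].max())
            total_return += r
            s = s_next
    return total_return
```

The restarts are what turn a stationary optimistic Q-learning guarantee into a dynamic regret guarantee: each epoch faces only a bounded amount of variation, and the number of epochs is chosen against the variation budget $\Delta$ (or, in the Double-Restart variant, without knowing $\Delta$).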
Related papers
- Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z)
- Near-Optimal Adversarial Reinforcement Learning with Switching Costs [43.895798638743784]
We show how to develop a provably efficient algorithm for adversarial RL with switching costs.
Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best regret achieved in adversarial RL without switching costs is no longer achievable.
We propose two novel switching-reduced algorithms with regrets that match our lower bound when the transition function is known.
arXiv Detail & Related papers (2023-02-08T23:41:29Z)
- Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds [24.571530193140916]
We consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of return.
We propose two novel DRL algorithms that implement optimism through two different schemes, including a model-free one and a model-based one.
We prove that they both attain a $\widetilde{\mathcal{O}}\big(\frac{\exp(|\beta| H)-1}{|\beta|} H\sqrt{S^2 A K}\big)$ regret upper bound, where $S$, $A$, $K$, and $H$ represent the number of states, the number of actions, the number of episodes, and the horizon length, respectively.
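For reference, the entropic risk measure used as the objective above is commonly defined (this is the standard definition, not quoted from the paper) for a return $X$ and risk parameter $\beta \neq 0$ as
\[
\mathrm{EntRM}_{\beta}(X) \;=\; \frac{1}{\beta}\,\log \mathbb{E}\big[e^{\beta X}\big],
\]
which recovers the risk-neutral value $\mathbb{E}[X]$ as $\beta \to 0$, is risk-seeking for $\beta > 0$, and risk-averse for $\beta < 0$.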
arXiv Detail & Related papers (2022-10-25T14:30:48Z)
- Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z)
- Optimal Dynamic Regret in LQR Control [23.91519151164528]
We consider the problem of nonstochastic control with a sequence of quadratic losses, i.e., LQR control.
We provide an online algorithm that achieves an optimal dynamic (policy) regret of $\widetilde{O}\big(\max\{n^{1/3}\,\mathcal{TV}(M_{1:n})^{2/3},\, 1\}\big)$.
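Reading the notation under the usual nonstochastic-control conventions (an assumption on our part, since the snippet does not define it), $n$ is the horizon, $M_{1:n}$ is a comparator sequence of controller parameters, and the bound scales with the path length of that sequence, e.g.
\[
\mathcal{TV}(M_{1:n}) \;=\; \sum_{t=2}^{n} \|M_t - M_{t-1}\|,
\]
so the dynamic regret adapts to how much the best controller drifts over time.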
arXiv Detail & Related papers (2022-06-18T18:00:21Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
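A common way to formalize such feedback (a standard modeling assumption, not quoted from this abstract) is a Bradley-Terry style link: for a pair of trajectories $\tau^{1}, \tau^{0}$ generated under an unobserved reward $r$, the overseer prefers $\tau^{1}$ with probability
\[
\Pr\big[\tau^{1} \succ \tau^{0}\big] \;=\; \frac{\exp\big(\sum_{h=1}^{H} r(s_h^{1}, a_h^{1})\big)}{\exp\big(\sum_{h=1}^{H} r(s_h^{1}, a_h^{1})\big) + \exp\big(\sum_{h=1}^{H} r(s_h^{0}, a_h^{0})\big)},
\]
so the learner must recover a reward or value ordering purely from pairwise comparisons.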
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an \emph{early-settled} reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
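Schematically, the reference idea this line of work builds on replaces the plain Q-learning target with a variance-reduced one (the exact bonuses and the settling rule are the paper's; the display below is only the generic template, with assumed symbols):
\[
Q_h(s,a) \,\leftarrow\, (1-\alpha_t)\,Q_h(s,a) + \alpha_t\Big[r + \big(V_{h+1}(s') - V^{\mathrm{ref}}_{h+1}(s')\big) + \widehat{\mu}^{\mathrm{ref}}_h(s,a) + b_t\Big],
\]
where $V^{\mathrm{ref}}$ is a reference value function that is frozen ("settled") once it is estimated accurately enough, $\widehat{\mu}^{\mathrm{ref}}_h(s,a)$ is a running estimate of $\mathbb{E}[V^{\mathrm{ref}}_{h+1}(s')]$, and $b_t$ is an exploration bonus.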
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
- A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes [13.877420496703627]
This paper presents the first \emph{model-free}, \emph{simulator-free} reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation.
The algorithm is named Triple-Q because it has three key components: a Q-function for the cumulative reward, a Q-function for the cumulative utility for the constraint, and a virtual-Queue that (over)-estimates the cumulative constraint violation.
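The snippet below sketches how those three components could fit together at decision time: the virtual queue scales up the utility Q-function whenever the constraint looks at risk of being violated. It is only a schematic under assumed parameter names (eta, rho, eps), not the paper's exact update rules.

```python
import numpy as np

def triple_q_action(Q_r, Q_c, Z, eta, h, s):
    """Pick the action trading off reward-Q against utility-Q, weighted by the
    virtual queue Z; eta is a placeholder trade-off parameter."""
    return int(np.argmax(Q_r[h, s] + (Z / eta) * Q_c[h, s]))

def virtual_queue_update(Z, estimated_utility, rho, eps):
    """Grow the queue when the estimated utility falls short of the constraint
    threshold rho (eps is a slack term); never let it drop below zero."""
    return max(Z + rho + eps - estimated_utility, 0.0)
```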
arXiv Detail & Related papers (2021-06-03T03:53:27Z)
- Provably Efficient Reinforcement Learning with Linear Function Approximation Under Adaptivity Constraints [94.76881135901753]
We consider two popular limited adaptivity models: the batch learning model and the rare policy switch model.
Our proposed LSVI-UCB-Batch algorithm achieves an $\widetilde{O}(\sqrt{d^3 H^3 T} + dHT/B)$ regret.
For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $\widetilde{O}(\sqrt{d^3 H^3 T [1+T/(dH)]^{dH/B}})$ regret.
arXiv Detail & Related papers (2021-01-06T18:56:07Z)
- Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\widetilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
- Learning in Markov Decision Processes under Constraints [34.03325546283489]
We consider reinforcement learning in Markov Decision Processes in which an agent repeatedly interacts with an environment that is modeled by a controlled Markov process.
We design model-based RL algorithms that maximize the cumulative reward earned over a time horizon of $T$ time-steps.
We show how to reduce the regret of a desired subset of the $M$ costs, at the expense of increasing the regrets of rewards and the remaining costs.
arXiv Detail & Related papers (2020-02-27T20:58:39Z)