Model-Free Non-Stationary RL: Near-Optimal Regret and Applications in
Multi-Agent RL and Inventory Control
- URL: http://arxiv.org/abs/2010.03161v4
- Date: Sat, 20 Aug 2022 02:54:31 GMT
- Title: Model-Free Non-Stationary RL: Near-Optimal Regret and Applications in
Multi-Agent RL and Inventory Control
- Authors: Weichao Mao, Kaiqing Zhang, Ruihao Zhu, David Simchi-Levi, Tamer
Başar
- Abstract summary: We propose Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), the first model-free algorithm for non-stationary RL.
We show that our algorithms are \emph{nearly optimal} by establishing an information-theoretical lower bound of $\Omega(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H^{\frac{2}{3}} T^{\frac{2}{3}})$, the first lower bound in non-stationary RL.
- Score: 28.80743320843154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider model-free reinforcement learning (RL) in non-stationary Markov
decision processes. Both the reward functions and the state transition
functions are allowed to vary arbitrarily over time as long as their cumulative
variations do not exceed certain variation budgets. We propose Restarted
Q-Learning with Upper Confidence Bounds (RestartQ-UCB), the first model-free
algorithm for non-stationary RL, and show that it outperforms existing
solutions in terms of dynamic regret. Specifically, RestartQ-UCB with
Freedman-type bonus terms achieves a dynamic regret bound of
$\widetilde{O}(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H
T^{\frac{2}{3}})$, where $S$ and $A$ are the numbers of states and actions,
respectively, $\Delta>0$ is the variation budget, $H$ is the number of time
steps per episode, and $T$ is the total number of time steps. We further
present a parameter-free algorithm named Double-Restart Q-UCB that does not
require prior knowledge of the variation budget. We show that our algorithms
are \emph{nearly optimal} by establishing an information-theoretical lower
bound of $\Omega(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}}
H^{\frac{2}{3}} T^{\frac{2}{3}})$, the first lower bound in non-stationary RL.
Numerical experiments validate the advantages of RestartQ-UCB in terms of both
cumulative rewards and computational efficiency. We demonstrate the power of
our results in examples of multi-agent RL and inventory control across related
products.
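For intuition, the sketch below shows the overall structure the abstract describes: optimistic Q-learning that is periodically restarted so that stale estimates cannot accumulate when rewards and transitions drift. It is a minimal illustration only, written with the simpler Hoeffding-type bonuses rather than the Freedman-type bonuses behind the sharper bound; the environment interface, restart schedule, and bonus constant are placeholder assumptions, not the paper's exact tuning.

```python
import numpy as np

def restartq_ucb_sketch(env, S, A, H, K, num_epochs, c_bonus=1.0, delta=0.01):
    """Minimal sketch of restarted optimistic Q-learning (Hoeffding-type bonuses).

    `env` is a hypothetical episodic environment exposing reset() -> state and
    step(action) -> (next_state, reward), with integer states in [0, S) and
    actions in [0, A); the restart schedule below is a placeholder, not the
    tuning used to prove the dynamic regret bound.
    """
    iota = np.log(S * A * K * H / delta)                 # log factor inside the bonus
    episodes_per_epoch = max(1, int(np.ceil(K / num_epochs)))
    total_return = 0.0

    for k in range(K):
        if k % episodes_per_epoch == 0:                  # restart: discard all estimates
            Q = np.full((H, S, A), float(H))             # optimistic initialization
            V = np.zeros((H + 1, S))
            V[:H] = H
            N = np.zeros((H, S, A), dtype=int)

        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))                  # act greedily w.r.t. optimistic Q
            s_next, r = env.step(a)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)                    # standard Q-learning step size
            bonus = c_bonus * np.sqrt(H**3 * iota / t)   # Hoeffding-type exploration bonus
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + V[h + 1, s_next] + bonus)
            V[h, s] = min(float(H), Q[h, s].max())
            total_return += r
            s = s_next
    return total_return
```

The restarts are what turn a stationary optimistic Q-learning guarantee into a dynamic regret guarantee: each epoch faces only a bounded amount of variation, and the number of epochs is chosen against the variation budget $\Delta$ (or, in the Double-Restart variant, without knowing $\Delta$).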
Related papers
- Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z)
- Near-Optimal Adversarial Reinforcement Learning with Switching Costs [43.895798638743784]
We show how to develop a provably efficient algorithm for adversarial RL with switching costs.
Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best regret achieved in adversarial RL without switching costs is no longer achievable.
We propose two novel switching-reduced algorithms with regrets that match our lower bound when the transition function is known.
arXiv Detail & Related papers (2023-02-08T23:41:29Z)
- Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds [24.571530193140916]
We consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of return.
We propose two novel DRL algorithms that implement optimism through two different schemes, including a model-free one and a model-based one.
We prove that they both attain a $\widetilde{\mathcal{O}}\big(\frac{\exp(|\beta| H)-1}{|\beta|} H\sqrt{S^2 A K}\big)$ regret upper bound, where $S$, $A$, $K$, and $H$ represent the number of states, the number of actions, the number of episodes, and the horizon length, respectively.
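For reference, the entropic risk measure used as the objective above is commonly defined (this is the standard definition, not quoted from the paper) for a return $X$ and risk parameter $\beta \neq 0$ as
\[
\mathrm{EntRM}_{\beta}(X) \;=\; \frac{1}{\beta}\,\log \mathbb{E}\big[e^{\beta X}\big],
\]
which recovers the risk-neutral value $\mathbb{E}[X]$ as $\beta \to 0$, is risk-seeking for $\beta > 0$, and risk-averse for $\beta < 0$.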
arXiv Detail & Related papers (2022-10-25T14:30:48Z)
- Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z)
- Optimal Dynamic Regret in LQR Control [23.91519151164528]
We consider the problem of nonstochastic control with a sequence of quadratic losses, i.e., LQR control.
We provide an online algorithm that achieves an optimal dynamic (policy) regret of $\widetilde{O}\big(\max\{n^{1/3}\,\mathcal{TV}(M_{1:n})^{2/3},\, 1\}\big)$.
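Reading the notation under the usual nonstochastic-control conventions (an assumption on our part, since the snippet does not define it), $n$ is the horizon, $M_{1:n}$ is a comparator sequence of controller parameters, and the bound scales with the path length of that sequence, e.g.
\[
\mathcal{TV}(M_{1:n}) \;=\; \sum_{t=2}^{n} \|M_t - M_{t-1}\|,
\]
so the dynamic regret adapts to how much the best controller drifts over time.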
arXiv Detail & Related papers (2022-06-18T18:00:21Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
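A common way to formalize such feedback (a standard modeling assumption, not quoted from this abstract) is a Bradley-Terry style link: for a pair of trajectories $\tau^{1}, \tau^{0}$ generated under an unobserved reward $r$, the overseer prefers $\tau^{1}$ with probability
\[
\Pr\big[\tau^{1} \succ \tau^{0}\big] \;=\; \frac{\exp\big(\sum_{h=1}^{H} r(s_h^{1}, a_h^{1})\big)}{\exp\big(\sum_{h=1}^{H} r(s_h^{1}, a_h^{1})\big) + \exp\big(\sum_{h=1}^{H} r(s_h^{0}, a_h^{0})\big)},
\]
so the learner must recover a reward or value ordering purely from pairwise comparisons.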
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an \emph{early-settled} reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
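Schematically, the reference idea this line of work builds on replaces the plain Q-learning target with a variance-reduced one (the exact bonuses and the settling rule are the paper's; the display below is only the generic template, with assumed symbols):
\[
Q_h(s,a) \,\leftarrow\, (1-\alpha_t)\,Q_h(s,a) + \alpha_t\Big[r + \big(V_{h+1}(s') - V^{\mathrm{ref}}_{h+1}(s')\big) + \widehat{\mu}^{\mathrm{ref}}_h(s,a) + b_t\Big],
\]
where $V^{\mathrm{ref}}$ is a reference value function that is frozen ("settled") once it is estimated accurately enough, $\widehat{\mu}^{\mathrm{ref}}_h(s,a)$ is a running estimate of $\mathbb{E}[V^{\mathrm{ref}}_{h+1}(s')]$, and $b_t$ is an exploration bonus.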
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
- A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes [13.877420496703627]
This paper presents the first \emph{model-free}, \emph{simulator-free} reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation.
The algorithm is named Triple-Q because it has three key components: a Q-function for the cumulative reward, a Q-function for the cumulative utility for the constraint, and a virtual-Queue that (over)-estimates the cumulative constraint violation.
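The snippet below sketches how those three components could fit together at decision time: the virtual queue scales up the utility Q-function whenever the constraint looks at risk of being violated. It is only a schematic under assumed parameter names (eta, rho, eps), not the paper's exact update rules.

```python
import numpy as np

def triple_q_action(Q_r, Q_c, Z, eta, h, s):
    """Pick the action trading off reward-Q against utility-Q, weighted by the
    virtual queue Z; eta is a placeholder trade-off parameter."""
    return int(np.argmax(Q_r[h, s] + (Z / eta) * Q_c[h, s]))

def virtual_queue_update(Z, estimated_utility, rho, eps):
    """Grow the queue when the estimated utility falls short of the constraint
    threshold rho (eps is a slack term); never let it drop below zero."""
    return max(Z + rho + eps - estimated_utility, 0.0)
```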
arXiv Detail & Related papers (2021-06-03T03:53:27Z)
- Provably Efficient Reinforcement Learning with Linear Function Approximation Under Adaptivity Constraints [94.76881135901753]
We consider two popular limited adaptivity models: the batch learning model and the rare policy switch model.
Our proposed LSVI-UCB-Batch algorithm achieves an $\widetilde{O}(\sqrt{d^3 H^3 T} + dHT/B)$ regret.
For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $\widetilde{O}(\sqrt{d^3 H^3 T [1+T/(dH)]^{dH/B}})$ regret.
arXiv Detail & Related papers (2021-01-06T18:56:07Z)
- Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\widetilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
- Learning in Markov Decision Processes under Constraints [34.03325546283489]
We consider reinforcement learning in Markov Decision Processes in which an agent repeatedly interacts with an environment that is modeled by a controlled Markov process.
We design model-based RL algorithms that maximize the cumulative reward earned over a time horizon of $T$ time-steps.
We show how to reduce the regret of a desired subset of the $M$ costs, at the expense of increasing the regrets of rewards and the remaining costs.
arXiv Detail & Related papers (2020-02-27T20:58:39Z)