Dynamic Planning and Learning under Recovering Rewards
- URL: http://arxiv.org/abs/2106.14813v1
- Date: Mon, 28 Jun 2021 15:40:07 GMT
- Title: Dynamic Planning and Learning under Recovering Rewards
- Authors: David Simchi-Levi, Zeyu Zheng, Feng Zhu
- Abstract summary: We propose, construct and prove performance guarantees for a class of "Purely Periodic Policies"
Our framework and policy design may have the potential to be adapted into other offline planning and online learning applications.
- Score: 18.829837839926956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by emerging applications such as live-streaming e-commerce,
promotions and recommendations, we introduce a general class of multi-armed
bandit problems that have the following two features: (i) the decision maker
can pull and collect rewards from at most $K$ out of $N$ different arms in each
time period; (ii) the expected reward of an arm immediately drops after it is
pulled, and then nonparametrically recovers as the idle time increases. With
the objective of maximizing expected cumulative rewards over $T$ time periods,
we propose, construct and prove performance guarantees for a class of "Purely
Periodic Policies". For the offline problem when all model parameters are
known, our proposed policy obtains an approximation ratio of the order
of $1-\mathcal O(1/\sqrt{K})$, which is asymptotically optimal when $K$ grows
to infinity. For the online problem when the model parameters are unknown and
need to be learned, we design an Upper Confidence Bound (UCB) based policy that
approximately has $\widetilde{\mathcal O}(N\sqrt{T})$ regret against the
offline benchmark. Our framework and policy design may have the potential to be
adapted into other offline planning and online learning applications with
non-stationary and recovering rewards.
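As a rough, non-authoritative illustration of the setting, the Python sketch below simulates a purely periodic policy: every arm is pulled at a fixed period and offset so that at most $K$ arms are played per time step. The saturating-exponential recovery curve and the naive common period are invented for this example; the paper's construction chooses the periods far more carefully to obtain the $1-\mathcal O(1/\sqrt{K})$ approximation ratio.

```python
import numpy as np

# --- Illustrative setup (not from the paper): N arms, at most K pulls per period.
# Each arm's expected reward is a nondecreasing function of its idle time
# ("recovering reward"); a saturating exponential is used purely as a stand-in.
rng = np.random.default_rng(0)
N, K, T = 8, 2, 200
r_max = rng.uniform(0.5, 1.0, size=N)          # asymptotic reward of each arm
rate = rng.uniform(0.1, 0.5, size=N)           # recovery speed of each arm

def expected_reward(arm, idle):
    """Expected reward of `arm` after `idle` periods without a pull."""
    return r_max[arm] * (1.0 - np.exp(-rate[arm] * idle))

# --- A simple purely periodic schedule: pull arm i every d_i periods at a fixed
# offset, so that no more than K arms are scheduled in any single period.
periods = np.full(N, int(np.ceil(N / K)))      # naive common period (heuristic)
offsets = np.array([i % periods[i] for i in range(N)])

def schedule(t):
    """Arms pulled at time t under the purely periodic policy (at most K)."""
    due = [i for i in range(N) if (t - offsets[i]) % periods[i] == 0]
    return due[:K]

# --- Simulate the cumulative expected reward collected by the periodic policy.
last_pull = np.full(N, -np.inf)
total = 0.0
for t in range(T):
    for arm in schedule(t):
        idle = t - last_pull[arm] if np.isfinite(last_pull[arm]) else t + 1
        total += expected_reward(arm, idle)
        last_pull[arm] = t
print(f"cumulative expected reward over T={T}: {total:.2f}")
```

The offline planning problem described in the abstract then amounts to choosing the periods (and offsets) to maximize this cumulative expected reward subject to the capacity $K$.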
Related papers
- Learning to Cover: Online Learning and Optimization with Irreversible Decisions [50.5775508521174]
We find that regret grows sub-linearly at a rate $\Theta\left(m^{\frac{1}{2}\cdot\frac{1}{1-2^{-T}}}\right)$, thus converging exponentially fast to $\Theta(\sqrt{m})$.
These findings underscore the value of limited online learning and optimization, in that even a few rounds can provide significant gains as compared to a no-learning baseline.
arXiv Detail & Related papers (2024-06-20T23:00:25Z) - Online Policy Learning and Inference by Matrix Completion [12.527541242185404]
We formulate the problem as a matrix completion bandit (MCB)
We explore the $\epsilon$-greedy bandit and online gradient descent.
A faster decaying exploration yields smaller regret but learns the optimal policy less accurately.
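For intuition about this exploration/regret trade-off, here is a generic $\epsilon$-greedy sketch on a two-armed Bernoulli bandit with exploration rate $\epsilon_t = t^{-\alpha}$; the bandit instance and decay schedule are invented, and this is not the matrix-completion bandit algorithm of the paper.

```python
import numpy as np

def eps_greedy_regret(decay, T=5000, means=(0.5, 0.55), seed=0):
    """Generic epsilon-greedy on a 2-armed Bernoulli bandit with
    exploration rate eps_t = min(1, t**(-decay)); returns cumulative regret."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(2)
    est = np.zeros(2)
    regret = 0.0
    best = max(means)
    for t in range(1, T + 1):
        eps = min(1.0, t ** (-decay))
        arm = rng.integers(2) if rng.random() < eps else int(np.argmax(est))
        reward = float(rng.random() < means[arm])
        counts[arm] += 1
        est[arm] += (reward - est[arm]) / counts[arm]   # running-mean update
        regret += best - means[arm]
    return regret

# A larger decay exponent cuts exploration sooner: cumulative regret shrinks,
# but the suboptimal arm's estimate rests on fewer samples.
for decay in (0.3, 0.6, 1.0):
    print(decay, round(eps_greedy_regret(decay), 1))
```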
arXiv Detail & Related papers (2024-04-26T13:19:27Z) - Online Resource Allocation in Episodic Markov Decision Processes [1.8416014644193066]
We formulate the problem as an online allocation problem in an episodic finite-horizon constrained Markov decision process.
We propose the observe-then-decide regime and improve the existing decide-then-observe regime.
We show that the regret against the static optimal policy that has access to the mean reward and mean resource consumption functions is bounded by $\tilde{O}(\rho^{-1}H^{3/2}S\sqrt{AT})$ with high probability.
arXiv Detail & Related papers (2023-05-18T06:28:34Z) - Autoregressive Bandits [58.46584210388307]
We propose a novel online learning setting, Autoregressive Bandits, in which the observed reward is governed by an autoregressive process of order $k$.
We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed.
We then devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}}\left(\frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^{2}}\right)$.
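To make the reward model concrete, the sketch below simulates a reward that follows an arm-dependent AR($k$) process; the coefficients, noise level, and the uniformly random placeholder policy are invented for illustration, and AR-UCB itself is not reproduced here.

```python
import numpy as np

# Minimal simulation of an arm-dependent AR(k) reward process (invented parameters).
rng = np.random.default_rng(1)
k, n_arms, T = 2, 3, 100
gamma0 = rng.uniform(0.1, 0.5, size=n_arms)            # per-arm intercepts
gamma = rng.uniform(0.0, 0.4, size=(n_arms, k))        # per-arm AR coefficients
gamma /= (gamma.sum(axis=1, keepdims=True) + 1.0)      # keep the process stable

history = [0.0] * k                                    # last k observed rewards
rewards = []
for t in range(T):
    arm = rng.integers(n_arms)                         # placeholder random policy
    recent = history[-k:][::-1]                        # most recent reward first
    mean = gamma0[arm] + float(np.dot(gamma[arm], recent))
    r = mean + rng.normal(0.0, 0.05)                   # noisy observed reward
    history.append(r)
    rewards.append(r)
print(f"average observed reward: {np.mean(rewards):.3f}")
```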
arXiv Detail & Related papers (2022-12-12T21:37:36Z) - A Unifying Framework for Online Optimization with Long-Term Constraints [62.35194099438855]
We study online learning problems in which a decision maker has to take a sequence of decisions subject to $m$ long-term constraints.
The goal is to maximize their total reward, while at the same time achieving small cumulative violation across the $T$ rounds.
We present the first best-of-both-worlds type algorithm for this general class of problems, with no-regret guarantees both in the case in which rewards and constraints are selected according to an unknown model, and in the case in which they are selected at each round by an adversary.
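As a point of reference for this class of problems, here is a generic Lagrangian (primal-dual) baseline under bandit feedback with an invented Bernoulli instance; it is only a sketch of the standard template, not the paper's best-of-both-worlds algorithm.

```python
import numpy as np

# Generic primal-dual template for online decisions subject to a long-term
# (per-round average) resource constraint; all problem parameters are invented.
rng = np.random.default_rng(4)
T, n_actions, budget = 2000, 4, 0.4          # budget = allowed average cost/round
true_r = rng.uniform(0, 1, n_actions)        # unknown reward probabilities
true_c = rng.uniform(0, 1, n_actions)        # unknown cost probabilities
est_r = np.zeros(n_actions)
est_c = np.zeros(n_actions)
counts = np.zeros(n_actions)
lam, eta = 0.0, 0.05                         # dual variable and its step size
cum_reward = cum_cost = 0.0
for t in range(T):
    if counts.min() == 0:                    # pull each action once to initialize
        a = int(np.argmin(counts))
    else:                                    # Lagrangian-greedy on current estimates
        a = int(np.argmax(est_r - lam * est_c))
    r = float(rng.random() < true_r[a])      # bandit feedback after the decision
    c = float(rng.random() < true_c[a])
    counts[a] += 1
    est_r[a] += (r - est_r[a]) / counts[a]
    est_c[a] += (c - est_c[a]) / counts[a]
    lam = max(0.0, lam + eta * (c - budget)) # dual ascent on constraint violation
    cum_reward += r
    cum_cost += c
print(f"avg reward {cum_reward / T:.3f}, avg cost {cum_cost / T:.3f} (budget {budget})")
```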
arXiv Detail & Related papers (2022-09-15T16:59:19Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED)
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
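The least-squares reward-redistribution idea can be illustrated with a small synthetic example (invented linear features and data; a generic sketch, not the PARTED procedure itself): regress trajectory returns on summed per-step features, then read off per-step proxy rewards from the fitted weights.

```python
import numpy as np

# Least-squares reward redistribution on synthetic data: each trajectory reveals
# only its total return; fit a linear per-step proxy reward phi(s,a) @ w.
rng = np.random.default_rng(3)
n_traj, horizon, d = 200, 10, 5
w_true = rng.normal(size=d)
phi = rng.normal(size=(n_traj, horizon, d))         # per-step features
returns = phi.sum(axis=1) @ w_true + rng.normal(0, 0.1, n_traj)

X = phi.sum(axis=1)                                 # trajectory-level features
w_hat, *_ = np.linalg.lstsq(X, returns, rcond=None) # least-squares fit
proxy_step_rewards = phi @ w_hat                    # per-step proxy rewards
print("max weight error:", np.abs(w_hat - w_true).max())
```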
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Online Learning with Knapsacks: the Best of Both Worlds [54.28273783164608]
We consider online learning problems in which a decision maker wants to maximize their expected reward without violating a finite set of $m$ resource constraints.
Our framework allows the decision maker to flexibly handle general reward and cost functions.
arXiv Detail & Related papers (2022-02-28T12:10:48Z) - PDE-Based Optimal Strategy for Unconstrained Online Learning [40.61498562988079]
We present a framework that generates time-varying potential functions by solving a Partial Differential Equation (PDE)
Our framework recovers some classical potentials, and more importantly provides a systematic approach to design new ones.
This is the first parameter-free algorithm with optimal leading constant.
arXiv Detail & Related papers (2022-01-19T22:21:21Z) - Provably Efficient Generative Adversarial Imitation Learning for Online
and Offline Setting with Linear Function Approximation [81.0955457177017]
In generative adversarial imitation learning (GAIL), the agent aims to learn a policy from an expert demonstration so that its performance cannot be discriminated from the expert policy on a certain reward set.
We study GAIL in both online and offline settings with linear function approximation, where both the transition and reward function are linear in the feature maps.
arXiv Detail & Related papers (2021-08-19T16:16:00Z) - Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation
for Reinforcement Learning [43.61029925616256]
Offline policy evaluation in Reinforcement Learning (RL) is a critical step towards applying RL in real-life applications.
We address this problem by simultaneously evaluating all policies in a policy class $\Pi$ -- uniform convergence in OPE.
Our results imply that the model-based planning achieves an optimal episode complexity of $\widetilde{O}(H^3/(d_m\epsilon^2))$ in identifying an $\epsilon$-optimal policy.
arXiv Detail & Related papers (2020-07-07T19:44:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.