Dynamic Planning and Learning under Recovering Rewards
- URL: http://arxiv.org/abs/2106.14813v1
- Date: Mon, 28 Jun 2021 15:40:07 GMT
- Title: Dynamic Planning and Learning under Recovering Rewards
- Authors: David Simchi-Levi, Zeyu Zheng, Feng Zhu
- Abstract summary: We propose, construct and prove performance guarantees for a class of "Purely Periodic Policies".
Our framework and policy design may have the potential to be adapted into other offline planning and online learning applications.
- Score: 18.829837839926956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by emerging applications such as live-streaming e-commerce,
promotions and recommendations, we introduce a general class of multi-armed
bandit problems that have the following two features: (i) the decision maker
can pull and collect rewards from at most $K$ out of $N$ different arms in each
time period; (ii) the expected reward of an arm immediately drops after it is
pulled, and then non-parametrically recovers as the idle time increases. With
the objective of maximizing expected cumulative rewards over $T$ time periods,
we propose, construct and prove performance guarantees for a class of "Purely
Periodic Policies". For the offline problem when all model parameters are
known, our proposed policy obtains an approximation ratio on the order of
$1-\mathcal O(1/\sqrt{K})$, which is asymptotically optimal when $K$ grows
to infinity. For the online problem when the model parameters are unknown and
need to be learned, we design an Upper Confidence Bound (UCB) based policy that
approximately has $\widetilde{\mathcal O}(N\sqrt{T})$ regret against the
offline benchmark. Our framework and policy design may have the potential to be
adapted into other offline planning and online learning applications with
non-stationary and recovering rewards.
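To make the setting concrete, here is a minimal simulation sketch of a recovering-rewards bandit driven by a purely periodic schedule. This is not the authors' algorithm: the recovery curve `mean_reward`, the arm-specific ceilings, and the hand-picked pull periods are illustrative assumptions standing in for the paper's nonparametric recovery functions and optimized period selection; the sketch only shows the mechanics of pulling at most $K$ of the $N$ arms per period on fixed periods.

```python
# Illustrative sketch (assumptions throughout), not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
N, K, T = 8, 2, 1000          # arms, pulls per period, horizon

def mean_reward(arm, idle_time):
    """Assumed concave recovery curve: expected reward rebuilds with idle time."""
    cap = 0.5 + 0.05 * arm                      # arm-specific ceiling (illustrative)
    return cap * (1.0 - np.exp(-0.3 * (idle_time + 1)))

# A purely periodic policy pulls each arm on a fixed period d_i; the assignment
# below is a heuristic stand-in for the paper's optimized period selection.
periods = np.array([2, 2, 3, 3, 4, 4, 6, 6])    # one period per arm (assumption)

idle = np.zeros(N)                               # time since each arm was last pulled
total = 0.0
for t in range(T):
    due = [i for i in range(N) if t % periods[i] == 0]
    pulls = due[:K]                              # respect the at-most-K-pulls cap
    for i in range(N):
        if i in pulls:
            total += rng.normal(mean_reward(i, idle[i]), 0.1)  # noisy observed reward
            idle[i] = 0
        else:
            idle[i] += 1

print(f"cumulative reward over T={T}: {total:.1f}")
```

An online variant in the spirit of the paper's UCB-based policy would replace the known `mean_reward` with optimistic estimates built from the rewards observed at each pulled arm's idle times.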
Related papers
- $TAR^2$: Temporal-Agent Reward Redistribution for Optimal Policy Preservation in Multi-Agent Reinforcement Learning [7.97295726921338]
Temporal-Agent Reward Redistribution ($TAR^2$) is a novel approach that decomposes sparse global rewards into agent-specific, time-step-specific components.
We show that $TAR^2$ aligns with potential-based reward shaping, preserving the same optimal policies as the original environment.
arXiv Detail & Related papers (2025-02-07T12:07:57Z) - Learning While Repositioning in On-Demand Vehicle Sharing Networks [4.724825031148413]
We consider a network inventory problem motivated by one-way, on-demand vehicle sharing services.
We show that a natural Lipschitz-bandit approach achieves a regret guarantee of $\widetilde{O}(T^{\frac{n}{n+1}})$, which suffers from the exponential dependence on $n$.
Motivated by these challenges, we propose an Online Gradient Repositioning algorithm that relies solely on censored demand.
arXiv Detail & Related papers (2025-01-31T15:16:02Z) - Accelerating Proximal Policy Optimization Learning Using Task Prediction for Solving Environments with Delayed Rewards [8.455772877963792]
We introduce two key enhancements to PPO: a hybrid policy architecture that combines an offline policy with an online PPO policy, and a reward shaping mechanism using Time Window Temporal Logic (TWTL).
We demonstrate the effectiveness of our approach through extensive experiments on inverted pendulum and lunar lander environments.
arXiv Detail & Related papers (2024-11-26T20:22:31Z) - Learning to Cover: Online Learning and Optimization with Irreversible Decisions [50.5775508521174]
We find that regret grows sub-linearly at a rate $\Theta\left(m^{\frac{1}{2}\cdot\frac{1}{1-2^{-T}}}\right)$, thus converging exponentially fast to $\Theta(\sqrt{m})$.
These findings underscore the benefits of limited online learning and optimization, in that even a few rounds can provide significant benefits as compared to a no-learning baseline.
arXiv Detail & Related papers (2024-06-20T23:00:25Z) - Online Resource Allocation in Episodic Markov Decision Processes [1.8416014644193066]
We formulate the problem as an online allocation problem in an episodic finite-horizon constrained Markov decision process.
We propose the observe-then-decide regime and improve the existing decide-then-observe regime.
We show that the regret against the static optimal policy that has access to the mean reward and mean resource consumption functions is bounded by $\tilde{O}(\rho^{-1} H^{3/2} S \sqrt{AT})$ with high probability.
arXiv Detail & Related papers (2023-05-18T06:28:34Z) - Autoregressive Bandits [58.46584210388307]
We propose a novel online learning setting, Autoregressive Bandits, in which the observed reward is governed by an autoregressive process of order $k$.
We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed.
We then devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}}\left(\frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2}\right)$, where $\Gamma < 1$ is the sum of the autoregressive coefficients.
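For reference, a generic order-$k$ autoregressive reward recursion has the form below; the notation is generic and not necessarily the paper's exact parameterization.

```latex
% Illustrative AR(k) reward dynamics (generic form, assumption):
% the next observed reward is a linear function of the k most recent rewards plus noise.
X_t = \gamma_0 + \sum_{i=1}^{k} \gamma_i \, X_{t-i} + \varepsilon_t,
\qquad \sum_{i=1}^{k} \gamma_i < 1 \ \text{(stability)}, \quad
\varepsilon_t \ \text{zero-mean noise}.
```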
arXiv Detail & Related papers (2022-12-12T21:37:36Z) - A Unifying Framework for Online Optimization with Long-Term Constraints [62.35194099438855]
We study online learning problems in which a decision maker has to take a sequence of decisions subject to $m$ long-term constraints.
The goal is to maximize their total reward, while at the same time achieving small cumulative violation across the $T$ rounds.
We present the first best-of-both-worlds type algorithm for this general class of problems, with no-regret guarantees both in the case in which rewards and constraints are selected according to an unknown model, and in the case in which they are selected at each round by an adversary.
arXiv Detail & Related papers (2022-09-15T16:59:19Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient for general MDPs with trajectory-wise rewards.
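A minimal sketch of the least-squares redistribution idea, on synthetic data and with an assumed linear per-step proxy reward; this is an illustration of the decomposition step only, not the PARTED implementation (which additionally applies pessimism in the value estimates).

```python
# Illustrative least-squares reward redistribution (assumptions throughout).
import numpy as np

rng = np.random.default_rng(1)
n_traj, horizon, d = 200, 10, 5

# Synthetic data: per-step features and trajectory-level returns generated from a
# hidden linear per-step reward (purely illustrative setup).
phi = rng.normal(size=(n_traj, horizon, d))             # step features
theta_true = rng.normal(size=d)
returns = phi.sum(axis=1) @ theta_true + 0.1 * rng.normal(size=n_traj)

# Fit theta so that the sum of per-step proxy rewards matches each trajectory return:
# return ≈ <sum_t phi_t, theta>, solved by ordinary least squares.
A = phi.sum(axis=1)                                      # (n_traj, d)
theta_hat, *_ = np.linalg.lstsq(A, returns, rcond=None)

# Per-step proxy rewards r_hat_t = <phi_t, theta_hat> for downstream value learning.
proxy_rewards = phi @ theta_hat                          # (n_traj, horizon)
print("recovered theta error:", np.linalg.norm(theta_hat - theta_true))
```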
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Online Learning with Knapsacks: the Best of Both Worlds [54.28273783164608]
We consider online learning problems in which a decision maker wants to maximize their expected reward without violating a finite set of $m$ resource constraints.
Our framework allows the decision maker to handle general reward and cost functions with flexibility.
arXiv Detail & Related papers (2022-02-28T12:10:48Z) - Provably Efficient Generative Adversarial Imitation Learning for Online and Offline Setting with Linear Function Approximation [81.0955457177017]
In generative adversarial imitation learning (GAIL), the agent aims to learn a policy from an expert demonstration so that its performance cannot be discriminated from the expert policy on a certain reward set.
We study GAIL in both online and offline settings with linear function approximation, where both the transition and reward function are linear in the feature maps.
arXiv Detail & Related papers (2021-08-19T16:16:00Z)