Dynamic Planning and Learning under Recovering Rewards
- URL: http://arxiv.org/abs/2106.14813v1
- Date: Mon, 28 Jun 2021 15:40:07 GMT
- Title: Dynamic Planning and Learning under Recovering Rewards
- Authors: David Simchi-Levi, Zeyu Zheng, Feng Zhu
- Abstract summary: We propose, construct and prove performance guarantees for a class of "Purely Periodic Policies".
Our framework and policy design may have the potential to be adapted into other offline planning and online learning applications.
- Score: 18.829837839926956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by emerging applications such as live-streaming e-commerce,
promotions and recommendations, we introduce a general class of multi-armed
bandit problems that have the following two features: (i) the decision maker
can pull and collect rewards from at most $K$ out of $N$ different arms in each
time period; (ii) the expected reward of an arm immediately drops after it is
pulled, and then non-parametrically recovers as the idle time increases. With
the objective of maximizing expected cumulative rewards over $T$ time periods,
we propose, construct and prove performance guarantees for a class of "Purely
Periodic Policies". For the offline problem when all model parameters are
known, our proposed policy obtains an approximation ratio on the order of
$1-\mathcal O(1/\sqrt{K})$, which is asymptotically optimal when $K$ grows
to infinity. For the online problem when the model parameters are unknown and
need to be learned, we design an Upper Confidence Bound (UCB) based policy that
approximately has $\widetilde{\mathcal O}(N\sqrt{T})$ regret against the
offline benchmark. Our framework and policy design may have the potential to be
adapted into other offline planning and online learning applications with
non-stationary and recovering rewards.
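To make the setting concrete, here is a minimal simulation sketch of a recovering-rewards bandit driven by a purely periodic schedule. This is not the authors' algorithm: the recovery curve `mean_reward`, the arm-specific ceilings, and the hand-picked pull periods are illustrative assumptions standing in for the paper's nonparametric recovery functions and optimized period selection; the sketch only shows the mechanics of pulling at most $K$ of the $N$ arms per period on fixed periods.

```python
# Illustrative sketch (assumptions throughout), not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
N, K, T = 8, 2, 1000          # arms, pulls per period, horizon

def mean_reward(arm, idle_time):
    """Assumed concave recovery curve: expected reward rebuilds with idle time."""
    cap = 0.5 + 0.05 * arm                      # arm-specific ceiling (illustrative)
    return cap * (1.0 - np.exp(-0.3 * (idle_time + 1)))

# A purely periodic policy pulls each arm on a fixed period d_i; the assignment
# below is a heuristic stand-in for the paper's optimized period selection.
periods = np.array([2, 2, 3, 3, 4, 4, 6, 6])    # one period per arm (assumption)

idle = np.zeros(N)                               # time since each arm was last pulled
total = 0.0
for t in range(T):
    due = [i for i in range(N) if t % periods[i] == 0]
    pulls = due[:K]                              # respect the at-most-K-pulls cap
    for i in range(N):
        if i in pulls:
            total += rng.normal(mean_reward(i, idle[i]), 0.1)  # noisy observed reward
            idle[i] = 0
        else:
            idle[i] += 1

print(f"cumulative reward over T={T}: {total:.1f}")
```

An online variant in the spirit of the paper's UCB-based policy would replace the known `mean_reward` with optimistic estimates built from the rewards observed at each pulled arm's idle times.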
Related papers
- $TAR^2$: Temporal-Agent Reward Redistribution for Optimal Policy Preservation in Multi-Agent Reinforcement Learning [7.97295726921338]
Temporal-Agent Reward Redistribution ($TAR^2$) is a novel approach that decomposes sparse global rewards into agent-specific, time-step-specific components.
We show that $TAR^2$ aligns with potential-based reward shaping, preserving the same optimal policies as the original environment.
arXiv Detail & Related papers (2025-02-07T12:07:57Z) - Learning While Repositioning in On-Demand Vehicle Sharing Networks [4.724825031148413]
We consider a network inventory problem motivated by one-way, on-demand vehicle sharing services.
We show that a natural Lipschitz-bandit approach achieves a regret guarantee of $\widetilde{O}(T^{\frac{n}{n+1}})$, which suffers from the exponential dependence on $n$.
Motivated by these challenges, we propose an Online Gradient Repositioning algorithm that relies solely on censored demand.
arXiv Detail & Related papers (2025-01-31T15:16:02Z) - Accelerating Proximal Policy Optimization Learning Using Task Prediction for Solving Environments with Delayed Rewards [8.455772877963792]
We introduce two key enhancements to PPO: a hybrid policy architecture that combines an offline policy with an online PPO policy, and a reward shaping mechanism using Time Window Temporal Logic (TWTL).
We demonstrate the effectiveness of our approach through extensive experiments on inverted pendulum and lunar lander environments.
arXiv Detail & Related papers (2024-11-26T20:22:31Z) - Learning to Cover: Online Learning and Optimization with Irreversible Decisions [50.5775508521174]
We find that regret grows sub-linearly at a rate $\Theta\left(m^{\frac{1}{2}\cdot\frac{1}{1-2^{-T}}}\right)$, thus converging exponentially fast to $\Theta(\sqrt{m})$.
These findings underscore the benefits of limited online learning and optimization, in that even a few rounds can provide significant benefits as compared to a no-learning baseline.
arXiv Detail & Related papers (2024-06-20T23:00:25Z) - Online Resource Allocation in Episodic Markov Decision Processes [1.8416014644193066]
We formulate the problem as an online allocation problem in an episodic finite-horizon constrained Markov decision process.
We propose the observe-then-decide regime and improve the existing decide-then-observe regime.
We show that the regret against the static optimal policy that has access to the mean reward and mean resource consumption functions is bounded by $\tilde{O}(\rho^{-1} H^{3/2} S \sqrt{AT})$ with high probability.
arXiv Detail & Related papers (2023-05-18T06:28:34Z) - Autoregressive Bandits [58.46584210388307]
We propose a novel online learning setting, Autoregressive Bandits, in which the observed reward is governed by an autoregressive process of order $k$.
We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed.
We then devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}}\left(\frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2}\right)$, where $\Gamma < 1$ is the sum of the autoregressive coefficients.
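For reference, a generic order-$k$ autoregressive reward recursion has the form below; the notation is generic and not necessarily the paper's exact parameterization.

```latex
% Illustrative AR(k) reward dynamics (generic form, assumption):
% the next observed reward is a linear function of the k most recent rewards plus noise.
X_t = \gamma_0 + \sum_{i=1}^{k} \gamma_i \, X_{t-i} + \varepsilon_t,
\qquad \sum_{i=1}^{k} \gamma_i < 1 \ \text{(stability)}, \quad
\varepsilon_t \ \text{zero-mean noise}.
```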
arXiv Detail & Related papers (2022-12-12T21:37:36Z) - A Unifying Framework for Online Optimization with Long-Term Constraints [62.35194099438855]
We study online learning problems in which a decision maker has to take a sequence of decisions subject to $m$ long-term constraints.
The goal is to maximize their total reward, while at the same time achieving small cumulative violation across the $T$ rounds.
We present the first best-of-both-worlds type algorithm for this general class of problems, with no-regret guarantees both in the case in which rewards and constraints are selected according to an unknown model, and in the case in which they are selected at each round by an adversary.
arXiv Detail & Related papers (2022-09-15T16:59:19Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient for general MDPs with trajectory-wise rewards.
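A minimal sketch of the least-squares redistribution idea, on synthetic data and with an assumed linear per-step proxy reward; this is an illustration of the decomposition step only, not the PARTED implementation (which additionally applies pessimism in the value estimates).

```python
# Illustrative least-squares reward redistribution (assumptions throughout).
import numpy as np

rng = np.random.default_rng(1)
n_traj, horizon, d = 200, 10, 5

# Synthetic data: per-step features and trajectory-level returns generated from a
# hidden linear per-step reward (purely illustrative setup).
phi = rng.normal(size=(n_traj, horizon, d))             # step features
theta_true = rng.normal(size=d)
returns = phi.sum(axis=1) @ theta_true + 0.1 * rng.normal(size=n_traj)

# Fit theta so that the sum of per-step proxy rewards matches each trajectory return:
# return ≈ <sum_t phi_t, theta>, solved by ordinary least squares.
A = phi.sum(axis=1)                                      # (n_traj, d)
theta_hat, *_ = np.linalg.lstsq(A, returns, rcond=None)

# Per-step proxy rewards r_hat_t = <phi_t, theta_hat> for downstream value learning.
proxy_rewards = phi @ theta_hat                          # (n_traj, horizon)
print("recovered theta error:", np.linalg.norm(theta_hat - theta_true))
```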
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Online Learning with Knapsacks: the Best of Both Worlds [54.28273783164608]
We consider online learning problems in which a decision maker wants to maximize their expected reward without violating a finite set of $m$ resource constraints.
Our framework allows the decision maker to handle general reward and cost functions with flexibility.
arXiv Detail & Related papers (2022-02-28T12:10:48Z) - Provably Efficient Generative Adversarial Imitation Learning for Online and Offline Setting with Linear Function Approximation [81.0955457177017]
In generative adversarial imitation learning (GAIL), the agent aims to learn a policy from an expert demonstration so that its performance cannot be discriminated from the expert policy on a certain reward set.
We study GAIL in both online and offline settings with linear function approximation, where both the transition and reward function are linear in the feature maps.
arXiv Detail & Related papers (2021-08-19T16:16:00Z)