Multi-Armed Bandits with Generalized Temporally-Partitioned Rewards
- URL: http://arxiv.org/abs/2303.00620v1
- Date: Wed, 1 Mar 2023 16:22:22 GMT
- Title: Multi-Armed Bandits with Generalized Temporally-Partitioned Rewards
- Authors: Ronald C. van den Broek, Rik Litjens, Tobias Sagis, Luc Siecker, Nina
Verbeeke, Pratik Gajane
- Abstract summary: In some real-world applications, feedback about a decision is delayed and may arrive via partial rewards that are observed with different delays.
We propose a novel problem formulation called multi-armed bandits with generalized temporally-partitioned rewards.
We derive a lower bound on the performance of any uniformly efficient algorithm for the considered problem.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Decision-making problems of sequential nature, where decisions made in the
past may have an impact on the future, are used to model many practically
important applications. In some real-world applications, feedback about a
decision is delayed and may arrive via partial rewards that are observed with
different delays. Motivated by such scenarios, we propose a novel problem
formulation called multi-armed bandits with generalized temporally-partitioned
rewards. To formalize how feedback about a decision is partitioned across
several time steps, we introduce the $\beta$-spread property. We derive a lower
bound on the performance of any uniformly efficient algorithm for the
considered problem. Moreover, we provide an algorithm called TP-UCB-FR-G and
prove an upper bound on its performance measure. In some scenarios, our upper
bound improves upon the state of the art. We provide experimental results
validating the proposed algorithm and our theoretical results.
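The reward model described above (a pull's reward arriving as partial rewards spread over several subsequent time steps) can be illustrated with a small simulation. The uniform spread weights and the plain UCB1 index below are illustrative assumptions, not the paper's $\beta$-spread definition or its TP-UCB-FR-G algorithm.

```python
import math
import random

def make_spread(tau):
    """Illustrative spread weights: the fraction of an arm's reward observed
    at each of the tau steps after a pull (weights sum to 1). The paper's
    beta-spread property constrains such weights; uniform is an assumption."""
    return [1.0 / tau] * tau

def simulate(means, horizon, tau=3, seed=0):
    """Toy bandit with temporally-partitioned rewards: pulling arm a at time
    t releases its Bernoulli-drawn reward in tau equal partial rewards at
    times t+1..t+tau. A plain UCB1 index runs on the aggregated partials."""
    rng = random.Random(seed)
    k = len(means)
    spread = make_spread(tau)
    pending = []                       # (due_time, arm, partial_reward)
    totals = [0.0] * k                 # partial reward collected per arm
    pulls = [0] * k
    for t in range(horizon):
        # Collect the partial rewards that come due at this step.
        due = [p for p in pending if p[0] == t]
        pending = [p for p in pending if p[0] != t]
        for _, arm, r in due:
            totals[arm] += r
        # Choose an arm from the delayed, partial observations so far.
        if t < k:
            a = t                      # pull each arm once to initialize
        else:
            a = max(range(k), key=lambda i: totals[i] / pulls[i]
                    + math.sqrt(2 * math.log(t + 1) / pulls[i]))
        pulls[a] += 1
        reward = 1.0 if rng.random() < means[a] else 0.0
        for d, w in enumerate(spread, start=1):
            pending.append((t + d, a, w * reward))
    return pulls

pulls = simulate([0.3, 0.7], horizon=2000)
```

Even with every reward split across three delayed partial observations, the index concentrates pulls on the better arm; the paper's contribution is to handle such partitioned feedback with formal regret guarantees rather than this heuristic aggregation.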
Related papers
- Sound Heuristic Search Value Iteration for Undiscounted POMDPs with Reachability Objectives [16.101435842520473]
This paper studies the challenging yet important problem in POMDPs known as the (indefinite-horizon) Maximal Reachability Probability Problem.
Inspired by the success of point-based methods developed for discounted problems, we study their extensions to MRPP.
We present a novel algorithm that leverages the strengths of these techniques for efficient exploration of the belief space.
arXiv Detail & Related papers (2024-06-05T02:33:50Z) - Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs).
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z) - A Reduction-based Framework for Sequential Decision Making with Delayed Feedback [53.79893086002961]
We study delayed feedback in general multi-agent sequential decision making.
We propose a novel reduction-based framework, which turns any multi-batched algorithm for sequential decision making with instantaneous feedback into a sample-efficient algorithm.
arXiv Detail & Related papers (2023-02-03T01:16:09Z) - Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When Partial Feedback Counts [53.579515853222986]
We study a novel bandit setting, namely Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB).
This setting is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite-time span after the pull.
We provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and TP-UCB-EW.
arXiv Detail & Related papers (2022-06-01T15:56:59Z) - Algorithmic Challenges in Ensuring Fairness at the Time of Decision [6.228560624452748]
Algorithmic decision-making in societal contexts can be framed as optimization under bandit feedback.
Recent lawsuits accuse companies that deploy algorithmic pricing practices of price gouging.
We introduce the well-studied fairness notion of envy-freeness within the context of convex optimization.
arXiv Detail & Related papers (2021-03-16T19:06:28Z) - Batched Neural Bandits [107.5072688105936]
BatchNeuralUCB combines neural networks with optimism to address the exploration-exploitation tradeoff.
We prove that BatchNeuralUCB achieves the same regret as the fully sequential version while reducing the number of policy updates considerably.
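The batching idea summarized above can be sketched in a few lines: freeze the arm-selection policy within each batch and recompute it only at batch boundaries, so the number of policy updates drops from the horizon to roughly horizon/batch size. This is a minimal illustration of batching with a plain UCB1 index, not BatchNeuralUCB itself (which uses neural reward estimates).

```python
import math
import random

def batched_ucb(means, horizon, batch_size, seed=0):
    """Batched UCB sketch: the chosen arm is frozen within each batch and
    recomputed only at batch boundaries, so the number of policy updates
    is ceil(horizon / batch_size) instead of horizon."""
    rng = random.Random(seed)
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    updates = 0
    t = 0
    while t < horizon:
        updates += 1                   # one policy update per batch
        if min(counts) == 0:
            a = counts.index(0)        # still initializing: try an unpulled arm
        else:
            a = max(range(k), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t + 1) / counts[i]))
        # Play the frozen choice for the whole batch.
        for _ in range(min(batch_size, horizon - t)):
            counts[a] += 1
            sums[a] += 1.0 if rng.random() < means[a] else 0.0
            t += 1
    return updates

updates = batched_ucb([0.2, 0.8], horizon=1000, batch_size=50)
```

With a horizon of 1000 and batches of 50, the policy is updated only 20 times; the cited paper proves that, with suitable batch schedules, such savings need not degrade the regret.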
arXiv Detail & Related papers (2021-02-25T17:36:44Z) - Stein Variational Model Predictive Control [130.60527864489168]
Decision making under uncertainty is critical to real-world, autonomous systems.
Model Predictive Control (MPC) methods have demonstrated favorable performance in practice, but remain limited when dealing with complex distributions.
We show that this framework leads to successful planning in challenging, nonconvex optimal control problems.
arXiv Detail & Related papers (2020-11-15T22:36:59Z) - Optimizing for the Future in Non-Stationary MDPs [52.373873622008944]
We present a policy gradient algorithm that maximizes a forecast of future performance.
We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques.
arXiv Detail & Related papers (2020-05-17T03:41:19Z) - A Farewell to Arms: Sequential Reward Maximization on a Budget with a Giving Up Option [5.1629297054995265]
We consider a sequential decision-making problem where an agent can take one action at a time and each action has a temporal extent.
We introduce an upper-confidence-based algorithm, WAIT-UCB, for which we establish a logarithmic, problem-dependent regret bound.
arXiv Detail & Related papers (2020-03-06T22:16:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.