Multi-Armed Bandits with Generalized Temporally-Partitioned Rewards
- URL: http://arxiv.org/abs/2303.00620v1
- Date: Wed, 1 Mar 2023 16:22:22 GMT
- Title: Multi-Armed Bandits with Generalized Temporally-Partitioned Rewards
- Authors: Ronald C. van den Broek, Rik Litjens, Tobias Sagis, Luc Siecker, Nina
Verbeeke, Pratik Gajane
- Abstract summary: In some real-world applications, feedback about a decision is delayed and may arrive via partial rewards that are observed with different delays.
We propose a novel problem formulation called multi-armed bandits with generalized temporally-partitioned rewards.
We derive a lower bound on the performance of any uniformly efficient algorithm for the considered problem.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Decision-making problems of sequential nature, where decisions made in the
past may have an impact on the future, are used to model many practically
important applications. In some real-world applications, feedback about a
decision is delayed and may arrive via partial rewards that are observed with
different delays. Motivated by such scenarios, we propose a novel problem
formulation called multi-armed bandits with generalized temporally-partitioned
rewards. To formalize how feedback about a decision is partitioned across
several time steps, we introduce the $\beta$-spread property. We derive a lower
bound on the performance of any uniformly efficient algorithm for the
considered problem. Moreover, we provide an algorithm called TP-UCB-FR-G and
prove an upper bound on its performance measure. In some scenarios, our upper
bound improves upon the state of the art. We provide experimental results
validating the proposed algorithm and our theoretical results.
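The reward model described above (a pull's reward arriving as partial rewards spread over several subsequent time steps) can be illustrated with a small simulation. The uniform spread weights and the plain UCB1 index below are illustrative assumptions, not the paper's $\beta$-spread definition or its TP-UCB-FR-G algorithm.

```python
import math
import random

def make_spread(tau):
    """Illustrative spread weights: the fraction of an arm's reward observed
    at each of the tau steps after a pull (weights sum to 1). The paper's
    beta-spread property constrains such weights; uniform is an assumption."""
    return [1.0 / tau] * tau

def simulate(means, horizon, tau=3, seed=0):
    """Toy bandit with temporally-partitioned rewards: pulling arm a at time
    t releases its Bernoulli-drawn reward in tau equal partial rewards at
    times t+1..t+tau. A plain UCB1 index runs on the aggregated partials."""
    rng = random.Random(seed)
    k = len(means)
    spread = make_spread(tau)
    pending = []                       # (due_time, arm, partial_reward)
    totals = [0.0] * k                 # partial reward collected per arm
    pulls = [0] * k
    for t in range(horizon):
        # Collect the partial rewards that come due at this step.
        due = [p for p in pending if p[0] == t]
        pending = [p for p in pending if p[0] != t]
        for _, arm, r in due:
            totals[arm] += r
        # Choose an arm from the delayed, partial observations so far.
        if t < k:
            a = t                      # pull each arm once to initialize
        else:
            a = max(range(k), key=lambda i: totals[i] / pulls[i]
                    + math.sqrt(2 * math.log(t + 1) / pulls[i]))
        pulls[a] += 1
        reward = 1.0 if rng.random() < means[a] else 0.0
        for d, w in enumerate(spread, start=1):
            pending.append((t + d, a, w * reward))
    return pulls

pulls = simulate([0.3, 0.7], horizon=2000)
```

Even with every reward split across three delayed partial observations, the index concentrates pulls on the better arm; the paper's contribution is to handle such partitioned feedback with formal regret guarantees rather than this heuristic aggregation.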
Related papers
- Sound Heuristic Search Value Iteration for Undiscounted POMDPs with Reachability Objectives [16.101435842520473]
This paper studies the challenging yet important problem in POMDPs known as the (indefinite-horizon) Maximal Reachability Probability Problem.
Inspired by the success of point-based methods developed for discounted problems, we study their extensions to MRPP.
We present a novel algorithm that leverages the strengths of these techniques for efficient exploration of the belief space.
arXiv Detail & Related papers (2024-06-05T02:33:50Z) - Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs).
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z) - A Reduction-based Framework for Sequential Decision Making with Delayed Feedback [53.79893086002961]
We study delayed feedback in general multi-agent sequential decision making.
We propose a novel reduction-based framework, which turns any multi-batched algorithm for sequential decision making with instantaneous feedback into a sample-efficient algorithm.
arXiv Detail & Related papers (2023-02-03T01:16:09Z) - Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When Partial Feedback Counts [53.579515853222986]
We study a novel bandit setting, namely Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB).
This setting is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite-time span after the pull.
We provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and TP-UCB-EW.
arXiv Detail & Related papers (2022-06-01T15:56:59Z) - Algorithmic Challenges in Ensuring Fairness at the Time of Decision [6.228560624452748]
Algorithmic decision-making in societal contexts can be framed as optimization under bandit feedback.
Recent lawsuits accuse companies that deploy algorithmic pricing practices of price gouging.
We introduce the well-studied fairness notion of envy-freeness within the context of convex optimization.
arXiv Detail & Related papers (2021-03-16T19:06:28Z) - Batched Neural Bandits [107.5072688105936]
BatchNeuralUCB combines neural networks with optimism to address the exploration-exploitation tradeoff.
We prove that BatchNeuralUCB achieves the same regret as the fully sequential version while reducing the number of policy updates considerably.
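The batching idea summarized above can be sketched in a few lines: freeze the arm-selection policy within each batch and recompute it only at batch boundaries, so the number of policy updates drops from the horizon to roughly horizon/batch size. This is a minimal illustration of batching with a plain UCB1 index, not BatchNeuralUCB itself (which uses neural reward estimates).

```python
import math
import random

def batched_ucb(means, horizon, batch_size, seed=0):
    """Batched UCB sketch: the chosen arm is frozen within each batch and
    recomputed only at batch boundaries, so the number of policy updates
    is ceil(horizon / batch_size) instead of horizon."""
    rng = random.Random(seed)
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    updates = 0
    t = 0
    while t < horizon:
        updates += 1                   # one policy update per batch
        if min(counts) == 0:
            a = counts.index(0)        # still initializing: try an unpulled arm
        else:
            a = max(range(k), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t + 1) / counts[i]))
        # Play the frozen choice for the whole batch.
        for _ in range(min(batch_size, horizon - t)):
            counts[a] += 1
            sums[a] += 1.0 if rng.random() < means[a] else 0.0
            t += 1
    return updates

updates = batched_ucb([0.2, 0.8], horizon=1000, batch_size=50)
```

With a horizon of 1000 and batches of 50, the policy is updated only 20 times; the cited paper proves that, with suitable batch schedules, such savings need not degrade the regret.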
arXiv Detail & Related papers (2021-02-25T17:36:44Z) - Stein Variational Model Predictive Control [130.60527864489168]
Decision making under uncertainty is critical to real-world, autonomous systems.
Model Predictive Control (MPC) methods have demonstrated favorable performance in practice, but remain limited when dealing with complex distributions.
We show that this framework leads to successful planning in challenging, nonconvex optimal control problems.
arXiv Detail & Related papers (2020-11-15T22:36:59Z) - Optimizing for the Future in Non-Stationary MDPs [52.373873622008944]
We present a policy gradient algorithm that maximizes a forecast of future performance.
We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques.
arXiv Detail & Related papers (2020-05-17T03:41:19Z) - A Farewell to Arms: Sequential Reward Maximization on a Budget with a Giving Up Option [5.1629297054995265]
We consider a sequential decision-making problem where an agent can take one action at a time and each action has a temporal extent.
We introduce an upper-confidence-based algorithm, WAIT-UCB, for which we establish a logarithmic, problem-dependent regret bound.
arXiv Detail & Related papers (2020-03-06T22:16:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.