Confidence-Budget Matching for Sequential Budgeted Learning
- URL: http://arxiv.org/abs/2102.03400v1
- Date: Fri, 5 Feb 2021 19:56:31 GMT
- Title: Confidence-Budget Matching for Sequential Budgeted Learning
- Authors: Yonathan Efroni, Nadav Merlis, Aadirupa Saha, Shie Mannor
- Abstract summary: We formalize decision-making problems with a querying budget.
We consider multi-armed bandits, linear bandits, and reinforcement learning problems.
We show that CBM-based algorithms perform well in the presence of adversity.
- Score: 69.77435313099366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A core element in decision-making under uncertainty is the feedback on the
quality of the performed actions. However, in many applications, such feedback
is restricted. For example, in recommendation systems, repeatedly asking the
user to provide feedback on the quality of recommendations will annoy them. In
this work, we formalize decision-making problems with a querying budget, where
there is a (possibly time-dependent) hard limit on the number of reward queries
allowed. Specifically, we consider multi-armed bandits, linear bandits, and
reinforcement learning problems. We start by analyzing the performance of
`greedy' algorithms that query a reward whenever they can. We show that in
fully stochastic settings, doing so performs surprisingly well, but in the
presence of any adversity, this might lead to linear regret. To overcome this
issue, we propose the Confidence-Budget Matching (CBM) principle that queries
rewards when the confidence intervals are wider than the inverse square root of
the available budget. We analyze the performance of CBM-based algorithms in
different settings and show that they perform well in the presence of adversity
in the contexts, initial states, and budgets.
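The query rule in the abstract lends itself to a short illustration. The sketch below pairs a standard UCB index with a CBM-style condition: a reward is queried only while budget remains and the pulled arm's confidence radius exceeds the inverse square root of the remaining budget. The function name cbm_ucb, the Hoeffding-style radius, and the pull_reward callback are illustrative assumptions made for this summary, not the paper's exact algorithm or constants.

```python
import math
import random

def cbm_ucb(n_arms, horizon, budget, pull_reward):
    """Minimal sketch of a UCB bandit with a CBM-style query rule (assumed form)."""
    counts = [0] * n_arms       # number of *observed* rewards per arm
    means = [0.0] * n_arms      # empirical mean of the observed rewards
    remaining = budget          # remaining reward-query budget

    for t in range(1, horizon + 1):
        def radius(a):
            # Hoeffding-style confidence radius; infinite until the arm is observed.
            if counts[a] == 0:
                return float("inf")
            return math.sqrt(2.0 * math.log(t + 1) / counts[a])

        # Optimistic arm selection: empirical mean plus confidence radius.
        arm = max(range(n_arms), key=lambda a: means[a] + radius(a))
        reward = pull_reward(arm)  # the action is always taken...

        # ...but the reward is only queried (observed) when the confidence
        # interval is wider than the inverse square root of the remaining budget.
        if remaining > 0 and radius(arm) > 1.0 / math.sqrt(remaining):
            remaining -= 1
            counts[arm] += 1
            means[arm] += (reward - means[arm]) / counts[arm]

    return means, counts, remaining

# Toy usage: three Bernoulli arms, 10,000 rounds, a budget of 500 queries.
probs = [0.3, 0.5, 0.7]
means, counts, left = cbm_ucb(
    3, 10_000, 500, lambda a: 1.0 if random.random() < probs[a] else 0.0
)
```

On such a toy instance the rule spends queries early, while intervals are wide relative to the budget, and stops querying once the estimates are tight, which is the behaviour the abstract contrasts with the always-querying `greedy' baseline.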
Related papers
- Best Arm Identification with Fixed Budget: A Large Deviation Perspective [54.305323903582845]
We present sred, a truly adaptive algorithm that can reject arms at any round based on the observed empirical gaps between the rewards of the various arms.
arXiv Detail & Related papers (2023-12-19T13:17:43Z) - Likelihood Ratio Confidence Sets for Sequential Decision Making [51.66638486226482]
We revisit the likelihood-based inference principle and propose to use likelihood ratios to construct valid confidence sequences.
Our method is especially suitable for problems with well-specified likelihoods.
We show how to provably choose the best sequence of estimators and shed light on connections to online convex optimization; a minimal sketch of a likelihood-ratio confidence sequence appears after this list.
arXiv Detail & Related papers (2023-11-08T00:10:21Z) - Best Arm Identification for Stochastic Rising Bandits [84.55453174601826]
Stochastic Rising Bandits (SRBs) model sequential decision-making problems in which the expected reward of the available options increases every time they are selected.
This paper focuses on the fixed-budget Best Arm Identification (BAI) problem for SRBs.
We propose two algorithms to tackle the above-mentioned setting, namely R-UCBE and R-SR.
arXiv Detail & Related papers (2023-02-15T08:01:37Z) - A Unifying Framework for Online Optimization with Long-Term Constraints [62.35194099438855]
We study online learning problems in which a decision maker has to take a sequence of decisions subject to $m$ long-term constraints.
The goal is to maximize their total reward, while at the same time achieving small cumulative violation across the $T$ rounds.
We present the first best-of-both-worlds-type algorithm for this general class of problems, with no-regret guarantees both in the case in which rewards and constraints are selected according to an unknown model, and in the case in which they are selected at each round by an adversary.
arXiv Detail & Related papers (2022-09-15T16:59:19Z) - Risk-aware linear bandits with convex loss [0.0]
We propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits.
This approach requires solving a convex problem at each round of the algorithm, a requirement that can be relaxed by allowing approximate solutions obtained via online gradient descent.
arXiv Detail & Related papers (2022-09-15T09:09:53Z) - Dare not to Ask: Problem-Dependent Guarantees for Budgeted Bandits [66.02233330016435]
We provide problem-dependent guarantees on both the regret and the amount of feedback queried.
We present a new algorithm called BuFALU for which we derive problem-dependent regret and cumulative feedback bounds.
arXiv Detail & Related papers (2021-10-12T03:24:57Z) - Algorithmic Challenges in Ensuring Fairness at the Time of Decision [6.228560624452748]
Algorithmic decision-making in societal contexts can be framed as optimization under bandit feedback.
Recent lawsuits accuse companies that deploy algorithmic pricing practices of price gouging.
We introduce the well-studied fairness notion of envy-freeness within the context of convex optimization.
arXiv Detail & Related papers (2021-03-16T19:06:28Z) - A One-Size-Fits-All Solution to Conservative Bandit Problems [32.907883501286]
We study a family of conservative bandit problems (CBPs) with sample-path reward constraints.
We propose a One-Size-Fits-All solution to CBPs and present its applications to three encompassed problems.
arXiv Detail & Related papers (2020-12-14T08:50:23Z) - Reward Biased Maximum Likelihood Estimation for Reinforcement Learning [13.820705458648233]
The Reward-Biased Maximum Likelihood Estimate (RBMLE) was originally proposed for the adaptive control of Markov chains.
We show that it has a regret of $\mathcal{O}(\log T)$ over a time horizon of $T$ steps, similar to state-of-the-art algorithms.
arXiv Detail & Related papers (2020-11-16T06:09:56Z) - Constrained Upper Confidence Reinforcement Learning [12.919486518128734]
This paper extends upper confidence reinforcement learning for settings in which the reward function and the constraints, described by cost functions, are unknown a priori.
We show that the algorithm C-UCRL achieves sub-linear regret with respect to the reward while satisfying the constraints with probability $1-\delta$, even while learning.
arXiv Detail & Related papers (2020-01-26T00:23:02Z)
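As a complement to the Likelihood Ratio Confidence Sets entry above, the following sketch shows one standard way likelihood ratios yield an anytime-valid confidence sequence, here for a Bernoulli mean via a prequential (plug-in) likelihood and Ville's inequality. This illustrates the general principle under assumed choices (Bernoulli model, Laplace-smoothed plug-in estimate, grid over candidate means); it is not the cited paper's construction.

```python
import numpy as np

def lr_confidence_sequence(xs, alpha=0.05, grid=None):
    """Anytime-valid confidence sequence for a Bernoulli mean from likelihood ratios.

    The numerator is a prequential likelihood built from plug-in estimates of the
    mean; under the true mean the resulting ratio is a nonnegative martingale with
    expectation 1, so Ville's inequality bounds the probability it ever reaches
    1/alpha. Keeping every candidate whose ratio stays below 1/alpha therefore
    gives a level (1 - alpha) confidence sequence.
    """
    if grid is None:
        grid = np.linspace(0.001, 0.999, 999)
    log_num = 0.0                    # log prequential (plug-in) likelihood
    log_den = np.zeros_like(grid)    # log likelihood under each candidate mean
    count, total = 1.0, 2.0          # Laplace-smoothed running estimate of the mean
    sets = []
    for x in xs:
        p_hat = count / total        # estimate that uses only *past* observations
        log_num += x * np.log(p_hat) + (1 - x) * np.log(1.0 - p_hat)
        log_den += x * np.log(grid) + (1 - x) * np.log(1.0 - grid)
        count += x
        total += 1.0
        keep = (log_num - log_den) < np.log(1.0 / alpha)
        sets.append(grid[keep])
    return sets

# Toy usage: 200 Bernoulli(0.6) observations; the final set concentrates around 0.6.
rng = np.random.default_rng(0)
cs = lr_confidence_sequence(rng.binomial(1, 0.6, size=200))
print(cs[-1].min(), cs[-1].max())
```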
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.