Multi-armed Bandits with Cost Subsidy
- URL: http://arxiv.org/abs/2011.01488v3
- Date: Mon, 15 Mar 2021 11:49:38 GMT
- Title: Multi-armed Bandits with Cost Subsidy
- Authors: Deeksha Sinha, Karthik Abinav Sankararaman, Abbas Kazerouni, Vashist Avadhanula
- Abstract summary: We present two applications, the intelligent SMS routing problem and the ad audience optimization problem.
We show that naive generalizations of existing MAB algorithms do not perform well for this problem.
We also present a simple variant of explore-then-commit and establish near-optimal regret bounds for this algorithm.
- Score: 1.6631602844999724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we consider a novel variant of the multi-armed bandit (MAB)
problem, MAB with cost subsidy, which models many real-life applications where
the learning agent has to pay to select an arm and is concerned about
optimizing cumulative costs and rewards. We present two applications,
the intelligent SMS routing problem and the ad audience optimization problem faced by
several businesses (especially online platforms), and show how our problem
uniquely captures key features of these applications. We show that naive
generalizations of existing MAB algorithms like Upper Confidence Bound and
Thompson Sampling do not perform well for this problem. We then establish a
fundamental lower bound on the performance of any online learning algorithm for
this problem, highlighting the hardness of our problem in comparison to the
classical MAB problem. We also present a simple variant of explore-then-commit
and establish near-optimal regret bounds for this algorithm. Lastly, we perform
extensive numerical simulations to understand the behavior of a suite of
algorithms for various instances and recommend a practical guide to employ
different algorithms.
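To make the explore-then-commit variant concrete, here is a minimal sketch in Python. It assumes a subsidy factor `alpha` (any arm whose estimated reward is within a (1 - alpha) factor of the best arm's estimated reward is acceptable), a fixed per-arm exploration budget, and a `pull(arm)` callable returning a (reward, cost) pair; the name `cs_etc` and all defaults are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def cs_etc(pull, n_arms, horizon, alpha=0.1, explore_per_arm=100, rng=None):
    """Hedged sketch of an explore-then-commit rule for MAB with cost subsidy.

    `pull(arm)` is assumed to return a (reward, cost) pair; `alpha` is the
    subsidy factor: any arm whose mean reward is within (1 - alpha) of the
    best arm's mean reward is considered "good enough", and among those we
    prefer the cheapest. All names and defaults here are illustrative.
    """
    rng = rng or np.random.default_rng()
    reward_sum = np.zeros(n_arms)
    cost_sum = np.zeros(n_arms)
    pulls = np.zeros(n_arms)
    history = []

    # Exploration phase: sample every arm the same fixed number of times.
    for _ in range(explore_per_arm):
        for arm in range(n_arms):
            r, c = pull(arm)
            reward_sum[arm] += r
            cost_sum[arm] += c
            pulls[arm] += 1
            history.append((arm, r, c))

    mean_reward = reward_sum / pulls
    mean_cost = cost_sum / pulls

    # Commit phase: among arms whose estimated reward clears the subsidized
    # threshold, pick the one with the lowest estimated cost.
    threshold = (1 - alpha) * mean_reward.max()
    feasible = np.flatnonzero(mean_reward >= threshold)
    committed = int(feasible[np.argmin(mean_cost[feasible])])

    for _ in range(horizon - len(history)):
        r, c = pull(committed)
        history.append((committed, r, c))
    return committed, history
```

For instance, with Bernoulli rewards and fixed known costs, `pull` could be `lambda a: (float(rng.random() < p[a]), cost[a])` for hypothetical vectors `p` and `cost`.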
Related papers
- Improving Thompson Sampling via Information Relaxation for Budgeted Multi-armed Bandits [1.4732811715354452]
We consider a $K$-armed bandit problem in which each arm consumes a different amount of resources when selected.
We propose a series of algorithms that are randomized like Thompson Sampling but more carefully optimize their decisions with respect to the budget constraint.
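As a rough point of reference for this budgeted setting (and not the paper's information-relaxation method), a standard Thompson-Sampling-style baseline samples success probabilities from Beta posteriors and plays the arm with the best sampled reward-to-cost ratio until the budget runs out; the known fixed costs and Beta(1, 1) priors below are assumptions for illustration.

```python
import numpy as np

def budgeted_ts_baseline(pull, costs, budget, rng=None):
    """Thompson-Sampling-style baseline for a budgeted K-armed bandit (sketch).

    `pull(arm)` is assumed to return a reward in {0, 1}; `costs[arm]` is the
    known, fixed resource consumption of the arm. This is only the naive
    ratio-based baseline, not the paper's proposed algorithms.
    """
    rng = rng or np.random.default_rng()
    k = len(costs)
    successes = np.ones(k)   # Beta(1, 1) priors
    failures = np.ones(k)

    total_reward, spent = 0.0, 0.0
    while spent + min(costs) <= budget:
        theta = rng.beta(successes, failures)            # posterior samples
        arm = int(np.argmax(theta / np.asarray(costs)))  # best sampled ratio
        if spent + costs[arm] > budget:
            break
        r = pull(arm)
        successes[arm] += r
        failures[arm] += 1 - r
        total_reward += r
        spent += costs[arm]
    return total_reward
```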
arXiv Detail & Related papers (2024-08-28T04:56:06Z)
- A Federated Online Restless Bandit Framework for Cooperative Resource Allocation [23.698976872351576]
We study the cooperative resource allocation problem with unknown system dynamics of MRPs.
We put forth a Federated Thompson-enabled Whittle Index (FedTSWI) algorithm to solve this multi-agent online RMAB problem.
Numerical results show that the proposed algorithm achieves a fast convergence rate of $\mathcal{O}(\sqrt{T\log(T)})$ and better performance compared with baselines.
arXiv Detail & Related papers (2024-06-12T08:34:53Z)
- Best Arm Identification with Fixed Budget: A Large Deviation Perspective [54.305323903582845]
We present sred, a truly adaptive algorithm that can reject arms in any round based on the observed empirical gaps between the rewards of various arms.
arXiv Detail & Related papers (2023-12-19T13:17:43Z)
- Optimizing Solution-Samplers for Combinatorial Problems: The Landscape of Policy-Gradient Methods [52.0617030129699]
We introduce a novel theoretical framework for analyzing the effectiveness of DeepMatching Networks and Reinforcement Learning methods.
Our main contribution holds for a broad class of problems including Max- and Min-Cut, Max-$k$-Bipartite-Bi, Maximum-Weight-Bipartite-Bi, and the Traveling Salesman Problem.
As a byproduct of our analysis we introduce a novel regularization process over vanilla descent and provide theoretical and experimental evidence that it helps address vanishing-gradient issues and escape bad stationary points.
arXiv Detail & Related papers (2023-10-08T23:39:38Z)
- Stochastic Rising Bandits [40.32303434592863]
We study a particular case of the rested and restless bandits in which the arms' expected payoff is monotonically non-decreasing.
This characteristic allows designing specifically crafted algorithms that exploit the regularity of the payoffs to provide tight regret bounds.
We empirically compare our algorithms with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem for a real-world dataset.
arXiv Detail & Related papers (2022-12-07T17:30:45Z)
- Softmax with Regularization: Better Value Estimation in Multi-Agent Reinforcement Learning [72.28520951105207]
Overestimation in $Q$-learning is an important problem that has been extensively studied in single-agent reinforcement learning.
We propose a novel regularization-based update scheme that penalizes large joint action-values deviating from a baseline.
We show that our method provides a consistent performance improvement on a set of challenging StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2021-03-22T14:18:39Z)
- Upper Confidence Bounds for Combining Stochastic Bandits [52.10197476419621]
We provide a simple method to combine bandit algorithms.
Our approach is based on a "meta-UCB" procedure that treats each of $N$ individual bandit algorithms as arms in a higher-level $N$-armed bandit problem.
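The meta-UCB idea can be sketched in a few lines: treat each base bandit algorithm as an arm of an outer UCB1-style bandit and delegate the actual arm pull to whichever base algorithm the outer bandit selects. The `select()` interface and the UCB1 bonus below are illustrative assumptions, not the paper's exact construction.

```python
import math

def meta_ucb(base_algorithms, horizon):
    """Sketch of a "meta-UCB" combiner over N base bandit algorithms.

    Each element of `base_algorithms` is assumed to expose `select()`, which
    chooses an arm internally and returns the observed reward in [0, 1], so
    the combiner never needs to know the underlying arms.
    """
    n = len(base_algorithms)
    counts = [0] * n
    reward_sums = [0.0] * n

    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= n:                       # play each base algorithm once
            i = t - 1
        else:                            # then follow a UCB1-style index
            i = max(
                range(n),
                key=lambda j: reward_sums[j] / counts[j]
                + math.sqrt(2 * math.log(t) / counts[j]),
            )
        r = base_algorithms[i].select()  # delegate the actual arm choice
        counts[i] += 1
        reward_sums[i] += r
        total_reward += r
    return total_reward
```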
arXiv Detail & Related papers (2020-12-24T05:36:29Z)
- Constrained Multiagent Rollout and Multidimensional Assignment with the Auction Algorithm [0.0]
We consider an extension of the rollout algorithm that applies to constrained deterministic dynamic programming.
We show that if the base heuristic produces a feasible solution, the rollout algorithm has a cost improvement property.
We show that the cost improvement property is maintained with an alternative implementation that has greatly reduced computational requirements.
arXiv Detail & Related papers (2020-02-18T07:09:06Z)
- GACEM: Generalized Autoregressive Cross Entropy Method for Multi-Modal Black Box Constraint Satisfaction [69.94831587339539]
We present a modified Cross-Entropy Method (CEM) that uses a masked auto-regressive neural network for modeling uniform distributions over the solution space.
Our algorithm is able to express complicated solution spaces, thus allowing it to track a variety of different solution regions.
arXiv Detail & Related papers (2020-02-17T20:21:20Z)
- Thompson Sampling Algorithms for Mean-Variance Bandits [97.43678751629189]
We develop Thompson Sampling-style algorithms for mean-variance MAB.
We also provide comprehensive regret analyses for Gaussian and Bernoulli bandits.
Our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
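To illustrate the mean-variance objective in a Thompson-Sampling style (a sketch under assumptions, not the paper's precise algorithms), one can sample each Bernoulli arm's success probability from its Beta posterior, score it by the sampled mean minus a risk weight `rho` times the implied variance p(1 - p), and play the highest-scoring arm.

```python
import numpy as np

def mv_ts_bernoulli(pull, n_arms, horizon, rho=1.0, rng=None):
    """Sketch of a Thompson-Sampling-style rule for mean-variance Bernoulli bandits.

    For a Bernoulli arm with sampled success probability p, the mean is p and
    the variance is p * (1 - p); the arm is scored by mean minus a risk weight
    `rho` times variance. The Beta(1, 1) priors and this exact scoring rule
    are illustrative assumptions.
    """
    rng = rng or np.random.default_rng()
    successes = np.ones(n_arms)
    failures = np.ones(n_arms)

    total = 0.0
    for _ in range(horizon):
        p = rng.beta(successes, failures)     # posterior samples
        score = p - rho * p * (1.0 - p)       # mean-variance criterion
        arm = int(np.argmax(score))
        r = pull(arm)                         # assumed to return 0 or 1
        successes[arm] += r
        failures[arm] += 1 - r
        total += r
    return total
```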
arXiv Detail & Related papers (2020-02-01T15:33:50Z)