Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits
with Linear Payoff Functions
- URL: http://arxiv.org/abs/2101.07957v2
- Date: Sat, 27 Feb 2021 11:22:30 GMT
- Title: Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits
with Linear Payoff Functions
- Authors: Kei Takemura, Shinji Ito, Daisuke Hatano, Hanna Sumita, Takuro
Fukunaga, Naonori Kakimura, Ken-ichi Kawarabayashi
- Abstract summary: We show that the C${}^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C${}^2$UCB algorithm.
- Score: 53.77572276969548
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The contextual combinatorial semi-bandit problem with linear payoff functions
is a decision-making problem in which, in each round, a learner chooses a set of arms,
each associated with a feature vector, under given constraints so as to maximize the sum
of the rewards of the chosen arms. Several existing algorithms have regret bounds that are
optimal with respect to the number of rounds $T$. However, there is a gap of
$\tilde{O}(\max(\sqrt{d}, \sqrt{k}))$ between the current best upper and lower
bounds, where $d$ is the dimension of the feature vectors, $k$ is the number of
arms chosen in a round, and $\tilde{O}(\cdot)$ ignores logarithmic
factors. The dependence on $k$ and $d$ is of practical importance because $k$
may be larger than $T$ in real-world applications such as recommender systems.
In this paper, we fill the gap by improving the upper and lower bounds. More
precisely, we show that the C${}^2$UCB algorithm proposed by Qin, Chen, and Zhu
(2014) has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the
partition matroid constraints. For general constraints, we propose an algorithm
that modifies the reward estimates of arms in the C${}^2$UCB algorithm and
demonstrate that it enjoys the optimal regret bound for a more general problem
that can take into account other objectives simultaneously. We also show that
our technique would be applicable to related problems. Numerical experiments
support our theoretical results and considerations.
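To make the algorithmic idea concrete, below is a minimal sketch of a C${}^2$UCB-style learner for the top-$k$ special case of partition matroid constraints. It is an illustration under assumptions, not the authors' implementation: the names (`C2UCBSketch`, `alpha`, `lam`) are invented here, the exploration weight is a fixed constant rather than the round-dependent weight used in the paper's analysis, and plain top-$k$ selection stands in for a general constraint oracle.

```python
# Minimal sketch of a C^2UCB-style learner for the top-k special case of
# partition matroid constraints. Illustrative only: class/parameter names
# are invented here, and alpha is a fixed constant rather than the
# round-dependent exploration weight used in the paper's analysis.
import numpy as np

class C2UCBSketch:
    def __init__(self, d, k, alpha=1.0, lam=1.0):
        self.d, self.k, self.alpha = d, k, alpha
        self.V = lam * np.eye(d)   # regularized Gram matrix of observed features
        self.b = np.zeros(d)       # sum of reward-weighted observed features

    def select(self, X):
        """X: (n_arms, d) feature matrix for this round.
        Returns indices of the k arms with the highest UCB scores."""
        V_inv = np.linalg.inv(self.V)
        theta_hat = V_inv @ self.b                        # ridge estimate of theta
        width = np.sqrt(np.einsum('ij,jk,ik->i', X, V_inv, X))
        ucb = X @ theta_hat + self.alpha * width          # optimistic arm scores
        return np.argsort(ucb)[-self.k:]                  # top-k oracle

    def update(self, X, chosen, rewards):
        """Semi-bandit feedback: a reward is observed for every chosen arm."""
        for i, r in zip(chosen, rewards):
            self.V += np.outer(X[i], X[i])
            self.b += r * X[i]

# Toy usage on synthetic linear rewards.
rng = np.random.default_rng(0)
d, k, n_arms = 5, 3, 20
theta = rng.normal(size=d)
agent = C2UCBSketch(d, k)
for t in range(100):
    X = rng.normal(size=(n_arms, d))
    S = agent.select(X)
    agent.update(X, S, X[S] @ theta + 0.1 * rng.normal(size=k))
```

For general constraints, the UCB scores would instead be passed to a problem-specific (approximation) oracle, and the paper's proposed modification of the reward estimates would replace the plain UCB scores above.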
Related papers
- LC-Tsallis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits [38.41164102066483]
This study considers the linear contextual bandit problem with independent and identically distributed contexts.
Our proposed algorithm is based on Follow-The-Regularized-Leader with the Tsallis entropy and is referred to as the $\alpha$-Linear-Contextual (LC)-Tsallis-INF.
arXiv Detail & Related papers (2024-03-05T18:59:47Z) - Combinatorial Bandits for Maximum Value Reward Function under Max
Value-Index Feedback [9.771002043127728]
We consider a multi-armed bandit problem for maximum value reward function under maximum value and index feedback.
We propose an algorithm and provide a regret bound for problem instances with arm outcomes according to arbitrary distributions with finite supports.
Our algorithm achieves an $O((k/\Delta)\log(T))$ distribution-dependent and a $\tilde{O}(\sqrt{T})$ distribution-independent regret.
arXiv Detail & Related papers (2023-05-25T14:02:12Z) - Tight Bounds for $\gamma$-Regret via the Decision-Estimation Coefficient [88.86699022151598]
We give a statistical characterization of the $\gamma$-regret for arbitrary structured bandit problems.
The $\gamma$-regret emerges in structured bandit problems over a function class $\mathcal{F}$ where finding an exact optimum of $f \in \mathcal{F}$ is intractable.
arXiv Detail & Related papers (2023-03-06T17:54:33Z) - Bandits with many optimal arms [68.17472536610859]
We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms.
We characterize the optimal learning rates both in the cumulative regret setting, and in the best-arm identification setting.
arXiv Detail & Related papers (2021-03-23T11:02:31Z) - Combinatorial Bandits without Total Order for Arms [52.93972547896022]
We present a reward model that captures set-dependent reward distribution and assumes no total order for arms.
We develop a novel regret analysis and show an $O\left(\frac{k^2 n \log T}{\epsilon}\right)$ gap-dependent regret bound as well as an $O\left(k^2\sqrt{n T \log T}\right)$ gap-independent regret bound.
arXiv Detail & Related papers (2021-03-03T23:08:59Z) - An Efficient Pessimistic-Optimistic Algorithm for Constrained Linear
Bandits [16.103685478563072]
This paper considers linear bandits with general constraints.
The objective is to maximize the expected cumulative reward over horizon $T$ subject to a set of constraints in each round.
We propose a pessimistic-optimistic algorithm for this problem, which is efficient in two aspects.
arXiv Detail & Related papers (2021-02-10T07:30:37Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Maximizing Determinants under Matroid Constraints [69.25768526213689]
We study the problem of finding a basis $S$ of a matroid $M$ such that $\det\left(\sum_{i \in S} v_i v_i^\top\right)$ is maximized.
This problem appears in a diverse set of areas such as experimental design, fair allocation of goods, network design, and machine learning.
arXiv Detail & Related papers (2020-04-16T19:16:38Z)
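As a toy illustration of the determinant objective in the last entry, the sketch below greedily picks $k$ vectors to maximize $\det(\sum_{i \in S} v_i v_i^\top)$. It is a heuristic under simplifying assumptions, not the cited paper's algorithm: a uniform matroid ($|S| \le k$) stands in for a general matroid, and `greedy_det_max` and `eps` are names invented here.

```python
# Toy greedy heuristic for the determinant objective above. Not the cited
# paper's algorithm: a uniform matroid (|S| <= k) stands in for a general
# matroid, and greedy selection here is a heuristic illustration only.
import numpy as np

def greedy_det_max(V, k, eps=1e-9):
    """V: (n, d) array whose rows are the vectors v_i; returns a size-k index set."""
    n, d = V.shape
    S, A = [], eps * np.eye(d)   # eps*I keeps early log-determinants finite
    for _ in range(k):
        # Pick the vector giving the largest log det(A + v v^T).
        logdets = [np.linalg.slogdet(A + np.outer(v, v))[1] if i not in S
                   else -np.inf for i, v in enumerate(V)]
        best = int(np.argmax(logdets))
        S.append(best)
        A += np.outer(V[best], V[best])
    return S
```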