Combinatorial Semi-Bandit in the Non-Stationary Environment
- URL: http://arxiv.org/abs/2002.03580v2
- Date: Sat, 19 Jun 2021 06:27:35 GMT
- Title: Combinatorial Semi-Bandit in the Non-Stationary Environment
- Authors: Wei Chen, Liwei Wang, Haoyu Zhao, Kai Zheng
- Abstract summary: We investigate the non-stationary combinatorial semi-bandit problem, both in the switching case and in the dynamic case.
By employing another technique, our algorithm no longer needs to know the parameters $\mathcal S$ or $\mathcal V$, but the regret bounds could become suboptimal.
- Score: 27.394284055156795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate the non-stationary combinatorial semi-bandit
problem, both in the switching case and in the dynamic case. In the general
case where (a) the reward function is non-linear, (b) arms may be
probabilistically triggered, and (c) only an approximate offline oracle exists
\cite{wang2017improving}, our algorithm achieves
$\tilde{\mathcal{O}}(\sqrt{\mathcal{S} T})$ distribution-dependent regret in
the switching case, and $\tilde{\mathcal{O}}(\mathcal{V}^{1/3}T^{2/3})$ in the
dynamic case, where $\mathcal S$ is the number of switchings and $\mathcal V$
is the total amount of ``distribution changes''. The regret bounds in both
scenarios are nearly optimal, but our algorithm needs to know the parameter
$\mathcal S$ or $\mathcal V$ in advance.
We further show that by employing another technique, our algorithm no longer
needs to know the parameters $\mathcal S$ or $\mathcal V$, but the regret bounds
could become suboptimal.
In a special case where the reward function is linear and we have an exact
oracle, we design a parameter-free algorithm that achieves nearly optimal
regret both in the switching case and in the dynamic case without knowing the
parameters in advance.
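
To make the confidence-bound machinery concrete, here is a minimal sliding-window sketch of a CUCB-style learner in Python. It illustrates the generic sliding-window approach to switching environments, not the authors' algorithm: it assumes Bernoulli base arms, a linear reward so that the exact offline oracle reduces to top-$k$ selection, and a caller-supplied window length; `sample_reward` and the confidence-radius constant are hypothetical.

```python
import math
import random
from collections import deque

def sw_cucb(sample_reward, n_arms, k, horizon, window):
    """Sliding-window CUCB-style learner (illustrative sketch only).

    sample_reward(t, arm) -> Bernoulli reward of `arm` at round t.
    Each round plays a super-arm of k base arms; with a linear
    reward, the exact offline oracle is just top-k selection.
    """
    history = deque()  # (round, arm, reward) triples inside the window
    total = 0.0
    for t in range(1, horizon + 1):
        # Forget observations that have fallen out of the window.
        while history and history[0][0] <= t - window:
            history.popleft()
        counts, sums = [0] * n_arms, [0.0] * n_arms
        for _, arm, r in history:
            counts[arm] += 1
            sums[arm] += r
        # Optimistic index: empirical mean + confidence radius.
        ucb = [float("inf")] * n_arms  # unplayed arms get explored first
        for i in range(n_arms):
            if counts[i] > 0:
                radius = math.sqrt(1.5 * math.log(min(t, window)) / counts[i])
                ucb[i] = sums[i] / counts[i] + radius
        # Exact top-k oracle for a linear reward function.
        super_arm = sorted(range(n_arms), key=lambda i: -ucb[i])[:k]
        for arm in super_arm:
            r = sample_reward(t, arm)
            total += r
            history.append((t, arm, r))
    return total

# Example: 10 Bernoulli arms whose means flip halfway through.
means = [[0.1 + 0.08 * i for i in range(10)],
         [0.9 - 0.08 * i for i in range(10)]]
total = sw_cucb(lambda t, a: float(random.random() < means[t > 500][a]),
                n_arms=10, k=3, horizon=1000, window=200)
```

In this style of analysis the window is tuned from the horizon and the number of switchings, e.g. on the order of $\sqrt{T/\mathcal S}$, which is what produces $\tilde{\mathcal{O}}(\sqrt{\mathcal{S} T})$-type bounds and is precisely why $\mathcal S$ must be known in advance absent a parameter-free technique.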
Related papers
- Partially Unitary Learning [0.0]
An optimal mapping between Hilbert spaces $IN$ of $\left|\psi\right\rangle$ and $OUT$ of $\left|\phi\right\rangle$ is presented.
An iterative algorithm for finding the global maximum of this optimization problem is developed.
arXiv Detail & Related papers (2024-05-16T17:13:55Z)
- Memory-Constrained Algorithms for Convex Optimization via Recursive Cutting-Planes [23.94542304111204]
First class of algorithms that provides a positive trade-off between gradient descent and cutting-plane methods in any regime with $\epsilon \leq 1/\sqrt{d}$.
In the regime $\epsilon \leq d^{-\Omega(d)}$, our algorithm with $p=d$ achieves the information-theoretic optimal memory usage and improves the oracle-complexity of gradient descent.
arXiv Detail & Related papers (2023-06-16T17:00:51Z)
- Contextual Recommendations and Low-Regret Cutting-Plane Algorithms [49.91214213074933]
We consider the following variant of contextual linear bandits motivated by routing applications in navigational engines and recommendation systems.
We design novel cutting-plane algorithms with low "regret" -- the total distance between the true point $w^*$ and the hyperplanes the separation oracle returns.
arXiv Detail & Related papers (2021-06-09T05:39:05Z)
- Linear Bandits on Uniformly Convex Sets [88.3673525964507]
Linear bandit algorithms yield $\tilde{\mathcal{O}}(n\sqrt{T})$ pseudo-regret bounds on compact convex action sets.
Two types of structural assumptions lead to better pseudo-regret bounds.
arXiv Detail & Related papers (2021-03-10T07:33:03Z)
- Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e., the learner has access only to a zeroth-order oracle) where cost/reward functions admit a "pseudo-1d" structure.
We show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ for the regret of any algorithm, where $T$ is the number of rounds.
We propose a new algorithm sbcalg that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z)
- Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$^2$UCB algorithm.
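
As a rough illustration of the estimates being modified, the following sketches the standard C$^2$UCB-style optimistic score $\hat\theta^\top x_i + \alpha\|x_i\|_{V^{-1}}$ with top-$k$ selection. It assumes $d$-dimensional arm features and ridge-regularized least squares; it is not the modification the paper proposes, and the feedback model below is a placeholder.

```python
import numpy as np

def c2ucb_scores(X, V, b, alpha):
    """Optimistic scores: theta_hat^T x_i + alpha * ||x_i||_{V^{-1}}.

    X: (n_arms, d) arm features, V: (d, d) ridge Gram matrix,
    b: (d,) accumulated reward-weighted features, alpha: width.
    """
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b
    widths = np.sqrt(np.einsum("ij,jk,ik->i", X, V_inv, X))
    return X @ theta_hat + alpha * widths

# One round: play the k highest-scoring arms (a partition-matroid-style
# top-k choice), then fold the observed rewards back into V and b.
rng = np.random.default_rng(0)
d, n_arms, k = 5, 20, 3
X = rng.normal(size=(n_arms, d))
V, b = np.eye(d), np.zeros(d)
chosen = np.argsort(-c2ucb_scores(X, V, b, alpha=1.0))[:k]
for i in chosen:
    reward = rng.normal()  # placeholder semi-bandit feedback
    V += np.outer(X[i], X[i])
    b += reward * X[i]
```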
arXiv Detail & Related papers (2021-01-20T04:29:18Z)
- Thresholded Lasso Bandit [70.17389393497125]
Thresholded Lasso bandit is an algorithm that estimates the vector defining the reward function as well as its sparse support.
We establish non-asymptotic regret upper bounds scaling as $\mathcal{O}(\log d + \sqrt{T})$ in general, and as $\mathcal{O}(\log d + \log T)$ under the so-called margin condition.
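
The estimation step can be sketched with an off-the-shelf Lasso: fit, then keep only the coordinates whose magnitude clears a threshold. This is a generic sketch of the thresholding idea; the paper's regularization and threshold schedules are not reproduced, and the numbers below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

def thresholded_lasso_support(contexts, rewards, lam, threshold):
    """Estimate the reward vector and its sparse support.

    Fit a Lasso, then keep coordinates whose absolute value exceeds
    `threshold` (the thresholding step the algorithm is named for).
    """
    theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(contexts, rewards).coef_
    support = np.flatnonzero(np.abs(theta_hat) > threshold)
    return theta_hat, support

# Toy use: a 100-dimensional reward vector with 3 nonzero entries.
rng = np.random.default_rng(1)
theta = np.zeros(100)
theta[[3, 17, 42]] = [1.0, -0.8, 0.5]
A = rng.normal(size=(500, 100))             # observed contexts
y = A @ theta + 0.1 * rng.normal(size=500)  # noisy rewards
_, support = thresholded_lasso_support(A, y, lam=0.05, threshold=0.2)
print(support)  # should recover [3, 17, 42]
```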
arXiv Detail & Related papers (2020-10-22T19:14:37Z)
- Dimension Reduction in Contextual Online Learning via Nonparametric Variable Selection [8.250528027387196]
The literature has shown that the optimal regret is $\tilde{O}(T^{(d_x^*+d_y+1)/(d_x^*+d_y+2)})$.
We propose a variable selection algorithm called BV-LASSO, which incorporates novel ideas such as binning and voting to apply LASSO to nonparametric settings.
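
The binning-and-voting idea can be sketched as: split the samples into bins along a covariate, fit a Lasso in each bin, and keep the variables selected in a sufficient fraction of bins. The sketch below makes simplifying assumptions (uniform quantile bins along a single covariate, one fixed regularization weight, a hypothetical voting fraction) and is not the paper's BV-LASSO specification.

```python
import numpy as np
from sklearn.linear_model import Lasso

def bv_lasso_select(X, y, bin_feature, n_bins, lam, vote_frac=0.5):
    """Binning + voting variable selection (illustrative sketch only).

    Split samples into `n_bins` quantile bins along one covariate,
    fit a Lasso per bin, and keep variables selected in at least a
    `vote_frac` fraction of the bins.
    """
    edges = np.quantile(X[:, bin_feature], np.linspace(0, 1, n_bins + 1))
    votes = np.zeros(X.shape[1])
    for b in range(n_bins):
        mask = (X[:, bin_feature] >= edges[b]) & (X[:, bin_feature] <= edges[b + 1])
        if mask.sum() < 2:
            continue  # skip nearly empty bins
        coef = Lasso(alpha=lam, fit_intercept=False).fit(X[mask], y[mask]).coef_
        votes += np.abs(coef) > 1e-8
    return np.flatnonzero(votes >= vote_frac * n_bins)

# Toy use: only coordinates {0, 2} matter, with a nonlinear effect in 0.
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 30))
y = np.sin(X[:, 0]) + 0.5 * X[:, 2] + 0.1 * rng.normal(size=600)
print(bv_lasso_select(X, y, bin_feature=0, n_bins=5, lam=0.05))
```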
arXiv Detail & Related papers (2020-09-17T13:08:45Z)
- Taking a hint: How to leverage loss predictors in contextual bandits? [63.546913998407405]
We study learning in contextual bandits with the help of loss predictors.
We show that the optimal regret is $\mathcal{O}(\min\{\sqrt{T}, \sqrt{\mathcal{E}}\,T^{1/3}\})$ when $\mathcal{E}$ is known.
arXiv Detail & Related papers (2020-03-04T07:36:38Z)