Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems
- URL: http://arxiv.org/abs/2401.07298v1
- Date: Sun, 14 Jan 2024 14:14:19 GMT
- Title: Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems
- Authors: Yue Kang, Cho-Jui Hsieh, Thomas C. M. Lee
- Abstract summary: We study the generalized low-rank matrix bandit problem, proposed in \cite{lu2021low} under the Generalized Linear Model (GLM) framework.
To overcome the computational infeasibility and theoretical limitations of existing algorithms, we first propose the G-ESTT framework.
We show that G-ESTT can achieve a $\tilde{O}(\sqrt{(d_1+d_2)MrT})$ bound of regret, while G-ESTS can achieve a $\tilde{O}(\sqrt{(d_1+d_2)^{3/2}Mr^{3/2}T})$ bound.
- Score: 61.85150061213987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the stochastic contextual low-rank matrix bandit problem, the expected
reward of an action is given by the inner product between the action's feature
matrix and some fixed, but initially unknown $d_1$ by $d_2$ matrix $\Theta^*$
with rank $r \ll \min\{d_1, d_2\}$, and an agent sequentially takes actions based
on past experience to maximize the cumulative reward. In this paper, we study
the generalized low-rank matrix bandit problem, which has been recently
proposed in \cite{lu2021low} under the Generalized Linear Model (GLM)
framework. To overcome the computational infeasibility and theoretical
limitations of existing algorithms for this problem, we first propose the G-ESTT
framework, which modifies the idea from \cite{jun2019bilinear} by using Stein's
method for subspace estimation and then leverages the estimated subspaces via a
regularization idea. Furthermore, we markedly improve the efficiency of
G-ESTT by using a novel exclusion idea on the estimated subspace instead, and
propose the G-ESTS framework. We also show that G-ESTT can achieve the
$\tilde{O}(\sqrt{(d_1+d_2)MrT})$ bound of regret while G-ESTS can achieve the
$\tilde{O}(\sqrt{(d_1+d_2)^{3/2}Mr^{3/2}T})$ bound of regret under mild
assumptions, up to logarithmic factors, where $M$ is some problem-dependent value.
Under the reasonable assumption that $M = O((d_1+d_2)^2)$ in our problem setting,
the regret of G-ESTT is consistent with the current best regret of
$\tilde{O}((d_1+d_2)^{3/2} \sqrt{rT}/D_{rr})$~\citep{lu2021low} ($D_{rr}$ will
be defined later). For completeness, we conduct experiments to illustrate that
our proposed algorithms, especially G-ESTS, are also computationally tractable
and consistently outperform other state-of-the-art (generalized) linear matrix
bandit methods in a suite of simulations.
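To make the abstract's model concrete, here is a minimal sketch, in Python, of the generalized low-rank reward model and the SVD-based subspace step that G-ESTT/G-ESTS build on. The logistic link, the dimensions, the noise level, and the rank-$r$ rotation of arm features are illustrative assumptions on our part; the paper's actual Stein's-method estimator, regularization step (G-ESTT), and exclusion step (G-ESTS) are more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r = 8, 6, 2  # ambient dimensions and rank (illustrative)

# Unknown rank-r parameter Theta* = A B^T, as in the abstract.
A = rng.standard_normal((d1, r))
B = rng.standard_normal((d2, r))
Theta_star = A @ B.T

def expected_reward(X, Theta):
    """GLM expected reward mu(<X, Theta>) with a logistic link (assumed here);
    <X, Theta> = trace(X^T Theta) is the matrix inner product."""
    z = np.trace(X.T @ Theta)
    return 1.0 / (1.0 + np.exp(-z))

# Stand-in for a learned estimate (the paper obtains one via Stein's method);
# here we simply perturb Theta* to illustrate the subspace step.
Theta_hat = Theta_star + 0.1 * rng.standard_normal((d1, d2))

# Rank-r SVD of the estimate yields the left/right singular subspaces.
U, s, Vt = np.linalg.svd(Theta_hat)
U_hat, V_hat = U[:, :r], Vt[:r, :].T

# Schematic dimension reduction: rotate an arm's feature matrix into the
# estimated subspaces, shrinking the effective dimension from d1*d2
# toward (d1 + d2)r, which is what drives the improved regret rates.
X = rng.standard_normal((d1, d2))  # a candidate action's feature matrix
X_low = U_hat.T @ X @ V_hat        # r x r block of rotated features
print(expected_reward(X, Theta_star), X_low.shape)
```

In this reduced representation a standard (generalized) linear bandit can be run over far fewer effective parameters, which is the intuition behind both frameworks.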
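The consistency claim is a one-line substitution: under $M = O((d_1+d_2)^2)$, the G-ESTT bound becomes $\tilde{O}(\sqrt{(d_1+d_2)MrT}) = \tilde{O}(\sqrt{(d_1+d_2)^3 rT}) = \tilde{O}((d_1+d_2)^{3/2}\sqrt{rT})$, which matches the $\tilde{O}((d_1+d_2)^{3/2}\sqrt{rT}/D_{rr})$ rate of \citep{lu2021low} up to the problem-dependent factor $D_{rr}$.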
Related papers
- LC-Tsallis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits [38.41164102066483]
This study considers the linear contextual bandit problem with independent and identically distributed contexts.
Our proposed algorithm is based on Follow-The-Regularized-Leader with Tsallis entropy and is referred to as the $\alpha$-Linear-Contextual (LC)-Tsallis-INF.
arXiv Detail & Related papers (2024-03-05T18:59:47Z) - Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery [45.601316850669406]
We present efficient algorithms for policy evaluation, best policy identification and regret minimization.
For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal.
All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix.
arXiv Detail & Related papers (2024-02-24T06:36:08Z) - Best-of-Both-Worlds Algorithms for Linear Contextual Bandits [11.94312915280916]
We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits.
Our algorithms deliver near-optimal regret bounds in both the stochastic and adversarial regimes.
arXiv Detail & Related papers (2023-12-24T08:27:30Z) - Revisiting Weighted Strategy for Non-stationary Parametric Bandits [82.1942459195896]
This paper revisits the weighted strategy for non-stationary parametric bandits.
We propose a refined analysis framework, which produces a simpler weight-based algorithm.
Our new framework can be used to improve regret bounds of other parametric bandits.
arXiv Detail & Related papers (2023-03-05T15:11:14Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Thresholded Lasso Bandit [70.17389393497125]
Thresholded Lasso bandit is an algorithm that estimates the vector defining the reward function as well as its sparse support (a minimal sketch of this thresholding step appears after this list).
We establish non-asymptotic regret upper bounds scaling as $\mathcal{O}(\log d + \sqrt{T})$ in general, and as $\mathcal{O}(\log d + \log T)$ under the so-called margin condition.
arXiv Detail & Related papers (2020-10-22T19:14:37Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Low-Rank Generalized Linear Bandit Problems [19.052901817304743]
In a low-rank linear bandit problem, the reward of an action is the inner product between the action and an unknown low-rank matrix $\Theta^*$.
We propose an algorithm based on a novel combination of online-to-confidence-set conversion \citep{abbasi2012online} and the exponentially weighted average forecaster constructed by a covering of low-rank matrices.
arXiv Detail & Related papers (2020-06-04T15:30:14Z) - Naive Exploration is Optimal for Online LQR [49.681825576239355]
We show that the optimal regret scales as $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state.
Our lower bounds rule out the possibility of a $\mathrm{poly}(\log T)$-regret algorithm, which had been...
arXiv Detail & Related papers (2020-01-27T03:44:54Z)
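As a companion to the Thresholded Lasso entry above, here is a minimal sketch of the thresholding step it describes: fit a Lasso estimate of the reward vector, then keep only coordinates whose magnitude clears a threshold as the estimated sparse support. The synthetic data, regularization weight, and the simple max-relative threshold are our illustrative assumptions, not the paper's tuned choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
d, n, s0 = 100, 200, 5                        # ambient dim, samples, true sparsity

theta = np.zeros(d)
theta[:s0] = 1.0                              # sparse vector defining the reward
X = rng.standard_normal((n, d))               # observed contexts
y = X @ theta + 0.1 * rng.standard_normal(n)  # noisy linear rewards

# Lasso estimate, then threshold small coordinates to estimate the support.
theta_hat = Lasso(alpha=0.05).fit(X, y).coef_
support = np.flatnonzero(np.abs(theta_hat) > 0.5 * np.abs(theta_hat).max())
print("estimated support:", support)          # ideally {0, ..., 4}
```

Restricting subsequent estimation to this recovered support is what yields the $\mathcal{O}(\log d + \sqrt{T})$-type regret scalings quoted above.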