Stochastic Linear Bandits with Protected Subspace
- URL: http://arxiv.org/abs/2011.01016v2
- Date: Mon, 1 Mar 2021 21:40:26 GMT
- Title: Stochastic Linear Bandits with Protected Subspace
- Authors: Advait Parulekar, Soumya Basu, Aditya Gopalan, Karthikeyan Shanmugam,
Sanjay Shakkottai
- Abstract summary: We study a variant of the linear bandit problem wherein we optimize a linear objective function but rewards are accrued only orthogonal to an unknown subspace.
In particular, at each round, the learner must choose whether to query the objective or the protected subspace alongside choosing an action.
Our algorithm, derived from the OFUL principle, uses some of the queries to get an estimate of the protected space.
- Score: 51.43660657268171
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a variant of the stochastic linear bandit problem wherein we
optimize a linear objective function but rewards are accrued only orthogonal to
an unknown subspace (which we interpret as a \textit{protected space}) given
only zero-order stochastic oracle access to both the objective itself and
protected subspace. In particular, at each round, the learner must choose
whether to query the objective or the protected subspace alongside choosing an
action. Our algorithm, derived from the OFUL principle, uses some of the
queries to get an estimate of the protected space, and (in almost all rounds)
plays optimistically with respect to a confidence set for this space. We
provide a $\tilde{O}(sd\sqrt{T})$ regret upper bound in the case where the
action space is the complete unit ball in $\mathbb{R}^d$, $s < d$ is the
dimension of the protected subspace, and $T$ is the time horizon. Moreover, we
demonstrate that a discrete action space can lead to linear regret with an
optimistic algorithm, reinforcing the sub-optimality of optimism in certain
settings. We also show that protection constraints imply that for certain
settings, no consistent algorithm can have a regret smaller than
$\Omega(T^{3/4}).$ We finally empirically validate our results with synthetic
and real datasets.
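To make the setting above concrete, here is a minimal Python sketch of an OFUL-style loop for a protected-subspace bandit. It is an illustrative approximation, not the paper's algorithm: the form of the protection oracle, the query-allocation rule, the moment-based subspace estimator, the confidence radius `beta`, and the regret accounting are all simplifying assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, T, noise = 5, 2, 2000, 0.1

theta_star = rng.normal(size=d)                 # unknown objective vector
U = np.linalg.qr(rng.normal(size=(d, s)))[0]    # basis of the protected subspace (unknown to the learner)
P_perp = np.eye(d) - U @ U.T                    # true projector onto its orthogonal complement

def reward_oracle(x):
    # noisy objective query: reward accrues only orthogonal to the protected subspace
    return theta_star @ (P_perp @ x) + noise * rng.normal()

def protection_oracle(x):
    # hypothetical zero-order protection query: noisy squared norm of the protected
    # component of x (the paper's oracle may take a different form)
    return np.linalg.norm(U.T @ x) ** 2 + noise * rng.normal()

V = np.eye(d)                                   # regularized design matrix for the ridge estimate
b = np.zeros(d)
prot_actions, prot_responses = [], []
opt = np.linalg.norm(P_perp @ theta_star)       # best achievable mean reward on the unit ball
regret = 0.0                                    # simplified accounting: protection rounds earn zero reward

for t in range(1, T + 1):
    # placeholder allocation rule: occasionally query the protection oracle
    if t <= 5 * d or t % 10 == 0:
        x = rng.normal(size=d)
        x /= np.linalg.norm(x)
        prot_actions.append(x)
        prot_responses.append(protection_oracle(x))
        regret += opt
        continue

    # crude moment-based estimate of the protected subspace from protection queries
    A = np.array(prot_actions)
    y = np.array(prot_responses)
    M = (A * y[:, None]).T @ A                  # ~ sum_i y_i x_i x_i^T
    top = np.linalg.eigh(M)[1][:, -s:]          # top-s eigenvectors approximate the protected basis
    P_hat = np.eye(d) - top @ top.T             # estimated orthogonal projector

    # approximate optimistic (OFUL-style) step over the unit ball: score random
    # candidate directions by the projected ridge estimate plus an exploration bonus
    theta_hat = np.linalg.solve(V, b)
    V_inv = np.linalg.inv(V)
    beta = 1.0 + np.sqrt(np.log(t + 1))         # heuristic confidence radius
    cands = rng.normal(size=(256, d))
    cands /= np.linalg.norm(cands, axis=1, keepdims=True)
    bonus = np.sqrt(np.einsum('ij,jk,ik->i', cands, V_inv, cands))
    x = cands[np.argmax(cands @ (P_hat @ theta_hat) + beta * bonus)]

    r = reward_oracle(x)
    regret += opt - theta_star @ (P_perp @ x)
    V += np.outer(x, x)
    b += r * x

print(f"simplified cumulative regret over {T} rounds: {regret:.2f}")
```

The random candidate set stands in for the exact optimistic maximization over the unit ball; per the abstract, the paper's algorithm instead plays optimistically with respect to a confidence set for the protected space built from the oracle queries.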
Related papers
- Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery [45.601316850669406]
We present efficient algorithms for policy evaluation, best policy identification and regret minimization.
For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal.
All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix.
arXiv Detail & Related papers (2024-02-24T06:36:08Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - On the Interplay Between Misspecification and Sub-optimality Gap in
Linear Contextual Bandits [76.2262680277608]
We study linear contextual bandits in the misspecified setting, where the expected reward function can be approximated by a linear function class.
We show that our algorithm enjoys the same gap-dependent regret bound $\tilde{O}(d^2/\Delta)$ as in the well-specified setting up to logarithmic factors.
arXiv Detail & Related papers (2023-03-16T15:24:29Z) - Exploration in Linear Bandits with Rich Action Sets and its Implications
for Inference [23.364534479492715]
We show that the minimum eigenvalue of the expected design matrix grows as $\Omega(\sqrt{n})$ whenever the expected cumulative regret of the algorithm is $O(\sqrt{n})$.
We apply our result to two practical scenarios -- \emph{model selection} and \emph{clustering} in linear bandits.
arXiv Detail & Related papers (2022-07-23T20:25:07Z) - Towards Painless Policy Optimization for Constrained MDPs [46.12526917024248]
We study policy optimization in an infinite horizon, $\gamma$-discounted constrained Markov decision process (CMDP).
Our objective is to return a policy that achieves large expected reward with a small constraint violation.
We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms.
arXiv Detail & Related papers (2022-04-11T15:08:09Z) - High-Dimensional Experimental Design and Kernel Bandits [9.401375475860561]
Methods from optimal linear experimental design have been leveraged to obtain state of the art results for linear bandits.
A design returned from an objective such as $G$-optimal design is actually a probability distribution over a pool of potential measurement vectors.
We propose a rounding procedure that frees $N$ of any dependence on the dimension $d$.
arXiv Detail & Related papers (2021-05-12T17:10:56Z) - Nearly Dimension-Independent Sparse Linear Bandit over Small Action
Spaces via Best Subset Selection [71.9765117768556]
We consider the contextual bandit problem under the high dimensional linear model.
This setting finds essential applications such as personalized recommendation, online advertisement, and personalized medicine.
We propose doubly growing epochs and estimating the parameter using the best subset selection method.
arXiv Detail & Related papers (2020-09-04T04:10:39Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.