Generalizing Hierarchical Bayesian Bandits
- URL: http://arxiv.org/abs/2205.15124v1
- Date: Mon, 30 May 2022 14:17:56 GMT
- Title: Generalizing Hierarchical Bayesian Bandits
- Authors: Imad Aouali, Branislav Kveton, Sumeet Katariya
- Abstract summary: A contextual bandit is a popular and practical framework for online learning to act under uncertainty.
In this work, we introduce a general framework for capturing such correlations through a two-level graphical model.
We propose a Thompson sampling algorithm G-HierTS that uses this structure to explore efficiently and bound its Bayes regret.
- Score: 14.986031916712108
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A contextual bandit is a popular and practical framework for online learning
to act under uncertainty. In many problems, the number of actions is huge and
their mean rewards are correlated. In this work, we introduce a general
framework for capturing such correlations through a two-level graphical model
where actions are related through multiple shared latent parameters. We propose
a Thompson sampling algorithm G-HierTS that uses this structure to explore
efficiently and bound its Bayes regret. The regret has two terms, one for
learning action parameters and the other for learning the shared latent
parameters. The terms reflect the structure of our model as well as the quality
of priors. Our theoretical findings are validated empirically using both
synthetic and real-world problems. We also experiment with a variant of
G-HierTS that maintains a factored posterior over latent parameters. While this approximation
does not come with guarantees, it improves computational efficiency with a
minimal impact on empirical regret.
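To make the two-level structure concrete, below is a minimal sketch of hierarchical Thompson sampling for the special case where each action is tied to a single latent parameter under a Gaussian-Gaussian model. All names, noise scales, and the conjugate updates are illustrative assumptions, not the authors' implementation; G-HierTS itself is more general and lets actions share multiple latent parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-level problem: L latent parameters, K actions, T rounds.
L, K, T = 3, 12, 2000
z = rng.integers(0, L, size=K)          # latent index of each action
mu0, sigma0 = 0.0, 1.0                  # prior on latent parameters
sigma1, sigma = 0.5, 0.5                # hierarchy noise and reward noise

b_true = rng.normal(mu0, sigma0, size=L)     # latent parameters
theta_true = rng.normal(b_true[z], sigma1)   # action parameters

n = np.zeros(K)   # pull counts
s = np.zeros(K)   # reward sums

for t in range(T):
    theta_sample = np.empty(K)
    for l in range(L):
        # Latent posterior: the empirical mean of a pulled action i in
        # group l is distributed as N(b_l, sigma1^2 + sigma^2 / n_i).
        idx = np.flatnonzero((z == l) & (n > 0))
        var_i = sigma1**2 + sigma**2 / n[idx]
        prec = 1.0 / sigma0**2 + np.sum(1.0 / var_i)
        mean = (mu0 / sigma0**2 + np.sum((s[idx] / n[idx]) / var_i)) / prec
        b_l = rng.normal(mean, np.sqrt(1.0 / prec))   # sample the latent

        # Action posterior given the sampled latent (conjugate Gaussian).
        for i in np.flatnonzero(z == l):
            prec_i = 1.0 / sigma1**2 + n[i] / sigma**2
            mean_i = (b_l / sigma1**2 + s[i] / sigma**2) / prec_i
            theta_sample[i] = rng.normal(mean_i, np.sqrt(1.0 / prec_i))

    a = int(np.argmax(theta_sample))          # act greedily on the sample
    r = rng.normal(theta_true[a], sigma)      # observe a noisy reward
    n[a] += 1
    s[a] += r
```

When actions mix several latent parameters, the exact joint posterior couples the latents; the factored variant mentioned in the abstract keeps an independent Gaussian per latent instead, which is the source of its computational savings.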
Related papers
- Batch Ensemble for Variance Dependent Regret in Stochastic Bandits [41.95653110232677]
Efficiently trading off exploration and exploitation is one of the key challenges in online Reinforcement Learning (RL).
Inspired by practical ensemble methods, in this work we propose a simple and novel batch ensemble scheme that achieves near-optimal regret for Multi-Armed Bandits (MAB).
Our algorithm has just a single parameter, namely the number of batches, and its value does not depend on distributional properties such as the scale and variance of the losses.
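As a rough illustration of the idea, here is a generic ensemble-sampling sketch with batched updates; it is not the paper's exact scheme, and the perturbation scale and prior pseudo-counts are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

K, T, M = 5, 5000, 10          # arms, horizon, ensemble size
B = 50                         # number of batches: the single tunable parameter
mu = rng.uniform(0, 1, size=K) # true Bernoulli means (synthetic)

n = np.ones(K)                         # pseudo-counts keep estimates defined
s = rng.normal(0.5, 0.5, size=(M, K))  # members start from perturbed priors

boundaries = np.linspace(0, T, B + 1, dtype=int)
for b in range(B):
    est = s / n                        # refresh estimates only at batch boundaries
    for t in range(boundaries[b], boundaries[b + 1]):
        m = rng.integers(M)            # draw an ensemble member uniformly
        a = int(np.argmax(est[m]))     # act greedily on that member's estimate
        reward = float(rng.random() < mu[a])
        n[a] += 1
        # Every member records the reward plus its own perturbation noise;
        # the resulting disagreement across members drives exploration.
        s[:, a] += reward + rng.normal(0, 0.5, size=M)
```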
arXiv Detail & Related papers (2024-09-13T06:40:56Z)
- Beyond Two-Tower Matching: Learning Sparse Retrievable Cross-Interactions for Recommendation [80.19762472699814]
Two-tower models are a prevalent matching framework for recommendation and have been widely deployed in industrial applications.
They suffer from two main challenges: limited feature interaction capability and reduced accuracy in online serving.
We propose a new matching paradigm named SparCode, which supports not only sophisticated feature interactions but also efficient retrieval.
arXiv Detail & Related papers (2023-11-30T03:13:36Z)
- Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound of $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers an $\tilde{O}(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z)
- Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning [34.4255062106615]
Thompson sampling (TS) is widely used in sequential decision making due to its ease of use and appealing empirical performance.
We propose batched Langevin Thompson Sampling algorithms that leverage MCMC methods to sample from approximate posteriors with only logarithmic communication costs in terms of batches.
Our algorithms are computationally efficient and maintain the same order-optimal regret guarantees of $\mathcal{O}(\log T)$ for MABs, and $\mathcal{O}(\sqrt{T})$ for RL.
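A minimal sketch of the Langevin mechanism for a Gaussian linear bandit follows. The step size, number of inner steps, and fixed batch length are illustrative assumptions (the paper's batch schedule is what keeps the number of batches logarithmic in $T$), and in this conjugate Gaussian case exact posterior sampling would be available; unadjusted Langevin dynamics is used purely to show the mechanism.

```python
import numpy as np

rng = np.random.default_rng(2)

d, K, T = 5, 20, 1000
A = rng.normal(size=(K, d)) / np.sqrt(d)   # action features (synthetic)
theta_true = rng.normal(size=d)
sigma = 0.5

X, y = [], []                              # observed features and rewards

def langevin_sample(theta, X, y, steps=50, eta=1e-2):
    """Approximate posterior sample via unadjusted Langevin dynamics on the
    Gaussian log posterior: N(0, I) prior plus Gaussian likelihood."""
    Xa, ya = np.asarray(X), np.asarray(y)
    for _ in range(steps):
        grad = -theta                      # gradient of the log prior
        if len(ya) > 0:
            grad += Xa.T @ (ya - Xa @ theta) / sigma**2
        theta = theta + 0.5 * eta * grad + np.sqrt(eta) * rng.normal(size=d)
    return theta

theta_s = np.zeros(d)
batch = 20                                 # fixed batch length, for simplicity
for t in range(T):
    if t % batch == 0:                     # resample / communicate only at
        theta_s = langevin_sample(theta_s, X, y)   # batch boundaries
    a = int(np.argmax(A @ theta_s))        # act on the current posterior sample
    r = A[a] @ theta_true + rng.normal(0, sigma)
    X.append(A[a]); y.append(r)
```

Reusing one sample for a whole batch is the crude version of the idea; a doubling-style schedule would give the logarithmic number of batches the summary refers to.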
arXiv Detail & Related papers (2023-06-15T01:16:29Z)
- Latent Feature Relation Consistency for Adversarial Robustness [80.24334635105829]
Deep neural networks misclassify adversarial examples, which add human-imperceptible adversarial noise to natural examples.
We propose Latent Feature Relation Consistency (LFRC).
LFRC constrains the relation of adversarial examples in latent space to be consistent with the natural examples.
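One plausible reading of that constraint, sketched below: compute pairwise similarity relations among the latent features of a natural batch and of its adversarial counterpart, then penalize their disagreement. The cosine-similarity relation and the squared penalty are guesses made for illustration, not the paper's exact objective.

```python
import numpy as np

def relation_matrix(z, eps=1e-8):
    """Pairwise cosine-similarity relations among latent features (rows)."""
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    return z @ z.T

def lfrc_loss(z_nat, z_adv):
    """Penalize disagreement between the relation structure of natural and
    adversarial latent features."""
    return np.mean((relation_matrix(z_nat) - relation_matrix(z_adv)) ** 2)

# Toy usage: a batch of 8 latent vectors and a perturbed copy of it.
rng = np.random.default_rng(3)
z_nat = rng.normal(size=(8, 32))
z_adv = z_nat + 0.1 * rng.normal(size=(8, 32))
print(lfrc_loss(z_nat, z_adv))   # small when relations are preserved
```

In training, a term like this would be added to the adversarial-training loss and minimized with the usual autograd machinery.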
arXiv Detail & Related papers (2023-03-29T13:50:01Z)
- Online Continuous Hyperparameter Optimization for Generalized Linear Contextual Bandits [55.03293214439741]
In contextual bandits, an agent sequentially selects actions from a time-dependent action set based on past experience.
We propose the first online continuous hyperparameter tuning framework for contextual bandits.
We show that it achieves sublinear regret in theory and performs consistently better than all existing methods on both synthetic and real datasets.
arXiv Detail & Related papers (2023-02-18T23:31:20Z)
- Dual Instrumental Method for Confounded Kernelized Bandits [0.0]
The contextual bandit problem is a framework with wide applications in various fields.
We propose a confounded bandit problem where the noise becomes a latent confounder that affects both contexts and rewards.
We show that a dual instrumental variable regression can correctly identify the true reward function.
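The instrumental-variable mechanism is easiest to see in the classic linear two-stage least squares setting; the sketch below is that textbook case with synthetic coefficients, not the paper's kernelized dual formulation.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
u = rng.normal(size=n)              # latent confounder
w = rng.normal(size=n)              # instrument: affects x, but y only via x
x = 1.5 * w + u + 0.1 * rng.normal(size=n)   # confounded regressor
y = 2.0 * x + u + 0.1 * rng.normal(size=n)   # true effect of x on y is 2.0

beta_ols = (x @ y) / (x @ x)        # naive OLS is biased by the confounder

x_hat = w * ((w @ x) / (w @ w))     # stage 1: project x onto the instrument
beta_iv = (x_hat @ y) / (x_hat @ x_hat)   # stage 2: recovers roughly 2.0

print(f"OLS: {beta_ols:.2f}  IV: {beta_iv:.2f}")
```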
arXiv Detail & Related papers (2022-09-07T15:25:57Z)
- Deep Hierarchy in Bandits [51.22833900944146]
Mean rewards of actions are often correlated.
To maximize statistical efficiency, it is important to leverage these correlations when learning.
We formulate a bandit variant of this problem where the correlations of mean action rewards are represented by a hierarchical Bayesian model.
arXiv Detail & Related papers (2022-02-03T08:15:53Z)
- Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the difficult nature of the one-class problem.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach, which achieves convergence speed comparable to the pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z)
- Influence Diagram Bandits: Variational Thompson Sampling for Structured Bandit Problems [40.957688390621385]
Our framework captures complex statistical dependencies between actions, latent variables, and observations.
We develop novel online learning algorithms that learn to act efficiently in our models.
arXiv Detail & Related papers (2020-07-09T16:25:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.