Master-slave Deep Architecture for Top-K Multi-armed Bandits with
Non-linear Bandit Feedback and Diversity Constraints
- URL: http://arxiv.org/abs/2308.12680v1
- Date: Thu, 24 Aug 2023 09:39:04 GMT
- Title: Master-slave Deep Architecture for Top-K Multi-armed Bandits with
Non-linear Bandit Feedback and Diversity Constraints
- Authors: Hanchi Huang, Li Shen, Deheng Ye, Wei Liu
- Abstract summary: We propose a novel master-slave architecture to solve the top-$K$ combinatorial multi-armed bandits problem with non-linear bandit feedback and diversity constraints.
To the best of our knowledge, this is the first combinatorial bandits setting to consider diversity constraints under bandit feedback.
- Score: 21.109631268204215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel master-slave architecture to solve the top-$K$
combinatorial multi-armed bandits problem with non-linear bandit feedback and
diversity constraints, which, to the best of our knowledge, is the first
combinatorial bandits setting considering diversity constraints under bandit
feedback. Specifically, to efficiently explore the combinatorial and
constrained action space, we introduce six slave models with distinct merits
that generate diversified samples balancing rewards, constraints, and
efficiency. Moreover, we propose teacher-learning-based optimization
and the policy co-training technique to boost the performance of the multiple
slave models. The master model then collects the elite samples provided by the
slave models and selects the best sample estimated by a neural contextual
UCB-based network to make a decision with a trade-off between exploration and
exploitation. Thanks to the elaborate design of slave models, the co-training
mechanism among slave models, and the novel interactions between the master and
slave models, our approach significantly surpasses existing state-of-the-art
algorithms on both synthetic and real-world datasets for recommendation tasks. The
code is available at:
\url{https://github.com/huanghanchi/Master-slave-Algorithm-for-Top-K-Bandits}.
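To make the described interaction concrete, below is a minimal, hypothetical Python sketch of the master-slave loop: a few stand-in "slave" samplers propose candidate top-$K$ arm sets, and a "master" filters out candidates that violate a diversity constraint, then picks the remaining candidate with the highest neural contextual UCB score (mean prediction plus an exploration bonus). The class and function names, the diagonal UCB approximation, and the toy group-coverage diversity constraint are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Hypothetical sketch of the master-slave selection loop described in the
# abstract. Names and the simple diversity constraint are assumptions made
# for illustration only.
import numpy as np
import torch
import torch.nn as nn


class NeuralUCBScorer(nn.Module):
    """Scores a candidate top-K arm set from its pooled context features with
    an optimistic estimate: mean prediction plus an exploration width."""

    def __init__(self, ctx_dim: int, hidden: int = 64, beta: float = 1.0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(ctx_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
        self.beta = beta
        # Diagonal approximation of the design matrix used for the UCB width.
        self.diag = torch.ones(ctx_dim)

    def ucb(self, ctx: torch.Tensor) -> torch.Tensor:
        mean = self.net(ctx).squeeze(-1)
        width = self.beta * torch.sqrt((ctx ** 2 / self.diag).sum(-1))
        return mean + width

    def update(self, ctx: torch.Tensor) -> None:
        # Crude sequential update that shrinks the bonus for visited contexts.
        self.diag += (ctx ** 2).sum(0)


def diverse_enough(arm_set, arm_groups, min_groups: int = 2) -> bool:
    """Toy diversity constraint: the selected arms must cover several groups."""
    return len({int(arm_groups[a]) for a in arm_set}) >= min_groups


def master_select(candidates, arm_features, arm_groups, scorer):
    """Keep candidates that satisfy the diversity constraint, then pick the
    one whose pooled context has the highest neural-UCB score."""
    feasible = [c for c in candidates if diverse_enough(c, arm_groups)]
    if not feasible:
        return None
    with torch.no_grad():
        ctx = torch.stack([arm_features[list(c)].mean(0) for c in feasible])
        best = int(torch.argmax(scorer.ucb(ctx)))
    scorer.update(ctx[best:best + 1])
    return feasible[best]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_arms, ctx_dim, K = 20, 8, 3
    arm_features = torch.tensor(rng.normal(size=(n_arms, ctx_dim)),
                                dtype=torch.float32)
    arm_groups = rng.integers(0, 4, size=n_arms)  # item categories
    # Stand-ins for the six slave models: each proposes one candidate set.
    candidates = [tuple(int(a) for a in rng.choice(n_arms, K, replace=False))
                  for _ in range(6)]
    scorer = NeuralUCBScorer(ctx_dim)
    print("master picks arms:", master_select(candidates, arm_features,
                                              arm_groups, scorer))
```

In the actual method the slave models are trained with teacher learning and policy co-training rather than sampling uniformly at random; the sketch is only meant to mirror the high-level pattern of collecting elite samples and selecting among them with a neural contextual UCB.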
Related papers
- Online Clustering of Dueling Bandits [59.09590979404303]
We introduce the first "clustering of dueling bandit algorithms" to enable collaborative decision-making based on preference feedback.
We propose two novel algorithms: (1) Clustering of Linear Dueling Bandits (COLDB) which models the user reward functions as linear functions of the context vectors, and (2) Clustering of Neural Dueling Bandits (CONDB) which uses a neural network to model complex, non-linear user reward functions.
arXiv Detail & Related papers (2025-02-04T07:55:41Z)
- Offline Learning for Combinatorial Multi-armed Bandits [56.96242764723241]
Off-CMAB is the first offline learning framework for CMAB.
Off-CMAB combines pessimistic reward estimations with combinatorial solvers.
Experiments on synthetic and real-world datasets highlight the superior performance of CLCB.
arXiv Detail & Related papers (2025-01-31T16:56:18Z)
- Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent [74.02034188307857]
Merging multiple expert models offers a promising approach for performing multi-task learning without accessing their original data.
We find existing methods inevitably discard task-specific information that, while causing conflicts, is crucial for performance.
Our approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains.
arXiv Detail & Related papers (2025-01-02T12:45:21Z)
- Combinatorial Rising Bandit [29.357803270943254]
We introduce the combinatorial rising bandit problem to minimize policy regret and propose a provably efficient algorithm called Combinatorial Rising Upper Confidence Bound (CRUCB).
We show that CRUCB is provably efficient: its regret upper bound is close to the regret lower bound.
In addition, we empirically demonstrate the effectiveness and superiority of CRUCB not only in synthetic environments but also in realistic applications of deep reinforcement learning.
arXiv Detail & Related papers (2024-12-01T12:52:18Z)
- Optimal Design for Reward Modeling in RLHF [83.3614658277817]
We formalize the reward-model training problem in Reinforcement Learning from Human Feedback (RLHF).
We frame the selection of an effective dataset as a simple regret minimization task.
We derive bounds on the simple regret under appropriate assumptions.
arXiv Detail & Related papers (2024-10-22T14:36:44Z)
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
arXiv Detail & Related papers (2024-07-24T09:23:22Z)
- Multi-Source Test-Time Adaptation as Dueling Bandits for Extractive Question Answering [25.44581667865143]
We study multi-source test-time model adaptation from user feedback, where K distinct models are established for adaptation.
We discuss two frameworks: multi-armed bandit learning and multi-armed dueling bandits.
Compared to multi-armed bandit learning, the dueling framework allows pairwise collaboration among K models, which is solved by a novel method named Co-UCB proposed in this work.
arXiv Detail & Related papers (2023-06-11T21:18:50Z)
- Bias-Robust Bayesian Optimization via Dueling Bandit [57.82422045437126]
We consider Bayesian optimization in settings where observations can be adversarially biased.
We propose a novel approach for dueling bandits based on information-directed sampling (IDS).
Thereby, we obtain the first efficient kernelized algorithm for dueling bandits that comes with cumulative regret guarantees.
arXiv Detail & Related papers (2021-05-25T10:08:41Z)
- Effects of Model Misspecification on Bayesian Bandits: Case Studies in UX Optimization [8.704145252476705]
We present a novel formulation as a restless, sleeping bandit with unobserved confounders plus optional stopping.
Case studies show how common misspecifications can lead to sub-optimal rewards.
We also present the first model to exploit cointegration in a restless bandit, demonstrating that finite regret and fast and consistent optional stopping are possible.
arXiv Detail & Related papers (2020-10-07T14:34:28Z)