Master-slave Deep Architecture for Top-K Multi-armed Bandits with
Non-linear Bandit Feedback and Diversity Constraints
- URL: http://arxiv.org/abs/2308.12680v1
- Date: Thu, 24 Aug 2023 09:39:04 GMT
- Title: Master-slave Deep Architecture for Top-K Multi-armed Bandits with
Non-linear Bandit Feedback and Diversity Constraints
- Authors: Hanchi Huang, Li Shen, Deheng Ye, Wei Liu
- Abstract summary: We propose a novel master-slave architecture to solve the top-$K$ combinatorial multi-armed bandits problem with non-linear bandit feedback and diversity constraints.
To the best of our knowledge, it is the first combinatorial bandits setting considering diversity constraints under bandit feedback.
- Score: 21.109631268204215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel master-slave architecture to solve the top-$K$
combinatorial multi-armed bandits problem with non-linear bandit feedback and
diversity constraints, which, to the best of our knowledge, is the first
combinatorial bandits setting considering diversity constraints under bandit
feedback. Specifically, to efficiently explore the combinatorial and
constrained action space, we introduce six slave models with distinguished
merits to generate diversified samples well balancing rewards and constraints
as well as efficiency. Moreover, we propose teacher learning based optimization
and the policy co-training technique to boost the performance of the multiple
slave models. The master model then collects the elite samples provided by the
slave models and selects the best sample estimated by a neural contextual
UCB-based network to make a decision with a trade-off between exploration and
exploitation. Thanks to the elaborate design of slave models, the co-training
mechanism among slave models, and the novel interactions between the master and
slave models, our approach significantly surpasses existing state-of-the-art
algorithms in both synthetic and real datasets for recommendation tasks. The
code is available at:
\url{https://github.com/huanghanchi/Master-slave-Algorithm-for-Top-K-Bandits}.
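To make the decision loop concrete, below is a minimal, hypothetical sketch of the master-slave interaction described in the abstract: several slave proposal routines each suggest a candidate top-$K$ subset, and the master scores each candidate with a neural reward estimate plus a UCB-style exploration bonus. All names here (`RewardNet`, `master_select`, `featurize`) are illustrative assumptions, not the authors' API; the actual implementation is in the repository linked above.
```python
# Illustrative sketch of the master-slave selection step (not the authors' code;
# see the repository linked above for the real implementation).
import numpy as np
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Small MLP standing in for the neural contextual scorer (assumption)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def master_select(candidates, featurize, reward_net, counts, beta=1.0):
    """Pick the candidate top-K subset with the best optimistic score.

    candidates: list of tuples of arm indices proposed by the slave models
    featurize:  maps a subset to a feature vector (np.ndarray)
    counts:     dict mapping subset -> number of times it has been played
    """
    best, best_score = None, -np.inf
    for subset in candidates:
        x = torch.as_tensor(featurize(subset), dtype=torch.float32)
        estimate = reward_net(x).item()                      # exploitation term
        bonus = beta / np.sqrt(counts.get(subset, 0) + 1.0)  # simple UCB-style bonus
        if estimate + bonus > best_score:
            best, best_score = subset, estimate + bonus
    return best
```
In the paper, the slave models are six distinct samplers that balance reward, diversity constraints, and efficiency; the sketch above only captures the master's "collect elite samples, score optimistically, pick one" step.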
Related papers
- Optimal Design for Reward Modeling in RLHF [83.3614658277817]
We formalize the reward training model in Reinforcement Learning from Human Feedback.
We frame the selection of an effective dataset as a simple regret minimization task.
We derive bounds on the simple regret under appropriate assumptions.
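For context, simple regret (assuming the standard notion; the notation below is ours, not the paper's) measures how far the finally recommended option is from the best one, rather than accumulating loss over rounds:
$$ r_T = \mu^{*} - \mu_{\hat{a}_T}, \qquad \mu^{*} = \max_a \mu_a, $$
where $\hat{a}_T$ is the option recommended after the budget of $T$ rounds is spent.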
arXiv Detail & Related papers (2024-10-22T14:36:44Z)
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
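As a generic illustration of fitting a reward model from pairwise preference feedback (a Bradley-Terry-style logistic loss on reward differences; a common recipe, not necessarily the exact algorithm of the paper):
```python
# Generic sketch: fit a neural reward model from pairwise preference feedback
# with a Bradley-Terry / logistic loss. Not the paper's exact algorithm.
import torch
import torch.nn as nn

reward_net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

def preference_loss(x_winner: torch.Tensor, x_loser: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r(winner) - r(loser)): the winning arm should score higher."""
    diff = reward_net(x_winner) - reward_net(x_loser)
    return nn.functional.softplus(-diff).mean()

# One update on a toy batch of dueling outcomes (contexts of winning/losing arms).
x_w, x_l = torch.randn(16, 8), torch.randn(16, 8)
optimizer.zero_grad()
loss = preference_loss(x_w, x_l)
loss.backward()
optimizer.step()
```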
arXiv Detail & Related papers (2024-07-24T09:23:22Z)
- Jump Starting Bandits with LLM-Generated Prior Knowledge [5.344012058238259]
We show that Large Language Models can jump-start contextual multi-armed bandits to reduce online learning regret.
We propose an algorithm for contextual bandits by prompting LLMs to produce a pre-training dataset of approximate human preferences for the bandit.
arXiv Detail & Related papers (2024-06-27T16:52:19Z)
- ReconBoost: Boosting Can Achieve Modality Reconcilement [89.4377895465204]
We study the modality-alternating learning paradigm to achieve reconcilement.
We propose a new method called ReconBoost, which updates one modality at a time.
We show that the proposed method resembles Friedman's Gradient-Boosting (GB) algorithm, where the updated learner can correct errors made by others.
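For reference, the classical Friedman-style gradient boosting loop the summary alludes to fits each new learner to the residual errors left by the current ensemble. This sketch shows plain boosting on squared loss, not ReconBoost itself:
```python
# Classical gradient boosting on squared loss: each new learner is fit to the
# residuals of the current ensemble, i.e. it corrects the others' errors.
# This illustrates the referenced GB scheme, not ReconBoost itself.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_rounds=50, lr=0.1):
    prediction = np.zeros_like(y, dtype=float)
    learners = []
    for _ in range(n_rounds):
        residual = y - prediction                  # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        prediction += lr * tree.predict(X)         # new learner corrects the ensemble
        learners.append(tree)
    return learners
```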
arXiv Detail & Related papers (2024-05-15T13:22:39Z)
- RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
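As a rough illustration of how prompt-chosen-rejected trios are typically used to score a reward model (accuracy of ranking the chosen response above the rejected one); this helper is hypothetical and is not the RewardBench codebase:
```python
# Hypothetical helper: fraction of trios where the reward model ranks the
# chosen response above the rejected one. Not the RewardBench codebase.
def pairwise_accuracy(reward_fn, trios):
    """trios: list of (prompt, chosen, rejected) strings;
    reward_fn(prompt, response) -> float score."""
    correct = sum(
        reward_fn(prompt, chosen) > reward_fn(prompt, rejected)
        for prompt, chosen, rejected in trios
    )
    return correct / len(trios)
```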
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
- A Minimaximalist Approach to Reinforcement Learning from Human Feedback [49.45285664482369]
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback.
Our approach is minimalist in that it does not require training a reward model nor unstable adversarial training.
We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model based approaches.
arXiv Detail & Related papers (2024-01-08T17:55:02Z)
- Multi-Source Test-Time Adaptation as Dueling Bandits for Extractive Question Answering [25.44581667865143]
We study multi-source test-time model adaptation from user feedback, where K distinct models are established for adaptation.
We discuss two frameworks: multi-armed bandit learning and multi-armed dueling bandits.
Compared to multi-armed bandit learning, the dueling framework allows pairwise collaboration among the K models, which is solved by a novel method named Co-UCB proposed in that work.
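As a generic sketch of the dueling-bandit machinery over K candidate models (optimistic selection from pairwise win statistics); this is not the paper's Co-UCB, whose collaboration mechanism differs:
```python
# Generic dueling-bandit sketch over K candidate models: keep pairwise win
# counts and pick the pair with the most optimistic estimated win probability.
# Illustrates the dueling framework only, not the paper's Co-UCB method.
import numpy as np

def select_pair(wins: np.ndarray, plays: np.ndarray, t: int, c: float = 1.0):
    """wins[i, j]: times model i beat model j; plays[i, j]: comparisons of (i, j)."""
    K = wins.shape[0]
    best_pair, best_ucb = None, -np.inf
    for i in range(K):
        for j in range(K):
            if i == j:
                continue
            n = plays[i, j]
            mean = wins[i, j] / n if n > 0 else 0.5
            ucb = mean + c * np.sqrt(np.log(t + 1) / (n + 1))
            if ucb > best_ucb:
                best_pair, best_ucb = (i, j), ucb
    return best_pair  # duel these two models on the next batch of user feedback
```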
arXiv Detail & Related papers (2023-06-11T21:18:50Z)
- Stochastic Rising Bandits [40.32303434592863]
We study a particular case of the rested and restless bandits in which the arms' expected payoff is monotonically non-decreasing.
This characteristic allows designing specifically crafted algorithms that exploit the regularity of the payoffs to provide tight regret bounds.
We empirically compare our algorithms with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem for a real-world dataset.
arXiv Detail & Related papers (2022-12-07T17:30:45Z)
- Bias-Robust Bayesian Optimization via Dueling Bandit [57.82422045437126]
We consider Bayesian optimization in settings where observations can be adversarially biased.
We propose a novel approach for dueling bandits based on information-directed sampling (IDS).
Thereby, we obtain the first efficient kernelized algorithm for dueling bandits that comes with cumulative regret guarantees.
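For background, generic information-directed sampling chooses the action distribution that minimizes the information ratio (this is the standard IDS principle, not necessarily the exact formulation used for the dueling setting in the paper):
$$ \pi_t \in \arg\min_{\pi} \; \frac{\big(\mathbb{E}_{a\sim\pi}[\Delta_t(a)]\big)^2}{\mathbb{E}_{a\sim\pi}[I_t(a)]}, $$
where $\Delta_t(a)$ is the expected instantaneous regret of action $a$ and $I_t(a)$ is the expected information gained about the optimal action.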
arXiv Detail & Related papers (2021-05-25T10:08:41Z)
- Effects of Model Misspecification on Bayesian Bandits: Case Studies in UX Optimization [8.704145252476705]
We present a novel formulation as a restless, sleeping bandit with unobserved confounders plus optional stopping.
Case studies show how common misspecifications can lead to sub-optimal rewards.
We also present the first model to exploit cointegration in a restless bandit, demonstrating that finite regret and fast and consistent optional stopping are possible.
arXiv Detail & Related papers (2020-10-07T14:34:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.