Combinatorial Rising Bandit
- URL: http://arxiv.org/abs/2412.00798v2
- Date: Mon, 03 Feb 2025 09:25:56 GMT
- Title: Combinatorial Rising Bandit
- Authors: Seockbean Song, Youngsik Yoon, Siwei Wang, Wei Chen, Jungseul Ok,
- Abstract summary: We introduce the problem of rising bandit to minimize policy regret and propose a provably efficient algorithm called Combinatorial Rising Upper Confidence Bound (CRUCB)
CRUCB is provably efficient by showing that the regret upper bound is close to the regret lower bound.
In addition, we empirically demonstrate the effectiveness and superiority of CRUCB not only in synthetic environments but also in realistic applications of deep reinforcement learning.
- Score: 29.357803270943254
- License:
- Abstract: Combinatorial online learning is a fundamental task to decide the optimal combination of base arms in sequential interactions with systems providing uncertain rewards, which is applicable to diverse domains such as robotics, social advertising, network routing and recommendation systems. In real-world scenarios, we often observe rising rewards, where the selection of a base arm not only provides an instantaneous reward but also contributes to the enhancement of future rewards, {\it e.g.}, robots enhancing proficiency through practice and social influence strengthening in the history of successful recommendations. To address this, we introduce the problem of combinatorial rising bandit to minimize policy regret and propose a provably efficient algorithm, called Combinatorial Rising Upper Confidence Bound (CRUCB), of which regret upper bound is close to a regret lower bound. To the best of our knowledge, previous studies do not provide a sub-linear regret lower bound, making it impossible to assess the efficiency of their algorithms. However, we provide the sub-linear regret lower bound for combinatorial rising bandit and show that CRUCB is provably efficient by showing that the regret upper bound is close to the regret lower bound. In addition, we empirically demonstrate the effectiveness and superiority of CRUCB not only in synthetic environments but also in realistic applications of deep reinforcement learning.
Related papers
- Best Arm Identification with Minimal Regret [55.831935724659175]
Best arm identification problem elegantly amalgamates regret minimization and BAI.
Agent's goal is to identify the best arm with a prescribed confidence level.
Double KL-UCB algorithm achieves optimality as the confidence level tends to zero.
arXiv Detail & Related papers (2024-09-27T16:46:02Z) - Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
arXiv Detail & Related papers (2024-07-24T09:23:22Z) - Improving Reward-Conditioned Policies for Multi-Armed Bandits using Normalized Weight Functions [8.90692770076582]
Recently proposed reward-conditioned policies (RCPs) offer an appealing alternative in reinforcement learning.
We show that RCPs are slower to converge and have inferior expected rewards at convergence, compared with classic methods.
We refer to this technique as generalized marginalization, whose advantage is that negative weights for policies conditioned on low rewards can make the resulting policies more distinct from them.
arXiv Detail & Related papers (2024-06-16T03:43:55Z) - Anti-Exploration by Random Network Distillation [63.04360288089277]
We show that a naive choice of conditioning for the Random Network Distillation (RND) is not discriminative enough to be used as an uncertainty estimator.
We show that this limitation can be avoided with conditioning based on Feature-wise Linear Modulation (FiLM)
We evaluate it on the D4RL benchmark, showing that it is capable of achieving performance comparable to ensemble-based methods and outperforming ensemble-free approaches by a wide margin.
arXiv Detail & Related papers (2023-01-31T13:18:33Z) - Contextual bandits with concave rewards, and an application to fair
ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR)
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z) - Versatile Dueling Bandits: Best-of-both-World Analyses for Online
Learning from Preferences [28.79598714109439]
We study the problem of $K$-armed dueling bandit for both and adversarial environments.
We first propose a novel reduction from any (general) dueling bandits to multi-armed bandits.
Our algorithm is also the first to achieve an optimal $O(sum_i = 1K fraclog TDelta_i)$ regret bound against the Condorcet-winner benchmark.
arXiv Detail & Related papers (2022-02-14T13:37:23Z) - Robust Upper Bounds for Adversarial Training [4.971729553254843]
We introduce a new approach to adversarial training by minimizing an upper bound of the adversarial loss.
This bound is based on a holistic expansion of the network instead of separate bounds for each layer.
We derive two new methods with the proposed approach.
arXiv Detail & Related papers (2021-12-17T01:52:35Z) - Instance-Dependent Complexity of Contextual Bandits and Reinforcement
Learning: A Disagreement-Based Perspective [104.67295710363679]
In the classical multi-armed bandit problem, instance-dependent algorithms attain improved performance on "easy" problems with a gap between the best and second-best arm.
We introduce a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds.
We then introduce new oracle-efficient algorithms which adapt to the gap whenever possible, while also attaining the minimax rate in the worst case.
arXiv Detail & Related papers (2020-10-07T01:33:06Z) - Corruption-robust exploration in episodic reinforcement learning [76.19192549843727]
We study multi-stage episodic reinforcement learning under adversarial corruptions in both the rewards and the transition probabilities of the underlying system.
Our framework yields efficient algorithms which attain near-optimal regret in the absence of corruptions.
Notably, our work provides the first sublinear regret guarantee which any deviation from purely i.i.d. transitions in the bandit-feedback model for episodic reinforcement learning.
arXiv Detail & Related papers (2019-11-20T03:49:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.