Hyper-parameter Tuning for the Contextual Bandit
- URL: http://arxiv.org/abs/2005.02209v1
- Date: Mon, 4 May 2020 17:20:19 GMT
- Title: Hyper-parameter Tuning for the Contextual Bandit
- Authors: Djallel Bouneffouf and Emmanuelle Claeys
- Abstract summary: We study the problem of learning the exploration exploitation trade-off in the contextual bandit problem with linear reward function setting.
Our proposed algorithm learn to choose the right exploration parameters in an online manner based on the observed context.
We have presented here two algorithms that uses a bandit to find the optimal exploration of the contextual bandit algorithm.
- Score: 22.721128745617076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study here the problem of learning the exploration exploitation trade-off
in the contextual bandit problem with linear reward function setting. In the
traditional algorithms that solve the contextual bandit problem, the
exploration is a parameter that is tuned by the user. However, our proposed
algorithm learn to choose the right exploration parameters in an online manner
based on the observed context, and the immediate reward received for the chosen
action. We have presented here two algorithms that uses a bandit to find the
optimal exploration of the contextual bandit algorithm, which we hope is the
first step toward the automation of the multi-armed bandit algorithm.
Related papers
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
arXiv Detail & Related papers (2024-07-24T09:23:22Z) - Multi-Player Approaches for Dueling Bandits [58.442742345319225]
We show that the direct use of a Follow Your Leader black-box approach matches the lower bound for this setting.
We also analyze a message-passing fully distributed approach with a novel Condorcet-winner recommendation protocol.
arXiv Detail & Related papers (2024-05-25T10:25:48Z) - Feel-Good Thompson Sampling for Contextual Dueling Bandits [49.450050682705026]
We propose a Thompson sampling algorithm, named FGTS.CDB, for linear contextual dueling bandits.
At the core of our algorithm is a new Feel-Good exploration term specifically tailored for dueling bandits.
Our algorithm achieves nearly minimax-optimal regret, i.e., $tildemathcalO(dsqrt T)$, where $d$ is the model dimension and $T$ is the time horizon.
arXiv Detail & Related papers (2024-04-09T04:45:18Z) - Thompson Sampling with Virtual Helping Agents [0.0]
We address the problem of online sequential decision making, i.e., balancing the trade-off between exploiting current knowledge to maximize immediate performance and exploring the new information to gain long-term benefits.
We propose two algorithms for the multi-armed bandit problem and provide theoretical bounds on the cumulative regret.
arXiv Detail & Related papers (2022-09-16T23:34:44Z) - Syndicated Bandits: A Framework for Auto Tuning Hyper-parameters in
Contextual Bandit Algorithms [74.55200180156906]
The contextual bandit problem models the trade-off between exploration and exploitation.
We show our Syndicated Bandits framework can achieve the optimal regret upper bounds.
arXiv Detail & Related papers (2021-06-05T22:30:21Z) - Upper Confidence Bounds for Combining Stochastic Bandits [52.10197476419621]
We provide a simple method to combine bandit algorithms.
Our approach is based on a "meta-UCB" procedure that treats each of $N$ individual bandit algorithms as arms in a higher-level $N$-armed bandit problem.
arXiv Detail & Related papers (2020-12-24T05:36:29Z) - Corralling Stochastic Bandit Algorithms [54.10645564702416]
We show that the regret of the corralling algorithms is no worse than that of the best algorithm containing the arm with the highest reward.
We show that the gap between the highest reward and other rewards depends on the gap between the highest reward and other rewards.
arXiv Detail & Related papers (2020-06-16T15:33:12Z) - Bandit algorithms to emulate human decision making using probabilistic
distortions [20.422725678982726]
We formulate two sample multi-armed bandit problems with distorted probabilities on the reward distributions.
We consider the aforementioned problems in the regret minimization as well as best arm identification framework for multi-armed bandits.
arXiv Detail & Related papers (2016-11-30T17:37:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.