Lifelong Learning in Multi-Armed Bandits
- URL: http://arxiv.org/abs/2012.14264v1
- Date: Mon, 28 Dec 2020 15:13:31 GMT
- Title: Lifelong Learning in Multi-Armed Bandits
- Authors: Matthieu Jedor, Jonathan Louëdec, Vianney Perchet
- Abstract summary: We study the problem in the multi-armed bandit framework with the objective to minimize the total regret incurred over a series of tasks.
While most bandit algorithms are designed to have a low worst-case regret, we examine here the average regret over bandit instances drawn from some prior distribution which may change over time.
- Score: 22.301793734117805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Continuously learning and leveraging the knowledge accumulated from prior
tasks in order to improve future performance is a long standing machine
learning problem. In this paper, we study the problem in the multi-armed bandit
framework with the objective to minimize the total regret incurred over a
series of tasks. While most bandit algorithms are designed to have a low
worst-case regret, we examine here the average regret over bandit instances
drawn from some prior distribution which may change over time. We specifically
focus on confidence interval tuning of UCB algorithms. We propose a bandit over
bandit approach with greedy algorithms and we perform extensive experimental
evaluations in both stationary and non-stationary environments. We further
apply our solution to the mortal bandit problem, showing empirical improvement
over previous work.
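The abstract names two concrete ingredients, confidence-interval tuning of UCB and a greedy bandit-over-bandit layer, without giving details here. The following is a minimal sketch of how those ingredients could fit together, not the authors' exact algorithm: it assumes Bernoulli-reward tasks drawn from a Beta prior, a single scalar confidence-width parameter per UCB run, and an epsilon-greedy meta-level that reuses the width with the lowest average regret on earlier tasks; all function names and parameter values are illustrative.

```python
import numpy as np

def run_ucb(means, horizon, width, rng):
    """Run UCB with a tunable confidence width on one bandit task;
    return the expected (pseudo-)regret accumulated over the horizon."""
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    regret = 0.0
    for t in range(horizon):
        if t < k:
            arm = t  # pull each arm once to initialise
        else:
            ucb = sums / counts + width * np.sqrt(np.log(t + 1) / counts)
            arm = int(np.argmax(ucb))
        reward = rng.binomial(1, means[arm])
        counts[arm] += 1
        sums[arm] += reward
        regret += means.max() - means[arm]
    return regret

def lifelong_greedy_tuning(task_sampler, n_tasks, horizon, widths, eps=0.1, seed=0):
    """Bandit-over-bandit layer: choose the confidence width for each new task
    (mostly) greedily, based on the average regret each width produced so far."""
    rng = np.random.default_rng(seed)
    totals = np.zeros(len(widths))  # cumulative regret per candidate width
    pulls = np.zeros(len(widths))   # number of tasks each width was used on
    history = []
    for task in range(n_tasks):
        if task < len(widths):
            j = task                            # try every candidate width once
        elif rng.random() < eps:
            j = int(rng.integers(len(widths)))  # occasional exploration
        else:
            j = int(np.argmin(totals / pulls))  # greedy: lowest average regret so far
        means = task_sampler(rng)
        r = run_ucb(means, horizon, widths[j], rng)
        totals[j] += r
        pulls[j] += 1
        history.append((widths[j], r))
    return history

# Example prior: 5-armed Bernoulli tasks with means drawn from Beta(2, 5).
sample_task = lambda rng: rng.beta(2, 5, size=5)
hist = lifelong_greedy_tuning(sample_task, n_tasks=200, horizon=1000,
                              widths=[0.1, 0.5, 1.0, 2.0])
```

A non-stationary prior can be modelled by letting task_sampler depend on the task index, which is where the exploration rate of the meta-level starts to matter.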
Related papers
- Contextual Bandits with Smooth Regret: Efficient Learning in Continuous
Action Spaces [14.366265951396587]
We design efficient general-purpose contextual bandit algorithms for large -- or even continuous -- action spaces.
We propose a smooth regret notion for contextual bandits, which dominates previously proposed alternatives.
Our algorithms can be used to recover the previous minimax/Pareto optimal guarantees under the standard regret.
arXiv Detail & Related papers (2022-07-12T21:27:09Z) - Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When
Partial Feedback Counts [53.579515853222986]
We study a novel bandit setting, namely Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB)
This setting is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite-time span after the pull.
We provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and TP-UCB-EW.
arXiv Detail & Related papers (2022-06-01T15:56:59Z) - Slowly Changing Adversarial Bandit Algorithms are Efficient for
Discounted MDPs [10.68896183306706]
Reinforcement learning generalizes multi-armed bandit problems with additional difficulties of a longer planning horizon and unknown transition kernel.
We show that any slowly changing adversarial bandit algorithm achieving optimal regret in the adversarial bandit setting can also attain optimal expected regret in infinite-horizon discounted Markov decision processes.
arXiv Detail & Related papers (2022-05-18T16:40:30Z) - PAC-Bayesian Lifelong Learning For Multi-Armed Bandits [38.76324445090305]
We present a PAC-Bayesian analysis of lifelong learning.
We consider the case when each learning task is a multi-armed bandit problem.
We propose lifelong learning algorithms that use our new bounds as learning objectives.
arXiv Detail & Related papers (2022-03-07T11:23:12Z) - On Slowly-varying Non-stationary Bandits [25.305949034527202]
We consider dynamic regret in non-stationary bandits with a slowly varying property.
We establish the first instance-dependent regret upper bound for slowly varying non-stationary bandits.
We show that our algorithm is essentially minimax optimal.
arXiv Detail & Related papers (2021-10-25T12:56:19Z) - Confidence-Budget Matching for Sequential Budgeted Learning [69.77435313099366]
We formalize decision-making problems with a querying budget.
We consider multi-armed bandits, linear bandits, and reinforcement learning problems.
We show that CBM-based algorithms perform well in the presence of adversity.
arXiv Detail & Related papers (2021-02-05T19:56:31Z) - A Regret bound for Non-stationary Multi-Armed Bandits with Fairness
Constraints [7.716156977428555]
We present a new algorithm called Fair Upper Confidence Bound with Exploration (Fair-UCBe) for solving a slowly varying $k$-armed bandit problem.
We show that the performance of our algorithm in the non-stationary case approaches that of its stationary counterpart as the degree of non-stationarity tends to zero.
arXiv Detail & Related papers (2020-12-24T18:12:01Z) - Upper Confidence Bounds for Combining Stochastic Bandits [52.10197476419621]
We provide a simple method to combine bandit algorithms.
Our approach is based on a "meta-UCB" procedure that treats each of the $N$ individual bandit algorithms as an arm in a higher-level $N$-armed bandit problem (a minimal sketch of this idea follows this list).
arXiv Detail & Related papers (2020-12-24T05:36:29Z) - Instance-Dependent Complexity of Contextual Bandits and Reinforcement
Learning: A Disagreement-Based Perspective [104.67295710363679]
In the classical multi-armed bandit problem, instance-dependent algorithms attain improved performance on "easy" problems with a gap between the best and second-best arm.
We introduce a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds.
We then introduce new oracle-efficient algorithms which adapt to the gap whenever possible, while also attaining the minimax rate in the worst case.
arXiv Detail & Related papers (2020-10-07T01:33:06Z) - Corralling Stochastic Bandit Algorithms [54.10645564702416]
We show that the regret of the corralling algorithms is no worse than that of the best algorithm containing the arm with the highest reward.
This guarantee depends on the gap between the highest reward and the other rewards.
arXiv Detail & Related papers (2020-06-16T15:33:12Z) - Meta-learning with Stochastic Linear Bandits [120.43000970418939]
We consider a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a squared Euclidean distance to a bias vector (a sketch of this biased regularization follows this list).
We show both theoretically and experimentally, that when the number of tasks grows and the variance of the task-distribution is small, our strategies have a significant advantage over learning the tasks in isolation.
arXiv Detail & Related papers (2020-05-18T08:41:39Z)
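Regarding the "Upper Confidence Bounds for Combining Stochastic Bandits" entry above, the meta-UCB idea can be sketched as follows. This is a simplified illustration under an assumed base-algorithm interface exposing select() and update(); the paper itself handles subtleties (such as base algorithms whose statistics go stale while they are not played) that this sketch ignores.

```python
import numpy as np

class MetaUCB:
    """Treat N base bandit algorithms as the arms of a higher-level UCB."""

    def __init__(self, base_algorithms):
        self.base = base_algorithms   # assumed to expose select() / update()
        n = len(base_algorithms)
        self.counts = np.zeros(n)     # times each base algorithm was played
        self.sums = np.zeros(n)       # rewards collected while it was playing
        self.t = 0
        self.current = 0

    def select(self):
        self.t += 1
        if self.t <= len(self.base):
            self.current = self.t - 1   # play each base algorithm once
        else:
            ucb = self.sums / self.counts + np.sqrt(2 * np.log(self.t) / self.counts)
            self.current = int(np.argmax(ucb))
        # the chosen base algorithm decides which actual arm to pull
        return self.base[self.current].select()

    def update(self, arm, reward):
        self.counts[self.current] += 1
        self.sums[self.current] += reward
        # only the base algorithm that acted observes the feedback
        self.base[self.current].update(arm, reward)
```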
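For the "Meta-learning with Stochastic Linear Bandits" entry, the regularization toward a bias vector amounts, in the least-squares step, to ridge regression shrunk toward that vector instead of toward zero. A minimal sketch of just that estimate is below; the paper's confidence ellipsoids, the way the bias is learned across tasks (for instance by averaging past estimates), and the regret analysis are not reproduced, and the function name is illustrative.

```python
import numpy as np

def biased_ridge_estimate(X, y, bias, lam=1.0):
    """Solve argmin_theta ||y - X @ theta||^2 + lam * ||theta - bias||^2.

    With bias = 0 this is ordinary ridge regression, as used inside OFUL;
    a well-chosen bias (close to the mean parameter of the task distribution)
    shrinks the estimate toward behaviour learned from earlier tasks.
    """
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    b = X.T @ y + lam * bias
    return np.linalg.solve(A, b)
```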
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.