Lifelong Learning in Multi-Armed Bandits
- URL: http://arxiv.org/abs/2012.14264v1
- Date: Mon, 28 Dec 2020 15:13:31 GMT
- Title: Lifelong Learning in Multi-Armed Bandits
- Authors: Matthieu Jedor, Jonathan Louëdec, Vianney Perchet
- Abstract summary: We study the problem in the multi-armed bandit framework with the objective to minimize the total regret incurred over a series of tasks.
While most bandit algorithms are designed to have a low worst-case regret, we examine here the average regret over bandit instances drawn from some prior distribution which may change over time.
- Score: 22.301793734117805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Continuously learning and leveraging the knowledge accumulated from prior
tasks in order to improve future performance is a long standing machine
learning problem. In this paper, we study the problem in the multi-armed bandit
framework with the objective to minimize the total regret incurred over a
series of tasks. While most bandit algorithms are designed to have a low
worst-case regret, we examine here the average regret over bandit instances
drawn from some prior distribution which may change over time. We specifically
focus on confidence interval tuning of UCB algorithms. We propose a bandit over
bandit approach with greedy algorithms and we perform extensive experimental
evaluations in both stationary and non-stationary environments. We further
apply our solution to the mortal bandit problem, showing empirical improvement
over previous work.
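The abstract names two concrete ingredients, confidence-interval tuning of UCB and a greedy bandit-over-bandit layer, without giving details here. The following is a minimal sketch of how those ingredients could fit together, not the authors' exact algorithm: it assumes Bernoulli-reward tasks drawn from a Beta prior, a single scalar confidence-width parameter per UCB run, and an epsilon-greedy meta-level that reuses the width with the lowest average regret on earlier tasks; all function names and parameter values are illustrative.

```python
import numpy as np

def run_ucb(means, horizon, width, rng):
    """Run UCB with a tunable confidence width on one bandit task;
    return the expected (pseudo-)regret accumulated over the horizon."""
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    regret = 0.0
    for t in range(horizon):
        if t < k:
            arm = t  # pull each arm once to initialise
        else:
            ucb = sums / counts + width * np.sqrt(np.log(t + 1) / counts)
            arm = int(np.argmax(ucb))
        reward = rng.binomial(1, means[arm])
        counts[arm] += 1
        sums[arm] += reward
        regret += means.max() - means[arm]
    return regret

def lifelong_greedy_tuning(task_sampler, n_tasks, horizon, widths, eps=0.1, seed=0):
    """Bandit-over-bandit layer: choose the confidence width for each new task
    (mostly) greedily, based on the average regret each width produced so far."""
    rng = np.random.default_rng(seed)
    totals = np.zeros(len(widths))  # cumulative regret per candidate width
    pulls = np.zeros(len(widths))   # number of tasks each width was used on
    history = []
    for task in range(n_tasks):
        if task < len(widths):
            j = task                            # try every candidate width once
        elif rng.random() < eps:
            j = int(rng.integers(len(widths)))  # occasional exploration
        else:
            j = int(np.argmin(totals / pulls))  # greedy: lowest average regret so far
        means = task_sampler(rng)
        r = run_ucb(means, horizon, widths[j], rng)
        totals[j] += r
        pulls[j] += 1
        history.append((widths[j], r))
    return history

# Example prior: 5-armed Bernoulli tasks with means drawn from Beta(2, 5).
sample_task = lambda rng: rng.beta(2, 5, size=5)
hist = lifelong_greedy_tuning(sample_task, n_tasks=200, horizon=1000,
                              widths=[0.1, 0.5, 1.0, 2.0])
```

A non-stationary prior can be modelled by letting task_sampler depend on the task index, which is where the exploration rate of the meta-level starts to matter.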
Related papers
- Contextual Bandits with Smooth Regret: Efficient Learning in Continuous
Action Spaces [14.366265951396587]
We design efficient general-purpose contextual bandit algorithms for large -- or even continuous -- action spaces.
We propose a smooth regret notion for contextual bandits, which dominates previously proposed alternatives.
Our algorithms can be used to recover the previous minimax/Pareto optimal guarantees under the standard regret.
arXiv Detail & Related papers (2022-07-12T21:27:09Z) - Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When
Partial Feedback Counts [53.579515853222986]
We study a novel bandit setting, namely Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB)
This setting is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite-time span after the pull.
We provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and TP-UCB-EW.
arXiv Detail & Related papers (2022-06-01T15:56:59Z) - Slowly Changing Adversarial Bandit Algorithms are Efficient for
Discounted MDPs [10.68896183306706]
Reinforcement learning generalizes multi-armed bandit problems with additional difficulties of a longer planning horizon and unknown transition kernel.
We show that any slowly changing adversarial bandit algorithm achieving optimal regret in the adversarial bandit setting can also attain optimal expected regret in infinite-horizon discounted Markov decision processes.
arXiv Detail & Related papers (2022-05-18T16:40:30Z) - PAC-Bayesian Lifelong Learning For Multi-Armed Bandits [38.76324445090305]
We present a PAC-Bayesian analysis of lifelong learning.
We consider the case when each learning task is a multi-armed bandit problem.
We propose lifelong learning algorithms that use our new bounds as learning objectives.
arXiv Detail & Related papers (2022-03-07T11:23:12Z) - On Slowly-varying Non-stationary Bandits [25.305949034527202]
We consider dynamic regret in non-stationary bandits with a slowly varying property.
We establish the first instance-dependent regret upper bound for slowly varying non-stationary bandits.
We show that our algorithm is essentially minimax optimal.
arXiv Detail & Related papers (2021-10-25T12:56:19Z) - Confidence-Budget Matching for Sequential Budgeted Learning [69.77435313099366]
We formalize decision-making problems with a querying budget.
We consider multi-armed bandits, linear bandits, and reinforcement learning problems.
We show that CBM-based algorithms perform well in the presence of adversity.
arXiv Detail & Related papers (2021-02-05T19:56:31Z) - A Regret bound for Non-stationary Multi-Armed Bandits with Fairness
Constraints [7.716156977428555]
We present a new algorithm called Fair Upper Confidence Bound with Exploration (Fair-UCBe) for solving a slowly varying $k$-armed bandit problem.
We show that the performance of our algorithm in the non-stationary case approaches that of its stationary counterpart as the degree of non-stationarity tends to zero.
arXiv Detail & Related papers (2020-12-24T18:12:01Z) - Upper Confidence Bounds for Combining Stochastic Bandits [52.10197476419621]
We provide a simple method to combine bandit algorithms.
Our approach is based on a "meta-UCB" procedure that treats each of the $N$ individual bandit algorithms as an arm in a higher-level $N$-armed bandit problem (a minimal sketch of this idea follows this list).
arXiv Detail & Related papers (2020-12-24T05:36:29Z) - Instance-Dependent Complexity of Contextual Bandits and Reinforcement
Learning: A Disagreement-Based Perspective [104.67295710363679]
In the classical multi-armed bandit problem, instance-dependent algorithms attain improved performance on "easy" problems with a gap between the best and second-best arm.
We introduce a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds.
We then introduce new oracle-efficient algorithms which adapt to the gap whenever possible, while also attaining the minimax rate in the worst case.
arXiv Detail & Related papers (2020-10-07T01:33:06Z) - Corralling Stochastic Bandit Algorithms [54.10645564702416]
We show that the regret of the corralling algorithms is no worse than that of the best algorithm containing the arm with the highest reward.
This guarantee depends on the gap between the highest reward and the other rewards.
arXiv Detail & Related papers (2020-06-16T15:33:12Z) - Meta-learning with Stochastic Linear Bandits [120.43000970418939]
We consider a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a squared Euclidean distance to a bias vector (a sketch of this biased regularization follows this list).
We show both theoretically and experimentally, that when the number of tasks grows and the variance of the task-distribution is small, our strategies have a significant advantage over learning the tasks in isolation.
arXiv Detail & Related papers (2020-05-18T08:41:39Z)
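Regarding the "Upper Confidence Bounds for Combining Stochastic Bandits" entry above, the meta-UCB idea can be sketched as follows. This is a simplified illustration under an assumed base-algorithm interface exposing select() and update(); the paper itself handles subtleties (such as base algorithms whose statistics go stale while they are not played) that this sketch ignores.

```python
import numpy as np

class MetaUCB:
    """Treat N base bandit algorithms as the arms of a higher-level UCB."""

    def __init__(self, base_algorithms):
        self.base = base_algorithms   # assumed to expose select() / update()
        n = len(base_algorithms)
        self.counts = np.zeros(n)     # times each base algorithm was played
        self.sums = np.zeros(n)       # rewards collected while it was playing
        self.t = 0
        self.current = 0

    def select(self):
        self.t += 1
        if self.t <= len(self.base):
            self.current = self.t - 1   # play each base algorithm once
        else:
            ucb = self.sums / self.counts + np.sqrt(2 * np.log(self.t) / self.counts)
            self.current = int(np.argmax(ucb))
        # the chosen base algorithm decides which actual arm to pull
        return self.base[self.current].select()

    def update(self, arm, reward):
        self.counts[self.current] += 1
        self.sums[self.current] += reward
        # only the base algorithm that acted observes the feedback
        self.base[self.current].update(arm, reward)
```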
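For the "Meta-learning with Stochastic Linear Bandits" entry, the regularization toward a bias vector amounts, in the least-squares step, to ridge regression shrunk toward that vector instead of toward zero. A minimal sketch of just that estimate is below; the paper's confidence ellipsoids, the way the bias is learned across tasks (for instance by averaging past estimates), and the regret analysis are not reproduced, and the function name is illustrative.

```python
import numpy as np

def biased_ridge_estimate(X, y, bias, lam=1.0):
    """Solve argmin_theta ||y - X @ theta||^2 + lam * ||theta - bias||^2.

    With bias = 0 this is ordinary ridge regression, as used inside OFUL;
    a well-chosen bias (close to the mean parameter of the task distribution)
    shrinks the estimate toward behaviour learned from earlier tasks.
    """
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    b = X.T @ y + lam * bias
    return np.linalg.solve(A, b)
```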
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.