Optimal Best-Arm Identification in Bandits with Access to Offline Data
- URL: http://arxiv.org/abs/2306.09048v1
- Date: Thu, 15 Jun 2023 11:12:35 GMT
- Title: Optimal Best-Arm Identification in Bandits with Access to Offline Data
- Authors: Shubhada Agrawal, Sandeep Juneja, Karthikeyan Shanmugam, Arun Sai
Suggala
- Abstract summary: We consider combining offline data with online learning, an area less studied but of obvious practical importance.
We develop algorithms that match the lower bound on sample complexity when $\delta$ is small.
Our algorithms are computationally efficient with an average per-sample acquisition cost of $\tilde{O}(K)$, and rely on a careful characterization of the optimality conditions of the lower bound problem.
- Score: 27.365122983434887
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning paradigms based purely on offline data as well as those based solely
on sequential online learning have been well-studied in the literature. In this
paper, we consider combining offline data with online learning, an area less
studied but of obvious practical importance. We consider the stochastic
$K$-armed bandit problem, where our goal is to identify the arm with the
highest mean in the presence of relevant offline data, with confidence
$1-\delta$. We conduct a lower bound analysis on policies that provide such
$1-\delta$ probabilistic correctness guarantees. We develop algorithms that
match the lower bound on sample complexity when $\delta$ is small. Our
algorithms are computationally efficient with an average per-sample acquisition
cost of $\tilde{O}(K)$, and rely on a careful characterization of the
optimality conditions of the lower bound problem.
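To make the setting concrete, the following is a minimal, hypothetical sketch of how offline data can warm-start online best-arm identification: offline samples seed the per-arm counts and means, online pulls refine them, and a Gaussian generalized-likelihood-ratio statistic compared to a $\log(1/\delta)$-scale threshold decides when to stop. The function name best_arm_with_offline_data, the pull_arm interface, the heuristic sampling rule, and the threshold are assumptions made for illustration; this is not the paper's lower-bound-matching algorithm, whose sampling allocation is derived from the optimality conditions of the lower-bound problem.

```python
import numpy as np

def best_arm_with_offline_data(pull_arm, offline_samples, delta, max_pulls=100_000):
    """Toy best-arm identification with offline data (illustrative, assumes K >= 2).

    pull_arm(k) draws one fresh online sample from arm k; offline_samples[k]
    holds the offline observations for arm k. Offline data only seeds the
    per-arm counts and sums; the sampling rule and stopping threshold below
    are heuristic stand-ins, not the paper's lower-bound-matching choices.
    """
    K = len(offline_samples)
    counts = np.array([float(len(s)) for s in offline_samples])
    sums = np.array([float(np.sum(s)) for s in offline_samples])
    for k in range(K):                      # ensure every arm has at least one sample
        if counts[k] == 0:
            sums[k], counts[k] = pull_arm(k), 1.0

    for _ in range(max_pulls):
        means = sums / counts
        best = int(np.argmax(means))
        # Gaussian GLR statistic against every challenger: O(K) work per round.
        stats = [0.5 * (means[best] - means[k]) ** 2
                 * counts[best] * counts[k] / (counts[best] + counts[k])
                 for k in range(K) if k != best]
        threshold = np.log((1.0 + np.log(counts.sum())) / delta)
        if min(stats) > threshold:          # stop: confident at level ~ 1 - delta
            return best
        # Heuristic sampling rule: balance the leader and its closest challenger.
        challenger = min((k for k in range(K) if k != best),
                         key=lambda k: means[best] - means[k])
        arm = best if counts[best] <= counts[challenger] else challenger
        sums[arm] += pull_arm(arm)
        counts[arm] += 1.0
    return int(np.argmax(sums / counts))

# Example: three Gaussian arms, 20 offline samples each, target confidence 0.99.
rng = np.random.default_rng(0)
mus = [0.3, 0.5, 0.7]
offline = [rng.normal(m, 1.0, size=20) for m in mus]
print(best_arm_with_offline_data(lambda k: rng.normal(mus[k], 1.0), offline, delta=0.01))
```

In this toy version, each round recomputes the GLR statistic against every challenger, so the per-sample work is $O(K)$; the paper's $\tilde{O}(K)$ per-sample acquisition cost refers to its own, more careful procedure.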
Related papers
- Stabilizing Linear Passive-Aggressive Online Learning with Weighted Reservoir Sampling [46.01254613933967]
Online learning methods are still highly effective for high-dimensional streaming data, out-of-core processing, and other throughput-sensitive applications.
Many such algorithms rely on fast adaptation to individual errors as a key to their convergence.
While such algorithms enjoy low theoretical regret, in real-world deployment they can be sensitive to individual outliers that cause the algorithm to over-correct.
arXiv Detail & Related papers (2024-10-31T03:35:48Z)
- Leveraging Offline Data in Linear Latent Bandits [16.006405951752903]
We show that $\textit{every}$ exchangeable and coherent stateless decision process is a latent bandit.
We present a principled method to learn this subspace from short offline trajectories with guarantees.
We provide two methods to leverage this subspace online: LOCAL-UCB and ProBALL-UCB.
arXiv Detail & Related papers (2024-05-27T16:23:34Z)
- Online non-parametric likelihood-ratio estimation by Pearson-divergence functional minimization [55.98760097296213]
We introduce a new framework for online non-parametric LRE (OLRE) for the setting where pairs of i.i.d. observations $(x_t \sim p, x'_t \sim q)$ are observed over time.
We provide theoretical guarantees for the performance of the OLRE method along with empirical validation in synthetic experiments (a minimal illustrative sketch of this style of estimator appears after this list).
arXiv Detail & Related papers (2023-11-03T13:20:11Z)
- Efficient Online Learning with Offline Datasets for Infinite Horizon MDPs: A Bayesian Approach [25.77911741149966]
We show that if the learning agent models the behavioral policy used by the expert, it can do substantially better in terms of minimizing cumulative regret.
We then propose the Informed RLSVI algorithm to efficiently approximate the iPSRL algorithm.
arXiv Detail & Related papers (2023-10-17T19:01:08Z)
- Importance Weighted Actor-Critic for Optimal Conservative Offline Reinforcement Learning [23.222448307481073]
We propose a new practical algorithm for offline reinforcement learning (RL) in complex environments with insufficient data coverage.
Our algorithm combines the marginalized importance sampling framework with the actor-critic paradigm.
We provide both theoretical analysis and experimental results to validate the effectiveness of our proposed algorithm.
arXiv Detail & Related papers (2023-01-30T07:53:53Z)
- On Instance-Dependent Bounds for Offline Reinforcement Learning with Linear Function Approximation [80.86358123230757]
We present an algorithm called Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI).
Under a partial data coverage assumption, BCP-VI yields a fast rate of $\tilde{\mathcal{O}}(\frac{1}{K})$ for offline RL when there is a positive gap in the optimal Q-value functions.
These are the first $\tilde{\mathcal{O}}(\frac{1}{K})$ bound and absolute zero sub-optimality bound respectively for offline RL with linear function approximation from adaptive data.
arXiv Detail & Related papers (2022-11-23T18:50:44Z)
- Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm.
P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z)
- Smoothed Online Learning is as Easy as Statistical Learning [77.00766067963195]
We provide the first oracle-efficient, no-regret algorithms in this setting.
We show that if a function class is learnable in the classical setting, then there is an oracle-efficient, no-regret algorithm for contextual bandits.
arXiv Detail & Related papers (2022-02-09T19:22:34Z)
- Online Adversarial Attacks [57.448101834579624]
We formalize the online adversarial attack problem, emphasizing two key elements found in real-world use-cases.
We first rigorously analyze a deterministic variant of the online threat model.
We then propose algoname, a simple yet practical algorithm yielding a provably better competitive ratio for $k=2$ over the current best single threshold algorithm.
arXiv Detail & Related papers (2021-03-02T20:36:04Z)
- Online Apprenticeship Learning [58.45089581278177]
In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function.
The goal is to find a policy that matches the expert's performance on some predefined set of cost functions.
We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms.
arXiv Detail & Related papers (2021-02-13T12:57:51Z)
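As referenced in the online likelihood-ratio estimation entry above, the Pearson-divergence approach fits a ratio model $r(x) \approx p(x)/q(x)$ by minimizing $\frac{1}{2}\mathbb{E}_q[r(x)^2] - \mathbb{E}_p[r(x)]$, whose minimizer is the true ratio $p/q$. Below is a minimal, hypothetical sketch of a streaming estimator in that spirit: one stochastic gradient step per observed pair $(x_t \sim p, x'_t \sim q)$, with random Fourier features standing in for a non-parametric model. The class name OnlinePearsonLRE, the feature map, the step size, and the regularizer are illustrative assumptions, not the OLRE paper's construction or guarantees.

```python
import numpy as np

class OnlinePearsonLRE:
    """Streaming likelihood-ratio estimation sketch (illustrative, not the paper's OLRE).

    Fits r(x) ~ p(x)/q(x) by SGD on the Pearson-divergence objective
    0.5 * E_q[r(x)^2] - E_p[r(x)], one pair (x_t ~ p, x'_t ~ q) per round,
    with random Fourier features approximating a Gaussian kernel.
    """

    def __init__(self, dim, n_features=100, bandwidth=1.0, lr=0.05, reg=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 1.0 / bandwidth, size=(n_features, dim))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        self.theta = np.zeros(n_features)
        self.lr, self.reg = lr, reg

    def _phi(self, x):
        # Random Fourier feature map.
        return np.sqrt(2.0 / len(self.b)) * np.cos(self.W @ x + self.b)

    def update(self, x_p, x_q):
        # Stochastic gradient of 0.5*E_q[r^2] - E_p[r] plus L2 regularization.
        phi_p, phi_q = self._phi(x_p), self._phi(x_q)
        grad = phi_q * (self.theta @ phi_q) - phi_p + self.reg * self.theta
        self.theta -= self.lr * grad

    def ratio(self, x):
        # Estimated p(x)/q(x), clipped at zero since a density ratio is non-negative.
        return max(float(self.theta @ self._phi(x)), 0.0)

# Example stream: p = N(1, 1), q = N(0, 1) in one dimension.
rng = np.random.default_rng(1)
est = OnlinePearsonLRE(dim=1)
for _ in range(5000):
    est.update(rng.normal(1.0, 1.0, size=1), rng.normal(0.0, 1.0, size=1))
print(est.ratio(np.array([1.0])))  # should exceed 1, since p > q at x = 1
```

The entry's theoretical guarantees presumably rest on a kernelized estimator with carefully chosen step sizes; this sketch only conveys the functional being minimized online.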