Bandits with Partially Observable Confounded Data
- URL: http://arxiv.org/abs/2006.06731v2
- Date: Tue, 10 Aug 2021 12:16:21 GMT
- Title: Bandits with Partially Observable Confounded Data
- Authors: Guy Tennenholtz, Uri Shalit, Shie Mannor, Yonathan Efroni
- Abstract summary: We show that this problem is closely related to a variant of the bandit problem with side information.
We construct a linear bandit algorithm that takes advantage of the projected information, and prove regret bounds.
Our results indicate that confounded offline data can significantly improve online learning algorithms.
- Score: 74.04376842070624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study linear contextual bandits with access to a large, confounded,
offline dataset that was sampled from some fixed policy. We show that this
problem is closely related to a variant of the bandit problem with side
information. We construct a linear bandit algorithm that takes advantage of the
projected information, and prove regret bounds. Our results demonstrate the
ability to take advantage of confounded offline data. Particularly, we prove
regret bounds that improve current bounds by a factor related to the visible
dimensionality of the contexts in the data. Our results indicate that
confounded offline data can significantly improve online learning algorithms.
Finally, we demonstrate various characteristics of our approach through
synthetic simulations.
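To make the setting concrete, below is a minimal Python sketch, under assumed simplifications, of a LinUCB-style learner that exploits partially observed offline data: a ridge estimate of the visible part of the reward parameter is fit from offline logs that reveal only a projection of the context, and the online learner then only explores over the hidden coordinates. This is not the authors' algorithm; the data model, dimensions, and all variable names are assumptions for illustration, and the sketch ignores the bias that confounding can introduce into the offline estimate, which is precisely what the paper's analysis accounts for.
```python
# A minimal sketch (assumptions noted in the text above): LinUCB with an
# offline-estimated "visible" component and online exploration over the
# hidden coordinates only. Not the authors' algorithm.
import numpy as np

rng = np.random.default_rng(0)
d, d_vis, n_arms = 10, 7, 20             # context dim, visible dim, arms (assumed)
n_offline, horizon = 5000, 2000
theta = rng.normal(size=d) / np.sqrt(d)  # unknown reward parameter

def sample_contexts():
    """One d-dimensional feature vector per arm."""
    return rng.normal(size=(n_arms, d))

# Offline phase: logs from a fixed (uniform) behaviour policy reveal only the
# first d_vis coordinates of the context; fit a ridge estimate of that part.
# (This ignores confounding bias, which the paper's method is built to handle.)
A_off, b_off = np.eye(d_vis), np.zeros(d_vis)
for _ in range(n_offline):
    x = sample_contexts()[rng.integers(n_arms)]
    r = x @ theta + 0.1 * rng.normal()
    A_off += np.outer(x[:d_vis], x[:d_vis])
    b_off += r * x[:d_vis]
theta_vis_hat = np.linalg.solve(A_off, b_off)

# Online phase: LinUCB over the hidden coordinates, treating the offline
# estimate of the visible part as fixed side information.
d_hid = d - d_vis
A, b, alpha, regret = np.eye(d_hid), np.zeros(d_hid), 1.0, 0.0
for t in range(horizon):
    X = sample_contexts()
    X_hid = X[:, d_vis:]
    theta_hid_hat = np.linalg.solve(A, b)
    A_inv = np.linalg.inv(A)
    width = np.sqrt(np.einsum("ij,jk,ik->i", X_hid, A_inv, X_hid))
    ucb = X[:, :d_vis] @ theta_vis_hat + X_hid @ theta_hid_hat + alpha * width
    a = int(np.argmax(ucb))
    r = X[a] @ theta + 0.1 * rng.normal()
    A += np.outer(X_hid[a], X_hid[a])
    b += (r - X[a, :d_vis] @ theta_vis_hat) * X_hid[a]
    regret += float(np.max(X @ theta) - X[a] @ theta)

print(f"cumulative pseudo-regret over {horizon} rounds: {regret:.1f}")
```
The intuition this sketch is meant to convey is that exploration is confined to the lower-dimensional hidden subspace, so the confidence widths (and hence the regret) depend on the hidden rather than the full dimension, which matches the abstract's claim of an improvement by a factor related to the visible dimensionality.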
Related papers
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
arXiv Detail & Related papers (2024-07-24T09:23:22Z)
- Understanding the performance gap between online and offline alignment algorithms [63.137832242488926]
We show that offline algorithms train the policy to become good at pairwise classification, while online algorithms are good at generation.
This hints at a unique interplay between discriminative and generative capabilities, which is greatly impacted by the sampling process.
Our study sheds light on the pivotal role of on-policy sampling in AI alignment, and hints at certain fundamental challenges of offline alignment algorithms.
arXiv Detail & Related papers (2024-05-14T09:12:30Z) - Simple Ingredients for Offline Reinforcement Learning [86.1988266277766]
offline reinforcement learning algorithms have proven effective on datasets highly connected to the target downstream task.
We show that existing methods struggle with diverse data: their performance considerably deteriorates as data collected for related but different tasks is simply added to the offline buffer.
We show that scale, more than algorithmic considerations, is the key factor influencing performance.
arXiv Detail & Related papers (2024-03-19T18:57:53Z)
- Bridging Imitation and Online Reinforcement Learning: An Optimistic Tale [27.02990488317357]
Given an offline demonstration dataset from an imperfect expert, what is the best way to leverage it to bootstrap online learning performance in MDPs?
We first propose an Informed Posterior Sampling-based RL (iPSRL) algorithm that uses both the offline dataset and information about the expert's behavioral policy that generated it.
Since this algorithm is computationally impractical, we then propose the iRLSVI algorithm that can be seen as a combination of the RLSVI algorithm for online RL, and imitation learning.
arXiv Detail & Related papers (2023-03-20T18:16:25Z)
- Efficient Online Reinforcement Learning with Offline Data [78.92501185886569]
We show that we can simply apply existing off-policy methods to leverage offline data when learning online.
We extensively ablate these design choices, demonstrating the key factors that most affect performance.
We see that correct application of these simple recommendations can provide a $\mathbf{2.5\times}$ improvement over existing approaches.
arXiv Detail & Related papers (2023-02-06T17:30:22Z)
- Artificial Replay: A Meta-Algorithm for Harnessing Historical Data in Bandits [34.42192958753171]
We propose Artificial-Replay, a meta-algorithm for incorporating historical data into arbitrary base bandit algorithms.
We show that Artificial-Replay uses only a fraction of the historical data compared to a full warm-start approach (a rough sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-09-30T18:03:31Z)
- Shuffled linear regression through graduated convex relaxation [12.614901374282868]
The shuffled linear regression problem aims to recover linear relationships in datasets where the correspondence between input and output is unknown.
This problem arises in a wide range of applications, including the analysis of survey data.
We propose a novel optimization algorithm for shuffled linear regression based on a posterior-maximizing objective function.
arXiv Detail & Related papers (2022-09-30T17:33:48Z)
- Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate observational data collected offline, which is often abundant in practice, to improve sample efficiency in the online setting.
We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)
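As referenced in the Artificial Replay entry above, the following is a rough, hedged Python sketch of the general idea suggested by that summary (not the paper's pseudocode): a logged sample is consumed only when the base bandit algorithm itself proposes the corresponding arm, so the learner never ingests more history than it would have collected on its own. The UCB1 base learner, the Bernoulli reward model, and all names are assumptions for illustration.
```python
# A rough sketch of the "use only the history you need" idea from the
# Artificial Replay entry above; the UCB1 base learner, reward model, and
# names are assumptions, and this is not the paper's exact pseudocode.
import math
import random

class UCB1:
    """Standard UCB1 base bandit over n_arms arms."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms
        self.t = 0

    def propose(self):
        for a, c in enumerate(self.counts):
            if c == 0:
                return a                      # play each arm once first
        return max(range(len(self.counts)),
                   key=lambda a: self.sums[a] / self.counts[a]
                   + math.sqrt(2 * math.log(self.t) / self.counts[a]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward
        self.t += 1

def run_with_history(base, history, pull, horizon):
    """history: arm -> list of logged rewards; pull(arm) performs a real pull."""
    unused = {a: list(rs) for a, rs in history.items()}
    real_pulls = 0
    while real_pulls < horizon:
        arm = base.propose()
        if unused.get(arm):                   # replay a logged sample if one exists
            base.update(arm, unused[arm].pop())
        else:                                 # otherwise spend a real interaction
            base.update(arm, pull(arm))
            real_pulls += 1
    return base

# Hypothetical usage: three Bernoulli arms and a small logged dataset.
means = [0.3, 0.5, 0.7]
history = {0: [0.0, 0.0, 1.0], 1: [1.0, 0.0]}
learner = run_with_history(UCB1(len(means)), history,
                           pull=lambda a: float(random.random() < means[a]),
                           horizon=500)
print("empirical means:", [s / max(c, 1) for s, c in zip(learner.sums, learner.counts)])
```
In this sketch a full warm start would push every logged sample into the learner up front; the wrapper instead touches only the portion of the log that the base algorithm actually asks for, which is the property highlighted in the entry above.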
This list is automatically generated from the titles and abstracts of the papers in this site.