Contextual Bandits and Optimistically Universal Learning
- URL: http://arxiv.org/abs/2301.00241v1
- Date: Sat, 31 Dec 2022 16:15:28 GMT
- Title: Contextual Bandits and Optimistically Universal Learning
- Authors: Moise Blanchard, Steve Hanneke and Patrick Jaillet
- Abstract summary: We focus on consistency -- vanishing regret compared to the optimal policy.
We show that for large classes of non-i.i.d. contexts, consistency can be achieved regardless of the time-invariant reward mechanism.
- Score: 32.14208422566497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the contextual bandit problem on general action and context
spaces, where the learner's rewards depend on their selected actions and an
observable context. This generalizes the standard multi-armed bandit to the
case where side information is available, e.g., patients' records or customers'
history, which allows for personalized treatment. We focus on consistency --
vanishing regret compared to the optimal policy -- and show that for large
classes of non-i.i.d. contexts, consistency can be achieved regardless of the
time-invariant reward mechanism, a property known as universal consistency.
Precisely, we first give necessary and sufficient conditions on the
context-generating process for universal consistency to be possible. Second, we
show that there always exists an algorithm that guarantees universal
consistency whenever this is achievable, called an optimistically universal
learning rule. Interestingly, for finite action spaces, learnable processes for
universal learning are exactly the same as in the full-feedback setting of
supervised learning, previously studied in the literature. In other words,
learning can be performed with partial feedback without any generalization
cost. The algorithms balance a trade-off between generalization (similar to
structural risk minimization) and personalization (tailoring actions to
specific contexts). Lastly, we consider the case of added continuity
assumptions on rewards and show that these lead to universal consistency for
significantly larger classes of data-generating processes.
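The interaction protocol and the notion of vanishing regret described in the abstract can be made concrete with a small sketch. The Python code below is illustrative only: the finite context and action spaces, the per-context epsilon-greedy learner, and all names (run_contextual_bandit, EpsilonGreedyLearner, reward_fn) are hypothetical stand-ins and not the paper's optimistically universal learning rule; the sketch merely shows the partial-feedback loop and the average-regret quantity that universal consistency requires to vanish.

```python
# Illustrative sketch of the contextual bandit protocol (hypothetical names,
# not the algorithm from the paper).
import random


def run_contextual_bandit(contexts, actions, reward_fn, learner, horizon):
    """Play `horizon` rounds and return average regret vs. the per-context best action."""
    total_reward = 0.0
    total_opt = 0.0
    for t in range(horizon):
        x = contexts(t)                       # context revealed to the learner
        a = learner.select(x, t)              # learner chooses an action
        r = reward_fn(x, a)                   # only the chosen action's reward is observed
        learner.update(x, a, r)
        total_reward += r
        total_opt += max(reward_fn(x, a2) for a2 in actions)  # oracle benchmark
    # Universal consistency asks that this average regret vanish as the horizon
    # grows, for every time-invariant reward mechanism the environment may use.
    return (total_opt - total_reward) / horizon


class EpsilonGreedyLearner:
    """Simple per-context epsilon-greedy learner (finite contexts and actions only)."""

    def __init__(self, actions, eps=0.1):
        self.actions = list(actions)
        self.eps = eps
        self.sums = {}    # (context, action) -> cumulative observed reward
        self.counts = {}  # (context, action) -> number of pulls

    def select(self, x, t):
        if random.random() < self.eps:
            return random.choice(self.actions)     # explore

        def mean(a):
            c = self.counts.get((x, a), 0)
            return self.sums.get((x, a), 0.0) / c if c else float("inf")  # try unseen first

        return max(self.actions, key=mean)          # exploit per-context estimate

    def update(self, x, a, r):
        self.sums[(x, a)] = self.sums.get((x, a), 0.0) + r
        self.counts[(x, a)] = self.counts.get((x, a), 0) + 1


if __name__ == "__main__":
    actions = [0, 1, 2]
    reward = lambda x, a: 1.0 if a == x % 3 else 0.0   # hypothetical reward mechanism
    learner = EpsilonGreedyLearner(actions)
    avg_regret = run_contextual_bandit(lambda t: t % 5, actions, reward, learner, 5000)
    print(f"average regret after 5000 rounds: {avg_regret:.3f}")
```

The per-context tabulation here is what "personalization" refers to in the abstract; the paper's algorithms additionally balance it against generalization across contexts, which this toy learner does not attempt.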
Related papers
- A Similarity Paradigm Through Textual Regularization Without Forgetting [17.251684463032433]
We propose a novel method called Similarity Paradigm with Textual Regularization (SPTR) for prompt learning without forgetting.
SPTR is a two-pronged design built on hand-crafted prompts and functions as a single, inseparable framework.
Four representative tasks across 11 datasets demonstrate that SPTR outperforms existing prompt learning methods.
arXiv Detail & Related papers (2025-02-20T09:06:44Z) - Zero-Shot Generalization during Instruction Tuning: Insights from Similarity and Granularity [84.12126298229866]
We show that zero-shot generalization during instruction tuning happens very early.
We also show that encountering highly similar and fine-grained training data earlier during instruction tuning, without the constraints of defined "tasks", enables better generalization.
For the first time, we show that zero-shot generalization during instruction tuning is a form of similarity-based generalization between training and test data at the instance level.
arXiv Detail & Related papers (2024-06-17T16:40:21Z) - Inferring Behavior-Specific Context Improves Zero-Shot Generalization in Reinforcement Learning [4.902544998453533]
We argue that understanding and utilizing contextual cues, such as the gravity level of the environment, is critical for robust generalization.
Our algorithm demonstrates improved generalization on various simulated domains, outperforming prior context-learning techniques in zero-shot settings.
arXiv Detail & Related papers (2024-04-15T07:31:48Z) - Adversarial Rewards in Universal Learning for Contextual Bandits [32.14208422566497]
We study the limits of learning in contextual bandits, where a learner's rewards depend on their actions and a known context.
We show that optimistic universal learning for contextual bandits with adversarial rewards is impossible in general.
arXiv Detail & Related papers (2023-02-14T16:54:22Z) - Modeling Multiple Views via Implicitly Preserving Global Consistency and Local Complementarity [61.05259660910437]
We propose a global consistency and complementarity network (CoCoNet) to learn representations from multiple views.
On the global stage, we reckon that the crucial knowledge is implicitly shared among views, and enhancing the encoder to capture such knowledge can improve the discriminability of the learned representations.
Lastly, on the local stage, we propose a complementarity factor, which joins cross-view discriminative knowledge and guides the encoders to learn not only view-wise discriminability but also cross-view complementary information.
arXiv Detail & Related papers (2022-09-16T09:24:00Z) - Universal Regression with Adversarial Responses [26.308541799686505]
We provide algorithms for regression with adversarial responses under large classes of non-i.i.d. instance sequences.
We consider universal consistency, which asks for strong consistency of a learner without restrictions on the response values.
arXiv Detail & Related papers (2022-03-09T22:10:30Z) - Universal Online Learning: an Optimistically Universal Learning Rule [0.0]
We study the subject of universal online learning with non-i.i.d. processes for bounded losses.
We show that the k-nearest neighbor algorithm (kNN) is not optimistically universal and present a novel variant of 1NN.
arXiv Detail & Related papers (2022-01-16T02:13:47Z) - Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks.
We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
arXiv Detail & Related papers (2021-12-30T12:20:46Z) - DisCo RL: Distribution-Conditioned Reinforcement Learning for General-Purpose Policies [116.12670064963625]
We develop an off-policy algorithm called distribution-conditioned reinforcement learning (DisCo RL) to efficiently learn contextual policies.
We evaluate DisCo RL on a variety of robot manipulation tasks and find that it significantly outperforms prior methods on tasks that require generalization to new goal distributions.
arXiv Detail & Related papers (2021-04-23T16:51:58Z) - When Is Generalizable Reinforcement Learning Tractable? [74.87383727210705]
We study the query complexity required to train RL agents that can generalize to multiple environments.
We introduce Strong Proximity, a structural condition which precisely characterizes the relative closeness of different environments.
We show that under a natural weakening of this condition, RL can require query complexity that is exponential in the horizon to generalize.
arXiv Detail & Related papers (2021-01-01T19:08:24Z) - Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective [104.67295710363679]
In the classical multi-armed bandit problem, instance-dependent algorithms attain improved performance on "easy" problems with a gap between the best and second-best arm.
We introduce a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds.
We then introduce new oracle-efficient algorithms which adapt to the gap whenever possible, while also attaining the minimax rate in the worst case.
arXiv Detail & Related papers (2020-10-07T01:33:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.