Adversarial Rewards in Universal Learning for Contextual Bandits
- URL: http://arxiv.org/abs/2302.07186v2
- Date: Mon, 12 Jun 2023 16:52:27 GMT
- Title: Adversarial Rewards in Universal Learning for Contextual Bandits
- Authors: Moise Blanchard, Steve Hanneke and Patrick Jaillet
- Abstract summary: We study the limits of learning in contextual bandits, where a learner's rewards depend on their actions and a known context.
We show that optimistic universal learning for contextual bandits with adversarial rewards is impossible in general.
- Score: 32.14208422566497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the fundamental limits of learning in contextual bandits, where a
learner's rewards depend on their actions and a known context, which extends
the canonical multi-armed bandit to the case where side-information is
available. We are interested in universally consistent algorithms, which
achieve sublinear regret compared to any measurable fixed policy, without any
function class restriction. For stationary contextual bandits, when the
underlying reward mechanism is time-invariant, Blanchard et al. (2022)
characterized learnable context processes for which universal consistency is
achievable; and further gave algorithms ensuring universal consistency whenever
this is achievable, a property known as optimistic universal consistency. It is
well understood, however, that reward mechanisms can evolve over time, possibly
adversarially, and depending on the learner's actions. We show that optimistic
universal learning for contextual bandits with adversarial rewards is
impossible in general, contrary to all previously studied settings in online
learning -- including standard supervised learning. We also give necessary and
sufficient conditions for universal learning under various adversarial reward
models, and an exact characterization for online rewards. In particular, the
set of learnable processes for these reward models is still extremely general
-- larger than i.i.d., stationary or ergodic -- but in general strictly smaller
than that for supervised learning or stationary contextual bandits, shedding
light on new adversarial phenomena.
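To make the setting concrete, here is a minimal, illustrative sketch (not taken from the paper) of the interaction protocol described above: at each round the learner observes a context, commits to an action, and then sees only the reward of that action, where the reward mechanism may vary over time and react to past play. Universal consistency asks that the time-averaged regret against any fixed measurable comparator policy vanishes. The context process, reward mechanism, comparator policy, and epsilon-greedy learner below are placeholder assumptions chosen only so the loop runs.
```python
import numpy as np

rng = np.random.default_rng(0)

T = 10_000          # horizon
n_actions = 5       # finite action set, extending the multi-armed bandit with contexts

def context_process(t):
    # Placeholder context stream; the paper allows very general (non-i.i.d.) processes.
    return rng.uniform(size=2)

def adversarial_reward(t, x, a):
    # Placeholder reward mechanism that may depend on time, context, and action
    # (in the paper's setting it may also depend on the learner's past actions).
    return float(np.sin(t / 500.0 + a) * x[0] + x[1] * (a % 2))

def fixed_policy(x):
    # An arbitrary fixed (measurable) comparator policy.
    return int(x[0] * n_actions) % n_actions

def learner_action(t, x, history):
    # Stand-in learner: context-agnostic epsilon-greedy over per-action empirical means.
    if rng.random() < 0.1 or not history:
        return int(rng.integers(n_actions))
    means = [np.mean([r for (a, r) in history if a == k] or [0.0]) for k in range(n_actions)]
    return int(np.argmax(means))

history = []
regret = 0.0
for t in range(1, T + 1):
    x = context_process(t)                      # context revealed to the learner
    a = learner_action(t, x, history)           # learner commits to an action
    r_learner = adversarial_reward(t, x, a)     # only the chosen action's reward is observed
    history.append((a, r_learner))
    # Regret bookkeeping peeks at the comparator's reward; the learner never sees it.
    r_comparator = adversarial_reward(t, x, fixed_policy(x))
    regret += r_comparator - r_learner

print(f"average regret vs. the fixed policy after T={T}: {regret / T:.4f}")
```
The epsilon-greedy stand-in ignores the context entirely and is purely illustrative; the paper asks which context processes admit a learner whose average regret vanishes against every measurable comparator, and shows that no single learner can achieve this on all processes for which it is achievable by some learner.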
Related papers
- Contextual Bandits and Optimistically Universal Learning [32.14208422566497]
We focus on consistency -- vanishing regret compared to the optimal policy.
We show that for large classes of non-i.i.d. contexts, consistency can be achieved regardless of the time-invariant reward mechanism.
arXiv Detail & Related papers (2022-12-31T16:15:28Z)
- On the Complexity of Representation Learning in Contextual Linear Bandits [110.84649234726442]
We show that representation learning is fundamentally more complex than linear bandits.
In particular, learning with a given set of representations is never simpler than learning with the worst realizable representation in the set.
arXiv Detail & Related papers (2022-12-19T13:08:58Z)
- Modeling Multiple Views via Implicitly Preserving Global Consistency and Local Complementarity [61.05259660910437]
We propose a global consistency and complementarity network (CoCoNet) to learn representations from multiple views.
On the global stage, we reckon that the crucial knowledge is implicitly shared among views, and enhancing the encoder to capture such knowledge can improve the discriminability of the learned representations.
Lastly, on the local stage, we propose a complementarity factor that joins cross-view discriminative knowledge and guides the encoders to learn not only view-wise discriminability but also cross-view complementary information.
arXiv Detail & Related papers (2022-09-16T09:24:00Z)
- Universal Regression with Adversarial Responses [26.308541799686505]
We provide algorithms for regression with adversarial responses under large classes of non-i.i.d. instance sequences.
We consider universal consistency which asks for strong consistency of a learner without restrictions on the value responses.
arXiv Detail & Related papers (2022-03-09T22:10:30Z)
- Universal Online Learning: an Optimistically Universal Learning Rule [0.0]
We study the subject of universal online learning with non-i.i.d. processes for bounded losses.
We show that the k-nearest neighbor algorithm (kNN) is not optimistically universal and present a novel variant of 1NN.
arXiv Detail & Related papers (2022-01-16T02:13:47Z)
- Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks.
We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
arXiv Detail & Related papers (2021-12-30T12:20:46Z)
- Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective [104.67295710363679]
In the classical multi-armed bandit problem, instance-dependent algorithms attain improved performance on "easy" problems with a gap between the best and second-best arm.
We introduce a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds.
We then introduce new oracle-efficient algorithms which adapt to the gap whenever possible, while also attaining the minimax rate in the worst case.
arXiv Detail & Related papers (2020-10-07T01:33:06Z)
- Contextual Bandit with Missing Rewards [27.066965426355257]
We consider a novel variant of the contextual bandit problem where the reward associated with each context-based decision may not always be observed.
This new problem is motivated by certain online settings including clinical trial and ad recommendation applications.
We propose to combine the standard contextual bandit approach with an unsupervised learning mechanism such as clustering.
arXiv Detail & Related papers (2020-07-13T13:29:51Z)
- Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
arXiv Detail & Related papers (2019-12-31T18:07:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.