Bandit Social Learning: Exploration under Myopic Behavior
- URL: http://arxiv.org/abs/2302.07425v5
- Date: Thu, 10 Apr 2025 01:47:33 GMT
- Title: Bandit Social Learning: Exploration under Myopic Behavior
- Authors: Kiarash Banihashem, MohammadTaghi Hajiaghayi, Suho Shin, Aleksandrs Slivkins
- Abstract summary: We study social learning dynamics motivated by reviews on online platforms. Agents collectively follow a simple multi-armed bandit protocol, but each agent acts myopically, without regard to exploration. We derive stark learning failures for any such behavior, and provide matching positive results.
- Score: 54.767961587919075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study social learning dynamics motivated by reviews on online platforms. The agents collectively follow a simple multi-armed bandit protocol, but each agent acts myopically, without regard to exploration. We allow the greedy (exploitation-only) algorithm, as well as a wide range of behavioral biases. Specifically, we allow myopic behaviors that are consistent with (parameterized) confidence intervals for the arms' expected rewards. We derive stark learning failures for any such behavior, and provide matching positive results. The learning-failure results extend to Bayesian agents and Bayesian bandit environments. In particular, we obtain general, quantitatively strong results on the failure of the greedy bandit algorithm, both for "frequentist" and "Bayesian" versions. Failure results known previously are quantitatively weak, and either trivial or very specialized. Thus, we provide a theoretical foundation for designing non-trivial bandit algorithms, i.e., algorithms that intentionally explore, which has been missing from the literature. Our general behavioral model can be interpreted as agents' optimism or pessimism. The matching positive results entail a maximal allowed amount of optimism. Moreover, we find that no amount of pessimism helps against the learning failures, whereas even a small-but-constant fraction of extreme optimists avoids the failures and leads to near-optimal regret rates.
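The behavioral model described in the abstract (myopic agents whose arm choices are consistent with parameterized confidence intervals) can be made concrete with a small simulation. The sketch below is not the paper's protocol: the confidence radius and the `optimism` parameter are illustrative assumptions, with optimism = 0 recovering the greedy (exploitation-only) rule, positive values acting optimistically, and negative values pessimistically.

```python
import math
import random

def myopic_choice(counts, sums, t, optimism):
    """Pick an arm the way a myopic agent might: maximize the empirical mean
    shifted by `optimism` times a confidence radius. optimism = 0 is the
    greedy rule; positive values behave like UCB, negative values are
    pessimistic. Illustrative sketch only, not the paper's exact model."""
    best_arm, best_index = None, -float("inf")
    for arm in range(len(counts)):
        if counts[arm] == 0:
            return arm  # in this sketch, untried arms are sampled once
        mean = sums[arm] / counts[arm]
        radius = math.sqrt(2 * math.log(max(t, 2)) / counts[arm])
        index = mean + optimism * radius
        if index > best_index:
            best_arm, best_index = arm, index
    return best_arm

def run(mu, horizon, optimism, seed=0):
    """Agents arrive one by one and act myopically on Bernoulli arms."""
    rng = random.Random(seed)
    counts = [0] * len(mu)
    sums = [0.0] * len(mu)
    for t in range(1, horizon + 1):
        arm = myopic_choice(counts, sums, t, optimism)
        reward = 1.0 if rng.random() < mu[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

# With optimism = 0 (greedy) the better arm can be starved after a few
# unlucky draws; with moderate optimism it tends to be identified.
print(run(mu=[0.6, 0.5], horizon=2000, optimism=0.0))
print(run(mu=[0.6, 0.5], horizon=2000, optimism=1.0))
```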
Related papers
- Optimal Multi-Objective Best Arm Identification with Fixed Confidence [62.36929749450298]
We consider a multi-armed bandit setting in which each arm yields an $M$-dimensional vector reward upon selection.
The end goal is to identify the best arm of every objective in the shortest (expected) time subject to an upper bound on the probability of error.
We propose an algorithm that uses the novel idea of surrogate proportions to sample the arms at each time step, eliminating the need to solve the max-min optimisation problem at each step.
arXiv Detail & Related papers (2025-01-23T12:28:09Z) - DROP: Distributional and Regular Optimism and Pessimism for Reinforcement Learning [6.20048328543366]
This paper introduces a novel theoretically-grounded model with optimism and pessimism, which is derived from control as inference.
Although the model alone showed poor learning performance, DROP achieved excellent performance in all tasks with high generality.
arXiv Detail & Related papers (2024-10-22T23:14:09Z) - Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
arXiv Detail & Related papers (2024-07-24T09:23:22Z) - Pure Exploration under Mediators' Feedback [63.56002444692792]
Multi-armed bandits are a sequential-decision-making framework, where, at each interaction step, the learner selects an arm and observes a reward.
We consider the scenario in which the learner has access to a set of mediators, each of which selects the arms on the agent's behalf according to a possibly unknown policy.
We propose a sequential decision-making strategy for discovering the best arm under the assumption that the mediators' policies are known to the learner.
arXiv Detail & Related papers (2023-08-29T18:18:21Z) - Pure Exploration of Causal Bandits [9.77519365079468]
Causal bandit problem integrates causal inference with multi-armed bandits.
The online learning task is as follows: given a causal graph with unknown causal inference distributions, at each step we can choose to either intervene on one variable or do no intervention.
We provide the first gap-dependent, fully adaptive pure exploration algorithms for three types of causal models.
arXiv Detail & Related papers (2022-06-16T02:19:37Z) - Incentivizing Combinatorial Bandit Exploration [87.08827496301839]
Consider a bandit algorithm that recommends actions to self-interested users in a recommendation system.
Users are free to choose other actions and need to be incentivized to follow the algorithm's recommendations.
While the users prefer to exploit, the algorithm can incentivize them to explore by leveraging the information collected from the previous users.
arXiv Detail & Related papers (2022-06-01T13:46:25Z) - What killed the Convex Booster ? [70.04715330065275]
A landmark negative result of Long and Servedio established a spectacular worst-case failure of a supervised learning trio.
We argue that the source of the negative result lies in the dark side of a pervasive, and otherwise prized, aspect of ML: parameterisation.
arXiv Detail & Related papers (2022-05-19T15:42:20Z) - Breaking the Moments Condition Barrier: No-Regret Algorithm for Bandits with Super Heavy-Tailed Payoffs [27.636407641546914]
We propose a novel robust statistical estimator, mean of medians, which estimates the mean of a random variable by computing the empirical mean of a sequence of empirical medians.
We show that the regret bound is near-optimal even with very heavy-tailed noise.
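As a rough illustration of the mean-of-medians idea (an empirical mean of per-group empirical medians), here is a minimal sketch; the group size and the toy data are assumptions, not the paper's construction.

```python
import statistics

def mean_of_medians(samples, group_size):
    """Estimate the mean of a (possibly heavy-tailed) reward distribution by
    splitting samples into consecutive groups, taking the empirical median of
    each group, and averaging the medians. The group size is a tuning choice
    here; the paper's estimator may set it differently."""
    groups = [samples[i:i + group_size] for i in range(0, len(samples), group_size)]
    medians = [statistics.median(g) for g in groups if g]
    return sum(medians) / len(medians)

# Heavy-tailed rewards: a few extreme outliers barely move the estimate.
rewards = [1.0, 0.9, 1.1, 1.0, 1e6, 0.8, 1.2, 1.0, 0.9, -1e6, 1.1, 1.0]
print(mean_of_medians(rewards, group_size=4))
```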
arXiv Detail & Related papers (2021-10-26T17:30:44Z) - Bellman-consistent Pessimism for Offline Reinforcement Learning [46.97637726255375]
We introduce the notion of Bellman-consistent pessimism for general function approximation.
Our theoretical guarantees only require Bellman closedness, as is standard in the exploratory setting.
arXiv Detail & Related papers (2021-06-13T05:50:36Z) - Be Greedy in Multi-Armed Bandits [22.301793734117805]
The Greedy algorithm is the simplest heuristic for sequential decision problems: it takes the locally optimal choice at each round.
We provide a generic worst-case bound on the regret of the Greedy algorithm.
We prove that it verifies near-optimal worst-case regret bounds in continuous, infinite and many-armed bandit problems.
arXiv Detail & Related papers (2021-01-04T16:47:02Z) - Adaptive Algorithms for Multi-armed Bandit with Composite and Anonymous Feedback [32.62857394584907]
We study the multi-armed bandit (MAB) problem with composite and anonymous feedback.
We propose adaptive algorithms for both the adversarial and non-adversarial cases.
arXiv Detail & Related papers (2020-12-13T12:25:41Z) - The Combinatorial Multi-Bandit Problem and its Application to Energy Management [2.236663830879273]
We study a Combinatorial Multi-Bandit Problem motivated by applications in energy systems management.
For our energy management application we propose a range of algorithms that combine exploration principles for multi-armed bandits with mathematical programming.
arXiv Detail & Related papers (2020-10-30T13:42:54Z) - Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective [104.67295710363679]
In the classical multi-armed bandit problem, instance-dependent algorithms attain improved performance on "easy" problems with a gap between the best and second-best arm.
We introduce a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds.
We then introduce new oracle-efficient algorithms which adapt to the gap whenever possible, while also attaining the minimax rate in the worst case.
arXiv Detail & Related papers (2020-10-07T01:33:06Z) - Latent Bandits Revisited [55.88616813182679]
A latent bandit problem is one in which the learning agent knows the arm reward distributions conditioned on an unknown discrete latent state.
We propose general algorithms for this setting, based on both upper confidence bounds (UCBs) and Thompson sampling.
We provide a unified theoretical analysis of our algorithms, which have lower regret than classic bandit policies when the number of latent states is smaller than the number of actions.
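For readers unfamiliar with the two building blocks mentioned above, the snippet below sketches plain Bernoulli Thompson sampling; it does not implement the paper's latent-state algorithms, and all parameters and the toy arm means are illustrative assumptions.

```python
import random

def thompson_step(successes, failures, rng):
    """One round of Bernoulli Thompson sampling: draw a Beta sample per arm
    and play the arm with the largest draw. Standard building block only;
    reasoning over an unknown latent state is not modeled here."""
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda a: draws[a])

def run(mu, horizon, seed=0):
    rng = random.Random(seed)
    successes = [0] * len(mu)
    failures = [0] * len(mu)
    for _ in range(horizon):
        arm = thompson_step(successes, failures, rng)
        if rng.random() < mu[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures

print(run(mu=[0.3, 0.5, 0.7], horizon=1000))
```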
arXiv Detail & Related papers (2020-06-15T19:24:02Z) - Excursion Search for Constrained Bayesian Optimization under a Limited Budget of Failures [62.41541049302712]
We propose a novel decision maker grounded in control theory that controls the amount of risk we allow in the search as a function of a given budget of failures.
Our algorithm uses the failures budget more efficiently in a variety of optimization experiments, and generally achieves lower regret, than state-of-the-art methods.
arXiv Detail & Related papers (2020-05-15T09:54:09Z) - Robustness Guarantees for Mode Estimation with an Application to Bandits [131.21717367564963]
We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the mean.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.
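To make the mode-based objective concrete, a minimal sketch: estimate each arm's mode from its observed rewards and play the arm with the largest estimate. The binning precision and the toy numbers are assumptions; the paper's estimator and its robustness analysis are more involved.

```python
from collections import Counter

def empirical_mode(samples, precision=1):
    """Crude mode estimate: round samples to a grid and return the most
    frequent grid value. The precision is an illustrative choice; the paper
    develops its own robust mode estimator."""
    counts = Counter(round(x, precision) for x in samples)
    return counts.most_common(1)[0][0]

# Two arms: arm 0 has the higher mean, but arm 1 has the higher mode.
arm0 = [0.2, 0.2, 0.2, 0.3, 5.0]   # mean ~ 1.18, mode ~ 0.2
arm1 = [0.9, 0.9, 0.9, 0.8, 1.0]   # mean ~ 0.9,  mode ~ 0.9
modes = [empirical_mode(arm0), empirical_mode(arm1)]
print(modes, "-> play arm", modes.index(max(modes)))
```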
arXiv Detail & Related papers (2020-03-05T21:29:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.