Sequential Monte Carlo Bandits
- URL: http://arxiv.org/abs/1808.02933v4
- Date: Thu, 4 Apr 2024 14:00:42 GMT
- Title: Sequential Monte Carlo Bandits
- Authors: IƱigo Urteaga, Chris H. Wiggins,
- Abstract summary: We extend Bayesian multi-armed bandit (MAB) algorithms beyond their original setting by making use of sequential Monte Carlo (SMC) methods.
A MAB is a sequential decision making problem where the goal is to learn a policy that maximizes long term payoff.
We showcase how non-stationary bandits, where time dynamics are modeled via linear dynamical systems, can be successfully addressed.
- Score: 1.9205272414658485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We extend Bayesian multi-armed bandit (MAB) algorithms beyond their original setting by making use of sequential Monte Carlo (SMC) methods. A MAB is a sequential decision making problem where the goal is to learn a policy that maximizes long term payoff, where only the reward of the executed action is observed. In the stochastic MAB, the reward for each action is generated from an unknown distribution, often assumed to be stationary. To decide which action to take next, a MAB agent must learn the characteristics of the unknown reward distribution, e.g., compute its sufficient statistics. However, closed-form expressions for these statistics are analytically intractable except for simple, stationary cases. We here utilize SMC for estimation of the statistics Bayesian MAB agents compute, and devise flexible policies that can address a rich class of bandit problems: i.e., MABs with nonlinear, stateless- and context-dependent reward distributions that evolve over time. We showcase how non-stationary bandits, where time dynamics are modeled via linear dynamical systems, can be successfully addressed by SMC-based Bayesian bandit agents. We empirically demonstrate good regret performance of the proposed SMC-based bandit policies in several MAB scenarios that have remained elusive, i.e., in non-stationary bandits with nonlinear rewards.
Related papers
- Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as the $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$textOFUL+$ and Restarted $textSAVE+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z) - Forced Exploration in Bandit Problems [12.13966146283641]
The multi-armed bandit(MAB) is a classical sequential decision problem.
This paper aims to design a multi-armed bandit algorithm that can be implemented without using information about the reward distribution.
arXiv Detail & Related papers (2023-12-12T14:00:29Z) - On Imperfect Recall in Multi-Agent Influence Diagrams [57.21088266396761]
Multi-agent influence diagrams (MAIDs) are a popular game-theoretic model based on Bayesian networks.
We show how to solve MAIDs with forgetful and absent-minded agents using mixed policies and two types of correlated equilibrium.
We also describe applications of MAIDs to Markov games and team situations, where imperfect recall is often unavoidable.
arXiv Detail & Related papers (2023-07-11T07:08:34Z) - Implicitly normalized forecaster with clipping for linear and non-linear
heavy-tailed multi-armed bandits [85.27420062094086]
Implicitly Normalized Forecaster (INF) is considered an optimal solution for adversarial multi-armed bandit (MAB) problems.
We propose a new version of INF called the Implicitly Normalized Forecaster with clipping (INFclip) for MAB problems with heavy-tailed settings.
We demonstrate that INFclip is optimal for linear heavy-tailed MAB problems and works well for non-linear ones.
arXiv Detail & Related papers (2023-05-11T12:00:43Z) - Stochastic Rising Bandits [40.32303434592863]
We study a particular case of the rested and restless bandits in which the arms' expected payoff is monotonically non-decreasing.
This characteristic allows designing specifically crafted algorithms that exploit the regularity of the payoffs to provide tight regret bounds.
We empirically compare our algorithms with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem for a real-world dataset.
arXiv Detail & Related papers (2022-12-07T17:30:45Z) - Understanding the stochastic dynamics of sequential decision-making
processes: A path-integral analysis of multi-armed bandits [7.05949591248206]
The multi-armed bandit (MAB) model is one of the most popular models to study decision-making in an uncertain environment.
In this paper, we employ techniques in statistical physics to analyze the MAB model.
arXiv Detail & Related papers (2022-08-11T09:32:03Z) - Reinforcement Learning for Finite-Horizon Restless Multi-Armed
Multi-Action Bandits [8.136957953239254]
We study a finite-horizon restless multi-armed bandit problem with multiple actions dubbed R(MA)2B.
The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken.
Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy which we call Occupancy-Measured-Reward Index Policy.
arXiv Detail & Related papers (2021-09-20T21:40:12Z) - Mean Field Equilibrium in Multi-Armed Bandit Game with Continuous Reward [4.2710814397148]
Mean field game facilitates analyzing multi-armed bandit (MAB) for a large number of agents by approximating their interactions with an average effect.
Existing mean field models for multi-agent MAB mostly assume a binary reward function, which leads to tractable analysis.
In this paper, we study the mean field bandit game with a continuous reward function.
arXiv Detail & Related papers (2021-05-03T11:50:06Z) - Learning from eXtreme Bandit Feedback [105.0383130431503]
We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces.
In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime.
We employ this estimator in a novel algorithmic procedure -- named Policy Optimization for eXtreme Models (POXM) -- for learning from bandit feedback on XMC tasks.
arXiv Detail & Related papers (2020-09-27T20:47:25Z) - Robustness Guarantees for Mode Estimation with an Application to Bandits [131.21717367564963]
We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the mean.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.
arXiv Detail & Related papers (2020-03-05T21:29:27Z) - Thompson Sampling Algorithms for Mean-Variance Bandits [97.43678751629189]
We develop Thompson Sampling-style algorithms for mean-variance MAB.
We also provide comprehensive regret analyses for Gaussian and Bernoulli bandits.
Our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
arXiv Detail & Related papers (2020-02-01T15:33:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.