BOTS: Batch Bayesian Optimization of Extended Thompson Sampling for Severely Episode-Limited RL Settings
- URL: http://arxiv.org/abs/2412.00308v1
- Date: Sat, 30 Nov 2024 01:27:44 GMT
- Title: BOTS: Batch Bayesian Optimization of Extended Thompson Sampling for Severely Episode-Limited RL Settings
- Authors: Karine Karine, Susan A. Murphy, Benjamin M. Marlin
- Abstract summary: We extend the linear Thompson sampling bandit to select actions based on a state-action utility function.
We show that the proposed method can significantly outperform standard Thompson sampling in terms of total return.
- Score: 11.008537121214104
- Abstract: In settings where the application of reinforcement learning (RL) requires running real-world trials, including the optimization of adaptive health interventions, the number of episodes available for learning can be severely limited due to cost or time constraints. In this setting, the bias-variance trade-off of contextual bandit methods can be significantly better than that of more complex full RL methods. However, Thompson sampling bandits are limited to selecting actions based on distributions of immediate rewards. In this paper, we extend the linear Thompson sampling bandit to select actions based on a state-action utility function consisting of the Thompson sampler's estimate of the expected immediate reward combined with an action bias term. We use batch Bayesian optimization over episodes to learn the action bias terms with the goal of maximizing the expected return of the extended Thompson sampler. The proposed approach is able to learn optimal policies for a strictly broader class of Markov decision processes (MDPs) than standard Thompson sampling. Using an adaptive intervention simulation environment that captures key aspects of behavioral dynamics, we show that the proposed method can significantly outperform standard Thompson sampling in terms of total return, while requiring far fewer episodes than standard value function and policy gradient methods.
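As a rough illustration of the mechanism described in the abstract, the sketch below pairs a linear Thompson sampler carrying an additive per-action bias with a GP-based batch proposal step. All names (`ExtendedTS`, `propose_bias_batch`), the UCB acquisition, and the candidate ranges are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of the BOTS idea: a linear Thompson sampler whose action
# utilities add a learned bias term, with the bias vector tuned across
# episodes by batch Bayesian optimization.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

class ExtendedTS:
    """Linear TS bandit: utility(s, a) = sampled_theta_a . s + bias[a]."""
    def __init__(self, n_actions, state_dim, bias, noise=1.0):
        self.bias = np.asarray(bias, dtype=float)                 # action bias b_a
        self.A = [np.eye(state_dim) for _ in range(n_actions)]    # precisions
        self.b = [np.zeros(state_dim) for _ in range(n_actions)]
        self.noise = noise

    def act(self, state):
        utilities = []
        for a in range(len(self.bias)):
            cov = np.linalg.inv(self.A[a])
            theta = np.random.multivariate_normal(cov @ self.b[a],
                                                  self.noise * cov)
            utilities.append(theta @ state + self.bias[a])  # reward estimate + bias
        return int(np.argmax(utilities))

    def update(self, state, action, reward):
        self.A[action] += np.outer(state, state)
        self.b[action] += reward * state

def propose_bias_batch(tried, returns, n_actions, batch=4, n_cand=512):
    """Batch BO step: fit a GP to (bias vector -> episode return) pairs and
    return the top-UCB candidate bias vectors for the next batch of episodes."""
    gp = GaussianProcessRegressor(normalize_y=True).fit(np.array(tried),
                                                        np.array(returns))
    cand = np.random.uniform(-2.0, 2.0, size=(n_cand, n_actions))
    mu, sd = gp.predict(cand, return_std=True)
    return cand[np.argsort(mu + 2.0 * sd)[-batch:]]
```

Each proposed bias vector would then be evaluated by running one episode with a fresh `ExtendedTS`, recording the observed return, and repeating until the episode budget is exhausted.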
Related papers
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
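A minimal sketch of such a combined objective, assuming a DPO-style preference loss over log-probabilities (the paper's exact objective may differ):

```python
import numpy as np

def regularized_preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                                beta=0.1, alpha=0.05):
    """Preference-optimization loss plus an SFT term on the chosen response.

    The extra -logp_w term plays the role of the implicit adversarial
    regularizer described above; alpha trades it off against preference fit.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    preference_loss = -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid
    sft_loss = -logp_w                                        # SFT regularizer
    return preference_loss + alpha * sft_loss
```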
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We instead directly sample the Q function from its posterior distribution using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
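A bare-bones sketch of that sampling step, assuming unadjusted Langevin dynamics on Q-function parameters (the step size, iteration count, and `grad_loss` callback are illustrative):

```python
import numpy as np

def langevin_q_sample(theta, grad_loss, step=1e-3, n_steps=50, rng=None):
    """Approximate one posterior sample of Q-function parameters with
    unadjusted Langevin dynamics: a gradient step on the TD/regression
    loss plus injected Gaussian noise at the matching temperature."""
    rng = rng or np.random.default_rng()
    for _ in range(n_steps):
        noise = rng.standard_normal(theta.shape)
        theta = theta - step * grad_loss(theta) + np.sqrt(2.0 * step) * noise
    return theta
```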
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
- Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits [88.21288104408556]
We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family.
We propose a Thompson sampling algorithm, termed Expulli, which uses a novel sampling distribution to avoid under-estimation of the optimal arm.
arXiv Detail & Related papers (2022-06-07T18:08:21Z)
- Feel-Good Thompson Sampling for Contextual Bandits and Reinforcement Learning [17.860102738896096]
We present a theoretical analysis of Thompson Sampling, with a focus on frequentist regret bounds.
We show that the standard Thompson Sampling is not aggressive enough in exploring new actions, leading to suboptimality in some pessimistic situations.
We show that the theoretical framework can be used to derive Bayesian regret bounds for standard Thompson Sampling, and frequentist regret bounds for Feel-Good Thompson Sampling.
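A sketch of how a "feel-good" adjustment can enter the pseudo-posterior, assuming a squared-error likelihood; `f(theta, x, a)` is a hypothetical value model and `lam` the exploration weight:

```python
def feel_good_log_posterior(theta, data, f, n_actions, eta=1.0, lam=0.1):
    """Unnormalized log pseudo-posterior for a Feel-Good-style sampler.

    Standard TS keeps only the likelihood term; the added bonus
    lam * max_a f(theta, x, a) upweights optimistic models, making
    exploration of new actions more aggressive.
    """
    logp = 0.0
    for x, a, r in data:                  # logged (context, action, reward)
        logp -= eta * (f(theta, x, a) - r) ** 2
        logp += lam * max(f(theta, x, b) for b in range(n_actions))
    return logp
```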
arXiv Detail & Related papers (2021-10-02T20:10:40Z)
- Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
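A hedged stand-in for that idea: fit a GP to returns probed around the current policy parameters and ascend the gradient of its posterior mean (the paper's model additionally treats gradient observations probabilistically):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gp_local_policy_step(theta, objective, n_probe=16, radius=0.1, lr=0.5):
    """One local search step: probe noisy returns near theta, fit a GP
    surrogate, and take a finite-difference ascent step on its mean."""
    X = theta + radius * np.random.randn(n_probe, theta.size)
    y = np.array([objective(x) for x in X])
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=radius)).fit(X, y)
    eps = 1e-3
    grad = np.array([(gp.predict([theta + eps * e])
                      - gp.predict([theta - eps * e]))[0] / (2 * eps)
                     for e in np.eye(theta.size)])
    return theta + lr * grad
```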
arXiv Detail & Related papers (2021-06-22T16:07:02Z)
- Doubly-Adaptive Thompson Sampling for Multi-Armed and Contextual Bandits [28.504921333436833]
We propose a variant of a Thompson sampling based algorithm that adaptively reweights the terms of a doubly robust estimator of the true mean reward of each arm.
The proposed algorithm matches the optimal (minimax) regret rate, and we evaluate it empirically in a semi-synthetic experiment.
We extend this approach to contextual bandits, where there are more sources of bias present apart from the adaptive data collection.
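A minimal sketch of the kind of doubly robust estimate being reweighted, for a single arm (the adaptive reweighting scheme itself is the paper's contribution and is not reproduced here):

```python
import numpy as np

def doubly_robust_arm_mean(rewards, actions, propensities, model_mean, arm):
    """Doubly robust estimate of one arm's mean reward under adaptive data:
    a model prediction plus an inverse-propensity correction on the rounds
    where the arm was actually pulled."""
    rewards = np.asarray(rewards, dtype=float)
    pulled = np.asarray(actions) == arm
    correction = np.where(pulled,
                          (rewards - model_mean) / np.asarray(propensities),
                          0.0)
    return model_mean + correction.mean()
```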
arXiv Detail & Related papers (2021-02-25T22:29:25Z)
- Asymptotic Convergence of Thompson Sampling [0.0]
Thompson sampling has been shown to be an effective policy across a variety of online learning tasks.
We prove a convergence result for Thompson sampling under the assumption of sub-linear Bayesian regret.
Our results rely on the martingale structure inherent in Thompson sampling.
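In symbols, the basic implication behind this result (a hedged restatement; the paper's theorem is stated more carefully):

```latex
% If the Bayesian regret grows sub-linearly, the average per-round
% optimality gap of Thompson sampling must vanish.
\[
  \mathrm{BR}(T)
  = \mathbb{E}\!\left[\sum_{t=1}^{T}\bigl(r(A^{*}) - r(A_t)\bigr)\right]
  = o(T)
  \;\Longrightarrow\;
  \frac{1}{T}\,\mathbb{E}\!\left[\sum_{t=1}^{T} r(A_t)\right]
  \longrightarrow \mathbb{E}\bigl[r(A^{*})\bigr].
\]
```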
arXiv Detail & Related papers (2020-11-08T07:36:49Z)
- Learning from eXtreme Bandit Feedback [105.0383130431503]
We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces.
In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime.
We employ this estimator in a novel algorithmic procedure -- named Policy Optimization for eXtreme Models (POXM) -- for learning from bandit feedback on XMC tasks.
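A rough sketch of the selective restriction (names like `top_actions` and the candidate-set construction are assumptions; POXM's full procedure also involves learning the policy):

```python
def selective_is_value(logged, target_prob, top_actions):
    """Selective importance sampling: weight only rounds whose logged action
    lies in a small candidate set (e.g., the logging policy's top-k actions),
    trading some bias for far lower variance on extreme action spaces.

    logged: iterable of (context, action, reward, logging_propensity).
    """
    total, n = 0.0, 0
    for x, a, r, p in logged:
        n += 1
        if a in top_actions(x):                 # selective restriction
            total += target_prob(x, a) / p * r  # standard IS weight otherwise
    return total / max(n, 1)
```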
arXiv Detail & Related papers (2020-09-27T20:47:25Z)
- Policy Gradient Optimization of Thompson Sampling Policies [3.3345263849085582]
We study the use of policy gradient algorithms to optimize over a class of generalized Thompson sampling policies.
We show that direct policy search on top of Thompson sampling automatically corrects for some of the algorithm's known shortcomings.
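One way to picture "direct policy search on top of Thompson sampling" is to expose a TS hyperparameter (say, a posterior-variance scale) and tune it with a smoothed score-function gradient; this is an illustrative stand-in, not the paper's exact policy class or estimator:

```python
import numpy as np

def tune_ts_scale(run_episode, c=1.0, lr=0.05, n_batch=32, sigma=0.1,
                  n_iters=100, rng=None):
    """Gaussian-smoothed gradient ascent on a TS posterior-scale parameter c.
    run_episode(c) runs one bandit episode with variance scale c and
    returns its total reward (assumed interface)."""
    rng = rng or np.random.default_rng()
    for _ in range(n_iters):
        cs = c + sigma * rng.standard_normal(n_batch)    # perturbed scales
        rets = np.array([run_episode(abs(ci)) for ci in cs])
        grad = np.mean((rets - rets.mean()) * (cs - c)) / sigma ** 2
        c += lr * grad
    return c
```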
arXiv Detail & Related papers (2020-06-30T03:27:22Z)
- Odds-Ratio Thompson Sampling to Control for Time-Varying Effect [7.547547344228166]
Multi-armed bandit methods have been used for dynamic experiments, particularly in online services.
Many Thompson sampling methods for binary rewards use a logistic model written in a specific parameterization.
We propose a novel method, "Odds-Ratio Thompson Sampling", which is expected to be robust to time-varying effects.
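A toy sketch of the action-selection step under this reading, assuming a Gaussian posterior over the log odds ratio between treatment and control (the paper's model and inference are richer):

```python
import numpy as np

def odds_ratio_ts_choice(mu, var, rng=None):
    """Choose control (0) or treatment (1) by sampling the posterior of the
    log odds ratio delta = logit(p1) - logit(p0). A time-varying baseline
    shifts both arms' log-odds and largely cancels out of delta."""
    rng = rng or np.random.default_rng()
    delta = rng.normal(mu, np.sqrt(var))    # posterior sample of the log OR
    return int(delta > 0)
```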
arXiv Detail & Related papers (2020-03-04T05:48:21Z)
- MOTS: Minimax Optimal Thompson Sampling [89.2370817955411]
It has remained an open problem whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems.
We propose a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step.
We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$, as well as the optimal regret bound for Gaussian rewards when $T$ approaches infinity.
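A toy per-arm sampling step in the spirit of that clipping rule (the threshold form and constants are assumptions; see the paper for the exact schedule):

```python
import numpy as np

def clipped_ts_sample(mean, pulls, t, n_arms, alpha=4.0, rng=None):
    """MOTS-style draw for one arm: sample from the usual Gaussian posterior,
    then clip from above at a UCB-like threshold so a lucky draw cannot
    over-explore; the clipping is what recovers minimax-order regret."""
    rng = rng or np.random.default_rng()
    draw = rng.normal(mean, np.sqrt(1.0 / pulls))
    bonus = np.sqrt((alpha / pulls) * max(0.0, np.log(t / (n_arms * pulls))))
    return min(draw, mean + bonus)
```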
arXiv Detail & Related papers (2020-03-03T21:24:39Z)