Thompson Sampling in Partially Observable Contextual Bandits
- URL: http://arxiv.org/abs/2402.10289v1
- Date: Thu, 15 Feb 2024 19:37:39 GMT
- Title: Thompson Sampling in Partially Observable Contextual Bandits
- Authors: Hongju Park and Mohamad Kazem Shirani Faradonbeh
- Abstract summary: We study bandit policies for learning to select optimal arms from observations that are noisy linear functions of the unobserved context vectors.
Our theoretical analysis shows that the Thompson sampling policy successfully balances exploration and exploitation.
These techniques pave the way toward studying other decision-making problems with contextual information as well as partial observations.
- Score: 2.465689259704613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contextual bandits constitute a classical framework for decision-making under
uncertainty. In this setting, the goal is to learn the arms of highest reward
subject to contextual information, while the unknown reward parameters of each
arm need to be learned by experimenting with that specific arm. Accordingly, a
fundamental problem is that of balancing exploration (i.e., pulling different
arms to learn their parameters), versus exploitation (i.e., pulling the best
arms to gain reward). To study this problem, the existing literature mostly
considers perfectly observed contexts. However, the setting of partial context
observations remains unexplored to date, despite being theoretically more
general and practically more versatile. We study bandit policies for learning
to select optimal arms based on the data of observations, which are noisy
linear functions of the unobserved context vectors. Our theoretical analysis
shows that the Thompson sampling policy successfully balances exploration and
exploitation. Specifically, we establish the following: (i) regret bounds that
grow poly-logarithmically with time, (ii) square-root consistency of parameter
estimation, and (iii) scaling of the regret with other quantities including
dimensions and number of arms. Extensive numerical experiments with both real
and synthetic data are presented as well, corroborating the efficacy of
Thompson sampling. To establish the results, we introduce novel martingale
techniques and concentration inequalities to address partially observed
dependent random variables generated from unspecified distributions, and also
leverage problem-dependent information to sharpen probabilistic bounds for
time-varying suboptimality gaps. These techniques pave the way toward
studying other decision-making problems with contextual information as well as
partial observations.
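The abstract describes the observation model and the policy only in words, so the following is a minimal, self-contained Python sketch of Thompson sampling in a partially observed linear-Gaussian contextual bandit of the kind discussed above. The dimensions, noise scales, and Gaussian assumptions are illustrative choices, and the sketch additionally assumes the sensing matrix and noise covariances are known, so that only the arms' reward parameters have to be learned; it is a simplified reading of the setting, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes (not taken from the paper).
d_x, d_y, n_arms, horizon = 3, 3, 5, 2000

# Hypothetical ground truth: per-arm reward parameters mu_a and a sensing
# matrix A.  The sketch assumes the observation model y_t = A x_t + noise is
# known, so only the reward parameters must be learned from bandit feedback.
mu_true = rng.normal(size=(n_arms, d_x))
A = rng.normal(size=(d_y, d_x))
sigma_x, sigma_y, sigma_r = 1.0, 0.5, 0.5   # context / observation / reward noise scales

# Posterior-mean estimate of the unobserved context x_t from the noisy linear
# observation y_t (standard linear-Gaussian conditioning): x_hat = K y.
S = sigma_x ** 2 * np.eye(d_x)
K = S @ A.T @ np.linalg.inv(A @ S @ A.T + sigma_y ** 2 * np.eye(d_y))

# Per-arm Bayesian linear regression statistics (Gaussian prior / ridge).
B = np.stack([np.eye(d_x) for _ in range(n_arms)])  # precision matrices
f = np.zeros((n_arms, d_x))                         # response vectors

total_regret = 0.0
for t in range(horizon):
    x = rng.normal(scale=sigma_x, size=d_x)              # unobserved context
    y = A @ x + rng.normal(scale=sigma_y, size=d_y)      # what the learner sees
    x_hat = K @ y                                        # estimated context

    # Thompson sampling: draw one parameter per arm from its posterior and
    # act greedily with respect to the sampled parameters.
    sampled = np.empty((n_arms, d_x))
    for a in range(n_arms):
        cov = sigma_r ** 2 * np.linalg.inv(B[a])
        sampled[a] = rng.multivariate_normal(np.linalg.solve(B[a], f[a]),
                                             (cov + cov.T) / 2)
    arm = int(np.argmax(sampled @ x_hat))

    reward = mu_true[arm] @ x + rng.normal(scale=sigma_r)
    B[arm] += np.outer(x_hat, x_hat)                     # posterior update for the pulled arm
    f[arm] += reward * x_hat

    # Benchmark: the best arm given what can actually be observed.
    oracle = int(np.argmax(mu_true @ x_hat))
    total_regret += (mu_true[oracle] - mu_true[arm]) @ x_hat

print(f"average per-round regret over {horizon} rounds: {total_regret / horizon:.3f}")
```

As the per-arm posteriors concentrate, the sampled parameters approach the estimates and the average per-round regret shrinks, qualitatively in line with the poly-logarithmic regret growth stated in the abstract.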
Related papers
- Sequential Manipulation Against Rank Aggregation: Theory and Algorithm [119.57122943187086]
We leverage an online attack on the vulnerable data collection process.
From the game-theoretic perspective, the confrontation scenario is formulated as a distributionally robust game.
The proposed method manipulates the results of rank aggregation methods in a sequential manner.
arXiv Detail & Related papers (2024-07-02T03:31:21Z) - Worst-case Performance of Greedy Policies in Bandits with Imperfect
Context Observations [1.370633147306388]
This work considers Greedy reinforcement learning policies that take actions as if the current estimates of the parameter and of the unobserved contexts coincide with the corresponding true values.
We establish that the non-asymptotic worst-case regret grows logarithmically with the time horizon and the failure probability, while it scales linearly with the number of arms (a minimal sketch of this certainty-equivalence rule appears after this list).
arXiv Detail & Related papers (2022-04-10T21:27:56Z) - From Optimality to Robustness: Dirichlet Sampling Strategies in
Stochastic Bandits [0.0]
We study a generic Dirichlet Sampling (DS) algorithm, based on pairwise comparisons of empirical indices computed with re-sampling of the arms' observations (a simplified duel of this kind is sketched after this list).
We show that different variants of this strategy achieve provably optimal regret guarantees when the distributions are bounded and logarithmic regret for semi-bounded distributions with a mild quantile condition.
arXiv Detail & Related papers (2021-11-18T14:34:21Z) - Analysis of Thompson Sampling for Partially Observable Contextual
Multi-Armed Bandits [1.8275108630751844]
We propose a Thompson Sampling algorithm for partially observable contextual multi-armed bandits.
We show that the regret of the presented policy scales logarithmically with time and the number of arms, and linearly with the dimension.
arXiv Detail & Related papers (2021-10-23T08:51:49Z) - Residual Overfit Method of Exploration [78.07532520582313]
We propose an approximate exploration methodology based on fitting only two point estimates, one tuned and one overfit.
The approach drives exploration towards actions where the overfit model exhibits the most overfitting compared to the tuned model.
We compare ROME against a set of established contextual bandit methods on three datasets and find it to be among the best performers (a minimal sketch of the tuned-versus-overfit idea appears after this list).
arXiv Detail & Related papers (2021-10-06T17:05:33Z) - Learning with Instance Bundles for Reading Comprehension [61.823444215188296]
We introduce new supervision techniques that compare question-answer scores across multiple related instances.
Specifically, we normalize these scores across various neighborhoods of closely contrasting questions and/or answers.
We empirically demonstrate the effectiveness of training with instance bundles on two datasets.
arXiv Detail & Related papers (2021-04-18T06:17:54Z) - Doubly-Adaptive Thompson Sampling for Multi-Armed and Contextual Bandits [28.504921333436833]
We propose a variant of a Thompson sampling based algorithm that adaptively re-weighs the terms of a doubly robust estimator on the true mean reward of each arm.
The proposed algorithm matches the optimal (minimax) regret rate; an empirical evaluation in a semi-synthetic experiment is provided as well.
We extend this approach to contextual bandits, where there are more sources of bias present apart from the adaptive data collection.
arXiv Detail & Related papers (2021-02-25T22:29:25Z) - Continuous Mean-Covariance Bandits [39.820490484375156]
We propose a novel Continuous Mean-Covariance Bandit (CMCB) model to take into account option correlation.
In CMCB, there is a learner who sequentially chooses weight vectors on given options and observes random feedback according to the decisions.
We propose novel algorithms with optimal regrets (within logarithmic factors) and provide matching lower bounds to validate their optimality.
arXiv Detail & Related papers (2021-02-24T06:37:05Z) - Online Model Selection: a Rested Bandit Formulation [49.69377391589057]
We introduce and analyze a best arm identification problem in the rested bandit setting.
We define a novel notion of regret for this problem, where we compare to the policy that always plays the arm having the smallest expected loss at the end of the game.
Unlike known model selection efforts in the recent bandit literature, our algorithm exploits the specific structure of the problem to learn the unknown parameters of the expected loss function.
arXiv Detail & Related papers (2020-12-07T08:23:08Z) - Robustness Guarantees for Mode Estimation with an Application to Bandits [131.21717367564963]
We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the means.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.
arXiv Detail & Related papers (2020-03-05T21:29:27Z) - The Price of Incentivizing Exploration: A Characterization via Thompson
Sampling and Sample Complexity [83.81297078039836]
We consider incentivized exploration: a version of multi-armed bandits where the choice of arms is controlled by self-interested agents.
We focus on the price of incentives: the loss in performance, broadly construed, incurred for the sake of incentive-compatibility.
arXiv Detail & Related papers (2020-02-03T04:58:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.