Odds-Ratio Thompson Sampling to Control for Time-Varying Effect
- URL: http://arxiv.org/abs/2003.01905v1
- Date: Wed, 4 Mar 2020 05:48:21 GMT
- Title: Odds-Ratio Thompson Sampling to Control for Time-Varying Effect
- Authors: Sulgi Kim and Kyungmin Kim
- Abstract summary: Multi-armed bandit methods have been used for dynamic experiments, particularly in online services.
Many Thompson sampling methods for binary rewards use a logistic model written in a specific parameterization.
We propose a novel method, "Odds-Ratio Thompson Sampling", which is expected to be robust to time-varying effects.
- Score: 7.547547344228166
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-armed bandit methods have been used for dynamic experiments,
particularly in online services. Among these methods, Thompson sampling is widely
used because it is simple yet shows desirable performance. Many Thompson
sampling methods for binary rewards use a logistic model written in a specific
parameterization. In this study, we reparameterize the logistic model with
odds-ratio parameters. This shows that Thompson sampling can be applied using
only a subset of the parameters. Based on this finding, we propose a novel
method, "Odds-Ratio Thompson Sampling", which is expected to be robust to
time-varying effects. We describe how to use the proposed method in continuous
experimentation and discuss a desirable property of the method. In simulation
studies, the novel method is robust to a temporal background effect, while the
loss of performance is only marginal when no such effect is present. Finally,
using a dataset from a real service, we show that the novel method would gain
greater rewards in a practical environment.
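The abstract describes the method only at a high level. As a rough sketch (not the authors' reference implementation), the code below shows one way a two-arm Thompson sampler could act on the log odds-ratio alone: a Gaussian posterior over the log odds ratio is updated from each period's 2x2 outcome table, while the baseline (control-arm) log-odds is treated as a nuisance parameter that may drift over time. The class name OddsRatioTS, the Gaussian/precision-weighting approximation, the continuity correction, and the toy drifting baseline are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

class OddsRatioTS:
    """Two-arm Thompson sampling on the log odds-ratio only (illustrative sketch).

    theta = log odds ratio of arm 1 vs. arm 0; its posterior is approximated
    by a Gaussian. The baseline log-odds of arm 0 is never modeled, so a
    drifting background effect does not enter the arm-selection rule.
    """

    def __init__(self, prior_mean=0.0, prior_var=4.0, rng=None):
        self.mean = prior_mean      # posterior mean of theta
        self.var = prior_var        # posterior variance of theta
        self.rng = rng if rng is not None else np.random.default_rng()

    def choose_arm(self):
        # Probability matching on the subset parameter: play arm 1 with
        # probability P(theta > 0) under the current posterior.
        theta = self.rng.normal(self.mean, np.sqrt(self.var))
        return 1 if theta > 0 else 0

    def update(self, s0, f0, s1, f1):
        # One period's 2x2 table (successes/failures per arm). The empirical
        # log odds ratio and its usual large-sample variance (with a 0.5
        # continuity correction per cell) are combined with the prior by
        # precision weighting, i.e. a conjugate Gaussian update.
        s0, f0, s1, f1 = (x + 0.5 for x in (s0, f0, s1, f1))
        obs_mean = np.log((s1 * f0) / (f1 * s0))
        obs_var = 1.0 / s0 + 1.0 / f0 + 1.0 / s1 + 1.0 / f1
        prec = 1.0 / self.var + 1.0 / obs_var
        self.mean = (self.mean / self.var + obs_mean / obs_var) / prec
        self.var = 1.0 / prec


# Toy continuous experiment: the baseline conversion rate drifts over time,
# while the odds ratio between arms stays fixed.
rng = np.random.default_rng(0)
ts = OddsRatioTS(rng=rng)
true_log_or = 0.3                                   # arm 1 is better by a fixed odds ratio
for period in range(50):
    baseline = -1.0 + 0.5 * np.sin(period / 5.0)    # time-varying background effect
    counts = np.zeros((2, 2))                       # rows: arm, cols: success/failure
    for _ in range(200):
        arm = ts.choose_arm()
        p = 1.0 / (1.0 + np.exp(-(baseline + true_log_or * arm)))
        counts[arm, 0 if rng.random() < p else 1] += 1
    ts.update(counts[0, 0], counts[0, 1], counts[1, 0], counts[1, 1])
print("posterior log odds ratio:", ts.mean, "+/-", np.sqrt(ts.var))
```

Because each period's update uses only the within-period contrast between arms, the drifting baseline largely cancels out of the odds-ratio estimate; that is the intuition behind the robustness claim in the abstract, though the paper's actual update rule may differ.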
Related papers
- Neural Flow Samplers with Shortcut Models [19.81513273510523]
Flow-based samplers generate samples by learning a velocity field that satisfies the continuity equation.
While importance sampling provides an approximation, it suffers from high variance.
arXiv Detail & Related papers (2025-02-11T07:55:41Z)
- BOTS: Batch Bayesian Optimization of Extended Thompson Sampling for Severely Episode-Limited RL Settings [11.008537121214104]
We extend the linear Thompson sampling bandit to select actions based on a state-action utility function.
We show that the proposed method can significantly out-perform standard Thompson sampling in terms of total return.
arXiv Detail & Related papers (2024-11-30T01:27:44Z)
- Adaptive Experimentation When You Can't Experiment [55.86593195947978]
This paper introduces the confounded pure exploration transductive linear bandit (CPET-LB) problem.
Online services can employ a properly randomized encouragement that incentivizes users toward a specific treatment.
arXiv Detail & Related papers (2024-06-15T20:54:48Z)
- VITS: Variational Inference Thompson Sampling for contextual bandits [10.028119153832346]
We introduce and analyze a variant of the Thompson sampling (TS) algorithm for contextual bandits.
We propose a new algorithm, Variational Inference Thompson sampling (VITS), based on Gaussian Variational Inference.
We show that VITS achieves a sub-linear regret bound of the same order in the dimension and number of rounds as traditional TS for linear contextual bandits.
arXiv Detail & Related papers (2023-07-19T17:53:22Z)
- Sample and Predict Your Latent: Modality-free Sequential Disentanglement via Contrastive Estimation [2.7759072740347017]
We introduce a self-supervised sequential disentanglement framework based on contrastive estimation with no external signals.
In practice, we propose a unified, efficient, and easy-to-code sampling strategy for semantically similar and dissimilar views of the data.
Our method presents state-of-the-art results in comparison to existing techniques.
arXiv Detail & Related papers (2023-05-25T10:50:30Z)
- A Provably Efficient Model-Free Posterior Sampling Method for Episodic Reinforcement Learning [50.910152564914405]
Existing posterior sampling methods for reinforcement learning are limited by being model-based or lack worst-case theoretical guarantees beyond linear MDPs.
This paper proposes a new model-free formulation of posterior sampling that applies to more general episodic reinforcement learning problems with theoretical guarantees.
arXiv Detail & Related papers (2022-08-23T12:21:01Z)
- Langevin Monte Carlo for Contextual Bandits [72.00524614312002]
Langevin Monte Carlo Thompson Sampling (LMC-TS) is proposed to directly sample from the posterior distribution in contextual bandits.
We prove that the proposed algorithm achieves the same sublinear regret bound as the best Thompson sampling algorithms for a special case of contextual bandits.
arXiv Detail & Related papers (2022-06-22T17:58:23Z)
- Jo-SRC: A Contrastive Approach for Combating Noisy Labels [58.867237220886885]
We propose a noise-robust approach named Jo-SRC (Joint Sample Selection and Model Regularization based on Consistency).
Specifically, we train the network in a contrastive learning manner. Predictions from two different views of each sample are used to estimate its "likelihood" of being clean or out-of-distribution.
arXiv Detail & Related papers (2021-03-24T07:26:07Z)
- Doubly-Adaptive Thompson Sampling for Multi-Armed and Contextual Bandits [28.504921333436833]
We propose a variant of a Thompson sampling based algorithm that adaptively re-weighs the terms of a doubly robust estimator on the true mean reward of each arm.
The proposed algorithm matches the optimal (minimax) regret rate, and we evaluate it empirically in a semi-synthetic experiment.
We extend this approach to contextual bandits, where there are more sources of bias present apart from the adaptive data collection.
arXiv Detail & Related papers (2021-02-25T22:29:25Z)
- Optimal Off-Policy Evaluation from Multiple Logging Policies [77.62012545592233]
We study off-policy evaluation from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling.
We find the OPE estimator for multiple loggers with minimum variance for any instance, i.e., the efficient one.
arXiv Detail & Related papers (2020-10-21T13:43:48Z)
- Multi-Scale Positive Sample Refinement for Few-Shot Object Detection [61.60255654558682]
Few-shot object detection (FSOD) helps detectors adapt to unseen classes with few training instances.
We propose a Multi-scale Positive Sample Refinement (MPSR) approach to enrich object scales in FSOD.
MPSR generates multi-scale positive samples as object pyramids and refines the prediction at various scales.
arXiv Detail & Related papers (2020-07-18T09:48:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all information) and is not responsible for any consequences of its use.