Bayesian Online Multiple Testing: A Resource Allocation Approach
- URL: http://arxiv.org/abs/2402.11425v4
- Date: Tue, 16 Jul 2024 01:00:15 GMT
- Title: Bayesian Online Multiple Testing: A Resource Allocation Approach
- Authors: Ruicheng Ao, Hongyu Chen, David Simchi-Levi, Feng Zhu
- Abstract summary: We consider the problem of sequentially conducting experiments where each experiment corresponds to a hypothesis testing task.
The goal is to maximize the number of discoveries while maintaining a low error rate at all time points, as measured by the Local False Discovery Rate (LFDR).
We propose a novel policy that incorporates budget safety buffers.
- Score: 21.23710184381363
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of sequentially conducting multiple experiments where each experiment corresponds to a hypothesis testing task. At each time point, the experimenter must make an irrevocable decision of whether to reject the null hypothesis (or equivalently claim a discovery) before the next experimental result arrives. The goal is to maximize the number of discoveries while maintaining a low error rate at all time points, as measured by the Local False Discovery Rate (LFDR). We formulate the problem as an online knapsack problem with exogenous random budget replenishment. We start with general arrival distributions and show that a simple policy achieves an $O(\sqrt{T})$ regret. We complement the result by showing that this regret rate is in general not improvable. We then shift our focus to discrete arrival distributions. We find that many existing re-solving heuristics in the online resource allocation literature, although they achieve bounded loss in canonical settings, may incur an $\Omega(\sqrt{T})$ or even an $\Omega(T)$ regret. With the observation that canonical policies tend to be too optimistic and over-claim discoveries, we propose a novel policy that incorporates budget safety buffers. It turns out that a little more safety can greatly enhance efficiency: small additional logarithmic buffers suffice to reduce the regret from $\Omega(\sqrt{T})$ or even $\Omega(T)$ to $O(\ln^2 T)$. From a practical perspective, we extend the policy to scenarios with continuous arrival distributions, time-dependent information structures, and unknown $T$. We conduct both synthetic experiments and empirical applications on time series data from New York City taxi passengers to validate the performance of our proposed policies. Our results emphasize how effective policies should be designed in online resource allocation problems with exogenous budget replenishment.
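To make the budget-replenishment framing concrete, the following is a minimal Python sketch of a buffer-style rejection rule under a running LFDR constraint. It illustrates the general idea only and is not the paper's exact algorithm; the function name `safety_buffer_policy` and the `c * log(t + 1)` buffer schedule are illustrative assumptions.

```python
import math

import numpy as np


def safety_buffer_policy(lfdr_stream, alpha=0.05, c=1.0):
    """Buffer-style online rejection rule under a running LFDR constraint.

    A minimal sketch of the budget-replenishment view: rejecting a test with
    statistic lfdr_t changes the error budget by (alpha - lfdr_t), and keeping
    the budget nonnegative is equivalent to keeping the running average LFDR
    of all rejections at or below alpha.  Budget-consuming rejections are
    allowed only if a logarithmic safety buffer remains afterwards.  The
    buffer schedule c * log(t + 1) is an assumption for illustration, not the
    paper's exact policy.
    """
    budget = 0.0
    decisions = []
    for t, lfdr_t in enumerate(lfdr_stream, start=1):
        buffer_t = c * math.log(t + 1)   # hypothetical logarithmic buffer
        gain = alpha - lfdr_t            # budget change if we reject now
        if gain >= 0:
            reject = True                # rejecting only adds budget
        else:
            reject = budget + gain >= buffer_t
        if reject:
            budget += gain
        decisions.append(reject)
    return decisions


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lfdrs = rng.beta(1, 4, size=1000)    # synthetic LFDR statistics
    flags = safety_buffer_policy(lfdrs, alpha=0.05, c=0.5)
    print(sum(flags), "discoveries out of", len(flags), "experiments")
```

Setting `c = 0` recovers a greedy budget-feasible rule, the kind of optimistic policy that the abstract notes tends to over-claim discoveries.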
Related papers
- Distributionally Robust Policy Learning under Concept Drifts [33.44768994272614]
This paper studies a more nuanced problem: robust policy learning under concept drift.
We first provide a doubly-robust estimator for evaluating the worst-case average reward of a given policy.
We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class.
arXiv Detail & Related papers (2024-12-18T19:53:56Z)
- Oracle-Efficient Reinforcement Learning for Max Value Ensembles [7.404901768256101]
Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, theoretically and experimentally.
In this work we aim to compete with the $\textit{max-following policy}$, which at each state follows the action of whichever constituent policy has the highest value.
Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies.
arXiv Detail & Related papers (2024-05-27T01:08:23Z)
- Online Policy Learning and Inference by Matrix Completion [12.527541242185404]
We formulate the problem as a matrix completion bandit (MCB).
We explore the $\epsilon$-greedy bandit and online gradient descent.
A faster decaying exploration yields smaller regret but learns the optimal policy less accurately.
arXiv Detail & Related papers (2024-04-26T13:19:27Z)
- Thompson Exploration with Best Challenger Rule in Best Arm Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
arXiv Detail & Related papers (2023-10-01T01:37:02Z)
- Estimating Optimal Policy Value in General Linear Contextual Bandits [50.008542459050155]
In many bandit problems, the maximal reward achievable by a policy is often unknown in advance.
We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable.
We present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on $V^*$.
arXiv Detail & Related papers (2023-02-19T01:09:24Z)
- Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits [81.60136088841948]
We propose an algorithm that minimizes the regret over the time horizon $T$.
The proposed algorithm is applicable to domains such as recommendation systems and transportation.
arXiv Detail & Related papers (2023-01-31T03:49:00Z)
- Anytime-valid off-policy inference for contextual bandits [34.721189269616175]
Contextual bandit algorithms map observed contexts $X_t$ to actions $A_t$ over time.
It is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data.
We present a comprehensive framework for off-policy evaluation (OPE) inference that relaxes unnecessary conditions made in some past works.
arXiv Detail & Related papers (2022-10-19T17:57:53Z)
- A Simple and Optimal Policy Design with Safety against Heavy-Tailed Risk for Stochastic Bandits [27.058087400790555]
We study the multi-armed bandit problem and design new policies that enjoy both worst-case optimality for expected regret and light-tailed risk for regret distribution.
We find that, from a managerial perspective, our new policy design yields better tail distributions and is preferable to celebrated policies.
arXiv Detail & Related papers (2022-06-07T02:10:30Z)
- Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning [59.02006924867438]
Off-policy evaluation and learning (OPE/L) use offline observational data to make better decisions.
Recent work proposed distributionally robust OPE/L (DROPE/L) to remedy sensitivity to distribution shift, but the proposal relies on inverse-propensity weighting.
We propose the first DR algorithms for DROPE/L with KL-divergence uncertainty sets.
arXiv Detail & Related papers (2022-02-19T20:00:44Z)
- Optimal Off-Policy Evaluation from Multiple Logging Policies [77.62012545592233]
We study off-policy evaluation from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling.
We find the OPE estimator for multiple loggers with minimum variance for any instance, i.e., the efficient one.
arXiv Detail & Related papers (2020-10-21T13:43:48Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
- Differentiable Bandit Exploration [38.81737411000074]
We learn exploration policies for an unknown distribution $\mathcal{P}$ using samples from $\mathcal{P}$.
Our approach is a form of meta-learning and exploits properties of $\mathcal{P}$ without making strong assumptions about its form.
arXiv Detail & Related papers (2020-02-17T05:07:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.