Online Learning of Optimal Sequential Testing Policies
- URL: http://arxiv.org/abs/2509.03707v1
- Date: Wed, 03 Sep 2025 20:44:32 GMT
- Title: Online Learning of Optimal Sequential Testing Policies
- Authors: Qiyuan Chen, Raed Al Kontar
- Abstract summary: We study an online learning problem that seeks optimal testing policies for a stream of subjects. Although conducting every candidate test for a subject provides more information, it is often preferable to select only a subset. We prove that the minimax regret must scale at least as $\Omega(T^{\frac{2}{3}})$, in contrast to the $\Theta(\sqrt{T})$ rate in episodic MDPs.
- Score: 7.8024154978341365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies an online learning problem that seeks optimal testing policies for a stream of subjects, each of whom can be evaluated through a sequence of candidate tests drawn from a common pool. We refer to this problem as the Online Testing Problem (OTP). Although conducting every candidate test for a subject provides more information, it is often preferable to select only a subset when tests are correlated and costly, and make decisions with partial information. If the joint distribution of test outcomes were known, the problem could be cast as a Markov Decision Process (MDP) and solved exactly. In practice, this distribution is unknown and must be learned online as subjects are tested. When a subject is not fully tested, the resulting missing data can bias estimates, making the problem fundamentally harder than standard episodic MDPs. We prove that the minimax regret must scale at least as $\Omega(T^{\frac{2}{3}})$, in contrast to the $\Theta(\sqrt{T})$ rate in episodic MDPs, revealing the difficulty introduced by missingness. This elevated lower bound is then matched by an Explore-Then-Commit algorithm whose cumulative regret is $\tilde{O}(T^{\frac{2}{3}})$ for both discrete and Gaussian distributions. To highlight the consequence of missingness-dependent rewards in OTP, we study a variant called the Online Cost-sensitive Maximum Entropy Sampling Problem, where rewards are independent of missing data. This structure enables an iterative-elimination algorithm that achieves $\tilde{O}(\sqrt{T})$ regret, breaking the $\Omega(T^{\frac{2}{3}})$ lower bound for OTP. Numerical results confirm our theory in both settings. Overall, this work deepens the understanding of the exploration--exploitation trade-off under missing data and guides the design of efficient sequential testing policies.
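The $T^{\frac{2}{3}}$ rate follows the usual Explore-Then-Commit tradeoff: exploring $m$ subjects with full test batteries costs $O(m)$, while committing to a policy planned under the estimated distribution contributes roughly $O(T/\sqrt{m})$ thereafter, and balancing the two terms gives $m \asymp T^{\frac{2}{3}}$. The sketch below illustrates this explore/commit split under simplifying assumptions that are not from the paper: outcomes are binary readings of a hidden label, a "policy" is a fixed subset of tests followed by a majority vote, and the reward is prediction accuracy minus testing cost (the paper's policies are adaptive and solve an MDP).
```python
import itertools
import numpy as np

# Minimal Explore-Then-Commit sketch for the Online Testing Problem.
# Simplifying assumptions (NOT from the paper): binary test outcomes,
# non-adaptive subset policies, majority-vote prediction. The point is
# the m ~ T^(2/3) explore/commit split behind the O~(T^(2/3)) regret.

rng = np.random.default_rng(0)
T, n_tests, cost = 3000, 3, 0.05
accs = np.array([0.90, 0.75, 0.65])  # true per-test accuracies (unknown to learner)

def run_tests(label, subset):
    """Each test in `subset` reports the true label w.p. its accuracy."""
    correct = rng.random(len(subset)) < accs[list(subset)]
    return np.where(correct, label, 1 - label)

# Phase 1: explore. Run EVERY test on the first m ~ T^(2/3) subjects so
# the joint outcome distribution is estimated without missing data.
m = int(np.ceil(T ** (2 / 3)))
labels = rng.integers(0, 2, size=T)
full = np.stack([run_tests(y, range(n_tests)) for y in labels[:m]])
acc_hat = (full == labels[:m, None]).mean(axis=0)  # estimated accuracies

def expected_reward(subset, acc):
    """Expected accuracy of a majority vote over `subset`, minus cost,
    by enumerating which tests report correctly."""
    p_correct = 0.0
    for pattern in itertools.product([0, 1], repeat=len(subset)):
        p = np.prod([acc[t] if ok else 1 - acc[t]
                     for t, ok in zip(subset, pattern)])
        k = sum(pattern)            # tests that reported correctly
        if 2 * k > len(subset):
            p_correct += p          # strict majority is right
        elif 2 * k == len(subset):
            p_correct += 0.5 * p    # tie -> fair coin
    return p_correct - cost * len(subset)

# Phase 2: commit. Pick the subset that looks best under acc_hat and
# apply it to all remaining subjects.
subsets = [s for r in range(n_tests + 1)
           for s in itertools.combinations(range(n_tests), r)]
best = max(subsets, key=lambda s: expected_reward(s, acc_hat))

reward = 0.0
for y in labels[m:]:
    out = run_tests(y, best)
    votes = int(out.sum())
    guess = rng.integers(0, 2) if 2 * votes == len(best) else int(2 * votes > len(best))
    reward += float(guess == y) - cost * len(best)

print(f"explored {m} subjects, committed to tests {best}, "
      f"avg committed reward {reward / (T - m):.3f}")
```
With larger $T$, the committed subset matches the subset that is optimal under the true accuracies more often, which is the effect the $\tilde{O}(T^{\frac{2}{3}})$ bound formalizes.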
Related papers
- Online Learning with Probing for Sequential User-Centric Selection [8.45399786458738]
We present a probing-augmented user-centric selection (PUCS) framework, where a learner first probes a subset of arms to obtain side information on resources and rewards, and then assigns $K$ plays to $M$ arms. For the offline setting with known distributions, we present a greedy probing algorithm with a constant-factor approximation guarantee $\zeta = (e-1)/(2e-1)$. For the online setting with unknown distributions, we introduce OLPA, a bandit algorithm with a regret bound of $\mathcal{O}(\sqrt{T} + \ldots)$.
arXiv Detail & Related papers (2025-07-27T03:32:51Z) - On Traceability in $\ell_p$ Stochastic Convex Optimization [34.23960368886818]
We say a learning algorithm is $m$-traceable if, by analyzing its output, it is possible to identify at least $m$ of its training samples. Our main results uncover a fundamental tradeoff between traceability and excess risk in SCO.
arXiv Detail & Related papers (2025-02-24T18:10:06Z) - Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization [65.8915778873691]
Learning conditional distributions is a central problem in machine learning. We propose a new paradigm that integrates both paired and unpaired data. We show that our approach can theoretically recover true conditional distributions with arbitrarily small error.
arXiv Detail & Related papers (2024-10-03T16:12:59Z) - Bayesian Online Multiple Testing: A Resource Allocation Approach [21.23710184381363]
We consider the problem of sequentially conducting experiments where each experiment corresponds to a hypothesis testing task.
The goal is to maximize the number of discoveries while maintaining a low error rate at all time points, as measured by the Local False Discovery Rate (LFDR).
We propose a novel policy that incorporates budget safety buffers.
arXiv Detail & Related papers (2024-02-18T02:11:54Z) - Collaborative non-parametric two-sample testing [55.98760097296213]
The goal is to identify nodes where the null hypothesis $p_v = q_v$ should be rejected.
We propose the non-parametric collaborative two-sample testing (CTST) framework that efficiently leverages the graph structure.
Our methodology integrates elements from f-divergence estimation, Kernel Methods, and Multitask Learning.
arXiv Detail & Related papers (2024-02-08T14:43:56Z) - Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost. We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z) - Active Cost-aware Labeling of Streaming Data [11.501619634838312]
We study actively labeling streaming data, where an active learner is faced with a stream of data points and must carefully choose which of these points to label via an expensive experiment.
We first study a setting when the data's inputs belong to one of $K$ discrete distributions and formalize this problem via a loss that captures the labeling cost and the prediction error.
When the labeling cost is $B$, our algorithm, which chooses to label a point if the uncertainty is larger than a time- and cost-dependent threshold, achieves a worst-case upper bound of $\widetilde{O}(B^{1/\ldots})$; a minimal sketch of this threshold rule appears after this list.
arXiv Detail & Related papers (2023-04-13T20:23:27Z) - Improved Sample Complexity Bounds for Distributionally Robust Reinforcement Learning [3.222802562733787]
We consider the problem of learning a control policy that is robust against the parameter mismatches between the training environment and testing environment.
We propose the Robust Phased Value Learning (RPVL) algorithm to solve this problem for the uncertainty sets specified by four different divergences.
We show that our algorithm achieves $\tilde{\mathcal{O}}(|\mathcal{S}\mathcal{A}| H^{5})$ sample complexity, which is uniformly better than the existing results.
arXiv Detail & Related papers (2023-03-05T21:47:08Z) - Sequential Kernelized Independence Testing [77.237958592189]
We design sequential kernelized independence tests inspired by kernelized dependence measures. We demonstrate the power of our approaches on both simulated and real data.
arXiv Detail & Related papers (2022-12-14T18:08:42Z) - Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP).
We prove an upper bound of $\widetilde{O}(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := \mathrm{sp}(h^{*})$ is the span of the bias of any optimal policy, $\varepsilon$ is the accuracy, and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z) - Optimal Testing of Discrete Distributions with High Probability [49.19942805582874]
We study the problem of testing discrete distributions with a focus on the high probability regime.
We provide the first algorithms for closeness and independence testing that are sample-optimal, within constant factors.
arXiv Detail & Related papers (2020-09-14T16:09:17Z)
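For the "Active Cost-aware Labeling of Streaming Data" entry above, here is a minimal sketch of its uncertainty-vs-threshold rule: label a point only when the current uncertainty for its input distribution exceeds a time- and cost-dependent threshold. The paper's summary does not give the schedule, so the $(B/t)^{1/3}$ form below is an illustrative assumption, as are all variable names.
```python
import numpy as np

# Hedged sketch of an uncertainty-threshold labeling rule for streaming
# data from K discrete input distributions. The (B/t)^(1/3) threshold
# schedule is an assumed form, not the paper's exact choice.

rng = np.random.default_rng(1)
B, K, T = 5.0, 4, 10_000           # label cost, #distributions, stream length
true_means = rng.normal(size=K)    # unknown per-distribution targets

counts, sums, loss = np.zeros(K), np.zeros(K), 0.0
for t in range(1, T + 1):
    k = rng.integers(K)            # which distribution this point came from
    # Uncertainty proxy: width of a 1/sqrt(n) confidence interval.
    uncertainty = np.inf if counts[k] == 0 else 1.0 / np.sqrt(counts[k])
    if uncertainty > (B / t) ** (1 / 3):               # assumed threshold form
        sums[k] += true_means[k] + rng.normal(scale=0.5)  # pay B, observe noisy label
        counts[k] += 1
        loss += B
    est = sums[k] / counts[k] if counts[k] else 0.0
    loss += (est - true_means[k]) ** 2                 # prediction error this round

print(f"total loss {loss:.1f}, labels per distribution {counts.astype(int)}")
```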