Fugu-MT Paper Translation (Abstract): Online Learning of Optimal Sequential Testing Policies

Paper overview: Online Learning of Optimal Sequential Testing Policies

arxiv url: http://arxiv.org/abs/2509.03707v1
Date: Wed, 03 Sep 2025 20:44:32 GMT
Status: Translation complete
Last updated in system: 2025-09-05 20:21:09.968457
Title: Online Learning of Optimal Sequential Testing Policies
Title (reference translation): Online Learning of Optimal Sequential Testing
Authors: Qiyuan Chen, Raed Al Kontar
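Abstract summary: We study an online learning problem that seeks optimal testing policies for a stream of subjects. Running every candidate test on a subject yields more information, but it is often preferable to select only a subset. We prove that the minimax regret must scale at least as $\Omega(T^{\frac{2}{3}})$, in contrast to the $\Theta(\sqrt{T})$ rate for episodic MDPs.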
Reference score (independently computed attention score): 7.8024154978341365
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper studies an online learning problem that seeks optimal testing policies for a stream of subjects, each of whom can be evaluated through a sequence of candidate tests drawn from a common pool. We refer to this problem as the Online Testing Problem (OTP). Although conducting every candidate test for a subject provides more information, it is often preferable to select only a subset when tests are correlated and costly, and make decisions with partial information. If the joint distribution of test outcomes were known, the problem could be cast as a Markov Decision Process (MDP) and solved exactly. In practice, this distribution is unknown and must be learned online as subjects are tested. When a subject is not fully tested, the resulting missing data can bias estimates, making the problem fundamentally harder than standard episodic MDPs. We prove that the minimax regret must scale at least as $\Omega(T^{\frac{2}{3}})$, in contrast to the $\Theta(\sqrt{T})$ rate in episodic MDPs, revealing the difficulty introduced by missingness. This elevated lower bound is then matched by an Explore-Then-Commit algorithm whose cumulative regret is $\tilde{O}(T^{\frac{2}{3}})$ for both discrete and Gaussian distributions. To highlight the consequence of missingness-dependent rewards in OTP, we study a variant called the Online Cost-sensitive Maximum Entropy Sampling Problem, where rewards are independent of missing data. This structure enables an iterative-elimination algorithm that achieves $\tilde{O}(\sqrt{T})$ regret, breaking the $\Omega(T^{\frac{2}{3}})$ lower bound for OTP. Numerical results confirm our theory in both settings. Overall, this work deepens the understanding of the exploration--exploitation trade-off under missing data and guides the design of efficient sequential testing policies.
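Abstract (reference translation): This paper studies an online learning problem that seeks optimal testing policies for a stream of subjects, each of whom can be evaluated through a sequence of candidate tests drawn from a common pool. We call this the Online Testing Problem (OTP). Running every candidate test on a subject yields more information, but when tests are correlated and costly it is often preferable to select only a subset and decide with partial information. If the joint distribution of test outcomes were known, the problem could be cast as a Markov Decision Process (MDP) and solved exactly. In practice, this distribution is unknown and must be learned online as subjects are tested. When a subject is not fully tested, the resulting missing data can bias estimates, which makes the problem fundamentally harder than standard episodic MDPs. We prove that the minimax regret must scale at least as $\Omega(T^{\frac{2}{3}})$, in contrast to the $\Theta(\sqrt{T})$ rate in episodic MDPs. This elevated lower bound is matched by an Explore-Then-Commit algorithm whose cumulative regret is $\tilde{O}(T^{\frac{2}{3}})$ for both discrete and Gaussian distributions. To highlight the consequence of missingness-dependent rewards in OTP, we study a variant called the Online Cost-sensitive Maximum Entropy Sampling Problem, in which rewards are independent of missing data. This structure enables an iterative-elimination algorithm that achieves $\tilde{O}(\sqrt{T})$ regret, breaking the $\Omega(T^{\frac{2}{3}})$ lower bound for OTP. Numerical results confirm the theory in both settings. Overall, this work deepens the understanding of the exploration-exploitation trade-off under missing data and guides the design of efficient sequential testing policies.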
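
The $T^{\frac{2}{3}}$ rate quoted for Explore-Then-Commit can be illustrated with the textbook balancing heuristic. This is a minimal sketch, not the paper's analysis: it assumes that each of $m$ exploration rounds costs a constant amount of regret and that the committed policy is suboptimal by roughly $1/\sqrt{m}$ per round over horizon $T$. The function names and the constants `explore_cost` and `est_const` are hypothetical and chosen only for illustration.

```python
import math


def etc_regret_bound(m, horizon, explore_cost=1.0, est_const=1.0):
    """Heuristic bound (an assumption, not the paper's result):
    explore_cost * m  +  est_const * horizon / sqrt(m)."""
    return explore_cost * m + est_const * horizon / math.sqrt(m)


def best_exploration_length(horizon, explore_cost=1.0, est_const=1.0):
    """Closed-form minimiser of the bound above over real m > 0.

    Setting the derivative to zero gives
    m = (est_const * horizon / (2 * explore_cost)) ** (2/3).
    """
    return (est_const * horizon / (2.0 * explore_cost)) ** (2.0 / 3.0)


def fitted_exponent(horizons):
    """Least-squares slope of log(bound at optimal m) against log(T)."""
    xs = [math.log(t) for t in horizons]
    ys = [
        math.log(etc_regret_bound(best_exploration_length(t), t))
        for t in horizons
    ]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    for t in (1e3, 1e4, 1e5, 1e6):
        m = best_exploration_length(t)
        print(f"T={t:.0e}  m*={m:.1f}  bound={etc_regret_bound(m, t):.1f}")
    print("fitted exponent:", fitted_exponent([1e3, 1e4, 1e5, 1e6]))
```

Under these assumptions the optimal exploration length and the resulting bound both grow as $T^{\frac{2}{3}}$, so the fitted exponent is $2/3$ rather than the $1/2$ associated with a $\sqrt{T}$ rate.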

List of related papers