Learning Thresholds with Latent Values and Censored Feedback
- URL: http://arxiv.org/abs/2312.04653v1
- Date: Thu, 7 Dec 2023 19:30:08 GMT
- Title: Learning Thresholds with Latent Values and Censored Feedback
- Authors: Jiahao Zhang, Tao Lin, Weiqiang Zheng, Zhe Feng, Yifeng Teng, Xiaotie
Deng
- Abstract summary: We study a problem where the unknown reward $g(\gamma, v)$ depends on the proposed threshold $\gamma$ and a latent value $v$, and the reward can be obtained only if the threshold is lower than or equal to the unknown latent value.
This problem has broad applications in practical scenarios, e.g., reserve price optimization in online auctions, online task assignments in crowdsourcing, and setting recruiting bars in hiring.
- Score: 18.129896050051432
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate the problem of actively learning a threshold in
latent space, where the unknown reward $g(\gamma, v)$ depends on the proposed
threshold $\gamma$ and the latent value $v$, and the reward can be achieved only if the
threshold is lower than or equal to the unknown latent value. This problem has
broad applications in practical scenarios, e.g., reserve price optimization in
online auctions, online task assignments in crowdsourcing, setting recruiting
bars in hiring, etc. We first characterize the query complexity of learning a
threshold with the expected reward at most $\epsilon$ smaller than the optimum
and prove that the number of queries needed can be infinitely large even when
$g(\gamma, v)$ is monotone with respect to both $\gamma$ and $v$. On the
positive side, we provide a tight query complexity
$\tilde{\Theta}(1/\epsilon^3)$ when $g$ is monotone and the CDF of value
distribution is Lipschitz. Moreover, we show a tight
$\tilde{\Theta}(1/\epsilon^3)$ query complexity can be achieved as long as $g$
satisfies one-sided Lipschitzness, which provides a complete characterization
for this problem. Finally, we extend this model to an online learning setting
and demonstrate a tight $\Theta(T^{2/3})$ regret bound using continuous-arm
bandit techniques and the aforementioned query complexity results.
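To make the censored-feedback query model concrete, here is a minimal Python sketch (an illustration, not the authors' algorithm): the latent-value distribution, the reward function $g$, and the naive grid-search learner below are placeholder assumptions, chosen only to show how a query with threshold $\gamma$ yields the reward $g(\gamma, v)$ when $\gamma \le v$ and nothing otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder assumptions (not from the paper): latent values v ~ Uniform[0, 1]
# and reward g(gamma, v) = gamma, i.e. a posted threshold that pays itself
# whenever it is accepted.
def sample_value() -> float:
    return rng.uniform(0.0, 1.0)

def g(gamma: float, v: float) -> float:
    return gamma

def query(gamma: float) -> float:
    """Censored feedback: the reward is received only if gamma <= latent value v."""
    v = sample_value()
    return g(gamma, v) if gamma <= v else 0.0

def grid_search_threshold(epsilon: float, queries_per_point: int = 2000) -> float:
    """Naive learner: estimate the expected reward on an epsilon-grid of
    thresholds and return the empirically best one (illustrative only)."""
    grid = np.arange(0.0, 1.0 + 1e-12, epsilon)
    estimates = [np.mean([query(gamma) for _ in range(queries_per_point)])
                 for gamma in grid]
    return float(grid[int(np.argmax(estimates))])

if __name__ == "__main__":
    print(grid_search_threshold(epsilon=0.05))  # should print a value near 0.5
```

With these placeholder choices (values uniform on $[0,1]$ and $g(\gamma, v) = \gamma$), the expected reward is $\gamma(1-\gamma)$, so the sketch should return a threshold near $0.5$; the paper's actual guarantees concern how many such queries are needed, on the order of $1/\epsilon^3$ under one-sided Lipschitzness.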
Related papers
- Beyond Minimax Rates in Group Distributionally Robust Optimization via a Novel Notion of Sparsity [14.396304498754688]
We introduce a novel notion of sparsity that we dub $(\lambda, \beta)$-sparsity.
We also give an adaptive algorithm which adapts to the best $(\lambda, \beta)$-sparsity condition that holds.
arXiv Detail & Related papers (2024-10-01T13:45:55Z) - Fast Rates for Bandit PAC Multiclass Classification [73.17969992976501]
We study multiclass PAC learning with bandit feedback, where inputs are classified into one of $K$ possible labels and feedback is limited to whether or not the predicted labels are correct.
Our main contribution is in designing a novel learning algorithm for the agnostic $(\varepsilon,\delta)$-PAC version of the problem.
arXiv Detail & Related papers (2024-06-18T08:54:04Z) - Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Adversarial Online Multi-Task Reinforcement Learning [12.421997449847153]
We consider the adversarial online multi-task reinforcement learning setting.
In each of $K$ episodes the learner is given an unknown task taken from a finite set of $M$ unknown finite-horizon MDP models.
The learner's objective is to minimize its regret with respect to the optimal policy for each task.
arXiv Detail & Related papers (2023-01-11T02:18:26Z) - A Few Expert Queries Suffices for Sample-Efficient RL with Resets and
Linear Value Approximation [16.29514743112387]
We study sample-efficient Reinforcement Learning (RL) in settings where only the optimal value function is assumed to be linearly realizable.
We present a statistically and computationally efficient algorithm (Delphi) for blending exploration with expert queries.
Delphi requires $\tilde{\mathcal{O}}(d)$ expert queries and a $\texttt{poly}(d,|\mathcal{A}|,1/\varepsilon)$ amount of exploratory samples to provably recover an $\varepsilon$-suboptimal policy.
arXiv Detail & Related papers (2022-07-18T01:39:13Z) - Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement
Learning with Latent Low-Rank Structure [9.759209713196718]
We consider a class of MDPs for which the associated optimal $Q^*$ function is low rank, where the latent features are unknown.
We show that under stronger low rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of $\tilde{O}\left((|S|+|A|)\mathrm{poly}(d,H)/\epsilon^2\right)$ for a rank-$d$ setting.
arXiv Detail & Related papers (2022-06-07T20:39:51Z) - Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations [79.66404989555566]
We consider the more realistic setting of agnostic RL with rich observation spaces and a fixed class of policies $\Pi$ that may not contain any near-optimal policy.
We provide an algorithm for this setting whose error is bounded in terms of the rank $d$ of the underlying MDP.
arXiv Detail & Related papers (2021-06-22T03:20:40Z) - Improved Sample Complexity for Incremental Autonomous Exploration in
MDPs [132.88757893161699]
We learn the set of $\epsilon$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps.
DisCo is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for any cost-sensitive shortest-path problem.
arXiv Detail & Related papers (2020-12-29T14:06:09Z) - Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation [30.137884459159107]
We consider the question of learning the $Q$-function in a sample-efficient manner for reinforcement learning with continuous state and action spaces.
We develop a simple, iterative learning algorithm that finds an $\epsilon$-optimal $Q$-function with sample complexity of $\widetilde{O}\left(\frac{1}{\epsilon^{\max(d_1, d_2)+2}}\right)$ when the optimal $Q$-function has low rank $r$ and the discount factor $\gamma$ is below a certain threshold.
arXiv Detail & Related papers (2020-06-11T00:55:35Z) - Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample
Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we show an algorithm with improved sample complexity.
arXiv Detail & Related papers (2020-06-06T13:34:41Z)