On Gap-dependent Bounds for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2206.00177v1
- Date: Wed, 1 Jun 2022 01:44:12 GMT
- Title: On Gap-dependent Bounds for Offline Reinforcement Learning
- Authors: Xinqi Wang, Qiwen Cui and Simon S. Du
- Abstract summary: This paper presents a systematic study on gap-dependent sample complexity in offline reinforcement learning.
Under the optimal policy coverage assumption, the rate can be improved to $O\left(\frac{1}{\epsilon}\right)$ when there is a positive sub-optimality gap in the optimal $Q$-function.
We show when the visitation probabilities of the behavior policy are uniformly lower bounded for states where an optimal policy's visitation probabilities are positive, the sample complexity of identifying an optimal policy is independent of $\frac{1}{\epsilon}$.
- Score: 40.92345387517103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a systematic study on gap-dependent sample complexity in
offline reinforcement learning. Prior work showed when the density ratio
between an optimal policy and the behavior policy is upper bounded (the optimal
policy coverage assumption), then the agent can achieve an
$O\left(\frac{1}{\epsilon^2}\right)$ rate, which is also minimax optimal. We
show under the optimal policy coverage assumption, the rate can be improved to
$O\left(\frac{1}{\epsilon}\right)$ when there is a positive sub-optimality gap
in the optimal $Q$-function. Furthermore, we show when the visitation
probabilities of the behavior policy are uniformly lower bounded for states
where an optimal policy's visitation probabilities are positive (the uniform
optimal policy coverage assumption), the sample complexity of identifying an
optimal policy is independent of $\frac{1}{\epsilon}$. Lastly, we present
nearly-matching lower bounds to complement our gap-dependent upper bounds.
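To make the assumptions concrete, here is a minimal tabular sketch (made-up numbers, not from the paper) of the three quantities the abstract refers to: the sub-optimality gap of the optimal $Q$-function, the density ratio behind the optimal policy coverage assumption, and the lower bound on behavior visitation behind the uniform optimal policy coverage assumption. Names such as `q_star` and `d_behavior` are illustrative.

```python
import numpy as np

# Illustrative tabular quantities (made-up numbers, not from the paper).
# q_star[s, a]: optimal Q-values for 3 states and 2 actions.
q_star = np.array([[1.0, 0.4],
                   [0.7, 0.7],
                   [0.2, 0.9]])

v_star = q_star.max(axis=1)                 # V*(s) = max_a Q*(s, a)
gap = v_star[:, None] - q_star              # Delta(s, a) = V*(s) - Q*(s, a)
positive_gaps = gap[gap > 0]
gap_min = positive_gaps.min() if positive_gaps.size else 0.0
print("minimum positive sub-optimality gap:", gap_min)

# Occupancy measures of an optimal policy and the behavior policy.
d_optimal = np.array([0.5, 0.3, 0.2])
d_behavior = np.array([0.25, 0.25, 0.5])

# Optimal policy coverage: the density ratio d*(s) / d_b(s) is upper bounded.
concentrability = np.max(d_optimal / d_behavior)
print("density-ratio (concentrability) coefficient:", concentrability)

# Uniform optimal policy coverage: d_b(s) is lower bounded wherever d*(s) > 0.
support = d_optimal > 0
uniform_lower_bound = d_behavior[support].min()
print("behavior lower bound on the optimal policy's support:", uniform_lower_bound)
```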
Related papers
- Nearly Optimal Sample Complexity of Offline KL-Regularized Contextual Bandits under Single-Policy Concentrability [49.96531901205305]
We propose the first algorithm with $\widetilde{O}(\epsilon^{-1})$ sample complexity under single-policy concentrability for offline contextual bandits.
Our proof leverages the strong convexity of the KL regularization, and the conditional non-negativity of the gap between the true reward and its pessimistic estimator.
We extend our algorithm to contextual dueling bandits and achieve a similar nearly optimal sample complexity.
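As an illustration only (not the paper's algorithm), the sketch below combines a pessimistic reward estimate with KL regularization toward a reference policy; the confidence width, `eta`, and all numbers are hypothetical. The KL-regularized maximizer has the familiar Gibbs/softmax closed form, which is where the strong convexity of the KL term enters.

```python
import numpy as np

# Illustrative sketch (not the paper's algorithm): combine a pessimistic
# reward estimate with KL regularization toward a reference policy.
rng = np.random.default_rng(0)

n_actions = 4
r_hat = rng.uniform(0.0, 1.0, n_actions)         # empirical mean rewards
counts = rng.integers(5, 50, n_actions)          # per-action sample counts
bonus = np.sqrt(np.log(10.0) / (2 * counts))     # hypothetical confidence width
r_lcb = r_hat - bonus                            # pessimistic (lower) estimate

pi_ref = np.full(n_actions, 1.0 / n_actions)     # reference / behavior policy
eta = 0.5                                        # KL-regularization strength

# The KL-regularized maximizer has the Gibbs/softmax form:
#   pi(a) proportional to pi_ref(a) * exp(r_lcb(a) / eta)
logits = np.log(pi_ref) + r_lcb / eta
pi = np.exp(logits - logits.max())
pi /= pi.sum()
print("pessimistic KL-regularized policy:", np.round(pi, 3))
```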
arXiv Detail & Related papers (2025-02-09T22:14:45Z) - Distributionally Robust Policy Learning under Concept Drifts [33.44768994272614]
This paper studies a more nuanced problem -- robust policy learning under concept drift.
We first provide a doubly-robust estimator for evaluating the worst-case average reward of a given policy.
We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class.
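For reference, the textbook doubly-robust value estimator (the non-robust building block that the worst-case estimator in this paper generalizes) looks like the following sketch; the logging policy, the reward model `m_hat`, and all numbers are made up.

```python
import numpy as np

# Textbook doubly-robust (DR) value estimator for a fixed target policy,
# given data logged by a behavior policy.  All names and numbers here are
# illustrative; this is not the paper's worst-case (robust) estimator.
rng = np.random.default_rng(1)

n, n_actions = 1000, 3
pi_b = np.full(n_actions, 1.0 / n_actions)       # uniform logging policy
actions = rng.integers(0, n_actions, size=n)     # logged actions
true_mean = np.array([0.3, 0.4, 0.6])            # synthetic reward means
rewards = rng.binomial(1, true_mean[actions])    # logged rewards

pi_target = np.array([0.1, 0.1, 0.8])            # stochastic target policy
m_hat = np.array([0.25, 0.45, 0.55])             # hypothetical reward model

# DR estimate: model-based term plus an importance-weighted correction.
direct_term = np.dot(pi_target, m_hat)
weights = pi_target[actions] / pi_b[actions]
correction = weights * (rewards - m_hat[actions])
v_dr = direct_term + correction.mean()
print("doubly-robust value estimate:", round(v_dr, 3))
print("true value:", round(float(np.dot(pi_target, true_mean)), 3))
```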
arXiv Detail & Related papers (2024-12-18T19:53:56Z) - Theoretical guarantees on the best-of-n alignment policy [110.21094183592358]
We show that the commonly used analytical formula for the KL divergence between the best-of-$n$ policy and the reference policy is an upper bound on the actual KL divergence.
We also propose a new estimator for the KL divergence and empirically show that it provides a tight approximation.
We conclude with analyzing the tradeoffs between win rate and KL divergence of the best-of-$n$ alignment policy.
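A small worked example, under the assumption that outcomes are drawn i.i.d. from a discrete reference distribution and ranked by reward, comparing the exact KL divergence of the best-of-$n$ policy with the commonly used formula $\log(n) - (n-1)/n$; the probabilities below are illustrative.

```python
import numpy as np

# Compare the exact KL divergence of the best-of-n policy from the reference
# policy with the analytical formula log(n) - (n-1)/n on a small discrete
# example (illustrative numbers only).
n = 4
p = np.array([0.5, 0.3, 0.15, 0.05])      # reference probabilities
# Outcomes are assumed ordered by strictly increasing reward, so best-of-n
# keeps the highest-index outcome among n i.i.d. draws.
F = np.cumsum(p)
F_prev = np.concatenate(([0.0], F[:-1]))
q = F**n - F_prev**n                       # exact best-of-n distribution

kl_exact = np.sum(q * np.log(q / p))
kl_formula = np.log(n) - (n - 1) / n
print("exact KL(best-of-n || reference):", round(float(kl_exact), 4))
print("formula log(n) - (n-1)/n:", round(float(kl_formula), 4))
# In this discrete example the formula indeed upper-bounds the exact KL.
```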
arXiv Detail & Related papers (2024-01-03T18:39:13Z) - Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of Neu (2015) yields performance guarantees that are superior in nearly all possible terms to all previous results.
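The sketch below contrasts vanilla importance weighting with the implicit-exploration style of weighting, in which the behavior probability in the denominator is inflated by a small $\gamma > 0$; the toy bandit setup and the value of `gamma` are illustrative, not from the paper.

```python
import numpy as np

# Vanilla importance weighting vs. "implicit exploration" (IX) weighting.
# The setup and gamma value are illustrative, not from the paper.
rng = np.random.default_rng(2)

n, n_actions = 5000, 5
pi_b = np.array([0.6, 0.2, 0.1, 0.05, 0.05])     # behavior policy
actions = rng.choice(n_actions, size=n, p=pi_b)
true_mean = np.array([0.2, 0.3, 0.4, 0.5, 0.9])
rewards = rng.binomial(1, true_mean[actions])

target = 4                                       # target policy: always play arm 4
hit = (actions == target).astype(float)

ips = (hit / pi_b[actions]) * rewards            # unbiased but high variance
gamma = 0.05
ix = (hit / (pi_b[actions] + gamma)) * rewards   # slightly biased, lower variance

print("IPS estimate:", round(float(ips.mean()), 3), " std:", round(float(ips.std()), 2))
print("IX  estimate:", round(float(ix.mean()), 3), " std:", round(float(ix.std()), 2))
print("true value of the target policy:", true_mean[target])
```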
arXiv Detail & Related papers (2023-09-27T16:42:10Z) - The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the state value baseline allows on-policy natural policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate.
We find that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance.
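To show where a state value baseline enters a policy-gradient update, here is a plain REINFORCE-with-baseline sketch (not the natural policy gradient analyzed in the paper); all hyperparameters are illustrative.

```python
import numpy as np

# Minimal REINFORCE-with-baseline sketch on a one-state problem, only to
# illustrate where a state value baseline enters the gradient estimate.
rng = np.random.default_rng(3)

true_mean = np.array([0.2, 0.5, 0.8])
theta = np.zeros(3)                       # softmax policy parameters
v_baseline = 0.0                          # running estimate of V(s)
lr, baseline_lr = 0.5, 0.1

for _ in range(2000):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    a = rng.choice(3, p=pi)
    r = rng.binomial(1, true_mean[a])

    # Gradient of log pi(a) is one-hot(a) - pi.  The baseline shifts the
    # scalar multiplier (r - b) without changing the gradient's expectation.
    grad_log_pi = -pi.copy()
    grad_log_pi[a] += 1.0
    theta += lr * (r - v_baseline) * grad_log_pi
    v_baseline += baseline_lr * (r - v_baseline)

pi = np.exp(theta - theta.max()); pi /= pi.sum()
print("learned policy:", np.round(pi, 3))   # should concentrate on arm 2
```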
arXiv Detail & Related papers (2023-01-16T06:28:00Z) - Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
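The underlying principle, selecting by a lower confidence bound rather than a point estimate, can be sketched as follows; this is a generic LCB rule with a Hoeffding-style width, not the paper's PPL algorithm.

```python
import numpy as np

# Pessimism in a nutshell: pick the action (or policy) maximizing a lower
# confidence bound on its value instead of the point estimate.  Counts,
# widths, and rewards below are illustrative.
rng = np.random.default_rng(4)

n_actions = 4
counts = np.array([500, 40, 5, 200])                 # how often each action was logged
true_mean = np.array([0.50, 0.55, 0.60, 0.52])
est_mean = np.array([rng.binomial(c, m) / c for c, m in zip(counts, true_mean)])

width = np.sqrt(np.log(20.0) / (2 * counts))         # Hoeffding-style width
lcb = est_mean - width

greedy_choice = int(np.argmax(est_mean))             # can be fooled by rarely-seen actions
pessimistic_choice = int(np.argmax(lcb))             # discounts poorly covered actions
print("point-estimate choice:", greedy_choice)
print("LCB (pessimistic) choice:", pessimistic_choice)
```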
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z) - Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation
for Reinforcement Learning [43.61029925616256]
Offline policy evaluation (OPE) in Reinforcement Learning (RL) is a critical step towards applying RL in real-life applications.
We address this problem by simultaneously evaluating all policies in a policy class $\Pi$ -- uniform convergence in OPE.
Our results imply that the model-based planning achieves an optimal episode complexity of $\widetilde{O}(H^3/(d_m\epsilon^2))$ in identifying an $\epsilon$-optimal policy.
arXiv Detail & Related papers (2020-07-07T19:44:14Z)