Revisiting Estimation Bias in Policy Gradients for Deep Reinforcement
Learning
- URL: http://arxiv.org/abs/2301.08442v1
- Date: Fri, 20 Jan 2023 06:46:43 GMT
- Title: Revisiting Estimation Bias in Policy Gradients for Deep Reinforcement
Learning
- Authors: Haoxuan Pan (1 and 2), Deheng Ye (2), Xiaoming Duan (1), Qiang Fu (2),
Wei Yang (2), Jianping He (1), Mingfei Sun (3) ((1) Shanghai Jiaotong
University, (2) Tencent Inc, (3) The University of Manchester)
- Abstract summary: We revisit the estimation bias in policy gradients for the discounted episodic Markov decision process (MDP) from a Deep Reinforcement Learning perspective.
One of the major policy gradient biases is the state distribution shift.
We show that, despite this state distribution shift, the policy gradient estimation bias can be reduced in the following three ways.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We revisit the estimation bias in policy gradients for the discounted
episodic Markov decision process (MDP) from a Deep Reinforcement Learning (DRL)
perspective. The objective is formulated theoretically as the expected returns
discounted over the time horizon. One of the major policy gradient biases is
the state distribution shift: the state distribution used to estimate the
gradients differs from the theoretical formulation in that it does not take
into account the discount factor. Existing discussions of this bias in the
literature are limited to the tabular and softmax cases. In this paper, we
therefore extend the analysis to the DRL setting, where the policy is
parameterized, and show theoretically how this bias can lead to suboptimal
policies. We then discuss why empirically inaccurate implementations that use
the shifted state distribution can still be effective. We show that, despite
this state distribution shift, the policy gradient estimation bias can be
reduced in the following three ways: 1) a small learning rate; 2) an
adaptive-learning-rate-based optimizer; and 3) KL regularization. Specifically,
we show that a smaller learning rate, or an adaptive learning rate such as
that used by the Adam and RMSProp optimizers, makes policy optimization robust
to the bias. We further draw connections between these optimizers and
optimization regularization to show that both KL and reverse-KL regularization
can significantly rectify this bias. Moreover, we provide
extensive experiments on continuous control tasks to support our analysis. Our
paper sheds light on how successful policy gradient (PG) algorithms optimize
policies in the DRL setting and contributes insights into practical issues in DRL.
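For concreteness, the mismatch described above can be written in standard notation as follows; this is a sketch using our own (assumed) notation for the discounted objective and its exact gradient, not an equation copied from the paper.

```latex
% Sketch: discounted objective, its exact policy gradient, and the state
% distributions involved (notation assumed, not copied from the paper).
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
\begin{align*}
J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t \ge 0} \gamma^{t} r(s_t, a_t)\right],\\
\nabla_\theta J(\theta) &= \sum_{s} d_{\gamma}^{\pi_\theta}(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a),
\qquad d_{\gamma}^{\pi_\theta}(s) = \sum_{t \ge 0} \gamma^{t} \Pr(s_t = s).
\end{align*}
Common implementations instead sample states from the \emph{undiscounted}
visitation $d^{\pi_\theta}(s) \propto \sum_{t \ge 0} \Pr(s_t = s)$, dropping the
$\gamma^{t}$ weighting on states; this is the state distribution shift above.
\end{document}
```

As a hypothetical illustration of the third bias-reducing mechanism, the sketch below shows a KL-regularized policy-gradient loss in PyTorch; the function and variable names are ours and are not taken from the paper's implementation.

```python
# Hypothetical sketch of a KL-regularized policy-gradient loss (mechanism 3
# above). Assumes PyTorch; names are illustrative, not the paper's code.
import torch

def kl_regularized_pg_loss(logp_new: torch.Tensor,
                           logp_old: torch.Tensor,
                           advantages: torch.Tensor,
                           beta: float = 0.01) -> torch.Tensor:
    """REINFORCE-style surrogate plus a KL penalty toward the sampling policy.

    logp_new   -- log pi_theta(a|s) under the current policy (requires grad)
    logp_old   -- log pi_old(a|s) under the policy that collected the data
    advantages -- advantage estimates for the sampled (s, a) pairs
    beta       -- strength of the KL regularizer
    """
    pg_term = -(logp_new * advantages.detach()).mean()
    # Sample-based estimate of KL(pi_old || pi_theta): actions were drawn from
    # pi_old, so the mean of (log pi_old - log pi_theta) estimates that KL.
    kl_term = (logp_old.detach() - logp_new).mean()
    return pg_term + beta * kl_term
```

This is similar in spirit to the KL-penalty variant of PPO; the abstract notes that reverse-KL regularization has a comparable rectifying effect.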
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is typically employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, the (hyper)policies obtained at convergence are deployed only in their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the state value baseline allows on-policy natural policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate.
We find that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance (a minimal baseline sketch appears after this list).
arXiv Detail & Related papers (2023-01-16T06:28:00Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
- Beyond variance reduction: Understanding the true impact of baselines on policy optimization [24.09670734037029]
We show that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates.
We present theoretical results showing that, at least for bandit problems, curvature and noise are not sufficient to explain the learning dynamics.
arXiv Detail & Related papers (2020-08-31T17:52:09Z)
- PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning [35.044047991893365]
This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which balances the exploration-exploitation tradeoff using an ensemble of policies (the policy cover).
We show that PC-PG has strong guarantees under model misspecification that go beyond the standard worst-case $\ell_\infty$ assumptions.
We also complement the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
arXiv Detail & Related papers (2020-07-16T16:57:41Z)
- Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to the Bellman optimality and evaluation back-ups, which takes a more conservative update, can yield much stronger guarantees.
arXiv Detail & Related papers (2020-07-16T09:25:54Z)
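As referenced in the baseline entry above, here is a minimal PyTorch-style sketch of a state-value baseline inside a policy-gradient loss; it is an illustration under assumed names, not code from the cited paper.

```python
# Minimal illustration of a state-value baseline in a policy-gradient loss.
# Assumes PyTorch; names are illustrative, not code from the cited work.
import torch

def pg_loss_with_value_baseline(logp: torch.Tensor,
                                returns: torch.Tensor,
                                values: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style loss with returns centered by a learned state-value baseline.

    logp    -- log pi_theta(a_t | s_t) for the sampled actions (requires grad)
    returns -- discounted returns-to-go for the same time steps
    values  -- baseline predictions V(s_t), treated as constants here
    """
    advantages = returns - values.detach()  # baseline subtraction keeps the
                                            # gradient unbiased in expectation
    return -(logp * advantages).mean()      # minimizing ascends the return
```

Per the summary above, the cited analysis attributes the baseline's benefit mainly to less aggressive updates rather than to variance reduction.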
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.