Correcting discount-factor mismatch in on-policy policy gradient methods
- URL: http://arxiv.org/abs/2306.13284v1
- Date: Fri, 23 Jun 2023 04:10:58 GMT
- Title: Correcting discount-factor mismatch in on-policy policy gradient methods
- Authors: Fengdi Che, Gautham Vasan, A. Rupam Mahmood
- Abstract summary: We introduce a novel distribution correction to account for the discounted stationary distribution.
Our algorithm consistently matches or exceeds the original performance on several OpenAI gym and DeepMind suite benchmarks.
- Score: 2.9005223064604078
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The policy gradient theorem gives a convenient form of the policy gradient in
terms of three factors: an action value, a gradient of the action likelihood,
and a state distribution involving discounting called the \emph{discounted
stationary distribution}. But commonly used on-policy methods based on the
policy gradient theorem ignore the discount factor in the state distribution,
which is technically incorrect and may even cause degenerate learning behavior
in some environments. An existing solution corrects this discrepancy by using
$\gamma^t$ as a factor in the gradient estimate. However, this solution is not
widely adopted and does not work well in tasks where the later states are
similar to earlier states. We introduce a novel distribution correction to
account for the discounted stationary distribution that can be plugged into
many existing gradient estimators. Our correction circumvents the performance
degradation associated with the $\gamma^t$ correction while incurring lower variance.
Importantly, compared to the uncorrected estimators, our algorithm provides
improved state emphasis to evade suboptimal policies in certain environments
and consistently matches or exceeds the original performance on several OpenAI
Gym and DeepMind suite benchmarks.
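For reference, the policy gradient theorem mentioned above can be written as $\nabla_\theta J(\theta) \propto \sum_s d_\gamma^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)$, where $d_\gamma^{\pi_\theta}(s) \propto \sum_{t \ge 0} \gamma^t \Pr(S_t = s \mid \pi_\theta)$ is the discounted stationary distribution. The following minimal NumPy sketch contrasts the commonly used uncorrected estimator with the classical $\gamma^t$-weighted estimator; the paper's learned distribution correction is not reproduced here, and all function names and the toy trajectory format are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def grad_log_pi(theta, state, action):
    """Stand-in for the score function grad_theta log pi_theta(action | state).

    A real agent would differentiate its policy network; here we return a
    deterministic pseudo-random vector of the right shape so that the two
    estimators below are runnable end to end.
    """
    rng = np.random.default_rng(abs(hash((state, action))) % (2**32))
    return rng.standard_normal(theta.shape)


def returns_to_go(rewards, gamma):
    """Discounted return G_t = sum_{k >= t} gamma**(k - t) * r_k for every step t."""
    g, out = 0.0, np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out


def pg_uncorrected(theta, trajectory, gamma):
    """Common practice: every visited state is weighted equally, i.e. states
    are treated as samples from the *undiscounted* visitation distribution."""
    states, actions, rewards = zip(*trajectory)
    returns = returns_to_go(rewards, gamma)
    grad = np.zeros_like(theta)
    for t, (s, a) in enumerate(zip(states, actions)):
        grad += returns[t] * grad_log_pi(theta, s, a)
    return grad / len(trajectory)


def pg_gamma_t_corrected(theta, trajectory, gamma):
    """Classical fix: scale step t by gamma**t so that states are effectively
    weighted by the discounted stationary distribution. Unbiased, but late
    time steps contribute almost nothing, which hurts sample efficiency."""
    states, actions, rewards = zip(*trajectory)
    returns = returns_to_go(rewards, gamma)
    grad = np.zeros_like(theta)
    for t, (s, a) in enumerate(zip(states, actions)):
        grad += (gamma ** t) * returns[t] * grad_log_pi(theta, s, a)
    return grad / len(trajectory)


# Toy usage on a trajectory of (state, action, reward) triples.
theta = np.zeros(4)
trajectory = [(t % 3, t % 2, 1.0) for t in range(50)]
g_plain = pg_uncorrected(theta, trajectory, gamma=0.99)
g_discounted = pg_gamma_t_corrected(theta, trajectory, gamma=0.99)
```

The $\gamma^t$-weighted version restores the state weighting demanded by the theorem, but it discards most of the signal from late time steps; that is the variance and performance issue the abstract's proposed distribution correction is designed to avoid.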
Related papers
- Towards Optimal Offline Reinforcement Learning [9.13232872223434]
We study offline reinforcement learning problems with a long-run average reward objective.
The state-action pairs generated by any fixed behavioral policy follow a Markov chain, and the empirical distribution of that chain satisfies a large deviations principle.
We use the rate function of this large deviations principle to construct an uncertainty set for the unknown true state-action-next-state distribution.
arXiv Detail & Related papers (2025-03-15T22:41:55Z)
- Distributionally Robust Policy Learning under Concept Drifts [33.44768994272614]
This paper studies a more nuanced problem: robust policy learning under concept drift.
We first provide a doubly-robust estimator for evaluating the worst-case average reward of a given policy.
We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class.
arXiv Detail & Related papers (2024-12-18T19:53:56Z)
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods can significantly benefit from importance sampling (IS), which enables the effective reuse of previously collected samples.
However, IS is typically employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions: Applications to Product-Form Stochastic Networks and Queueing Systems [1.747623282473278]
We introduce a policy-gradient method for model-based reinforcement learning (RL) that exploits a type of stationary distribution commonly obtained from Markov decision processes (MDPs) in stochastic networks and queueing systems.
Specifically, when the stationary distribution of the MDP is parametrized by the policy parameters, we can improve existing policy-gradient methods for average-reward estimation.
arXiv Detail & Related papers (2023-12-05T14:44:58Z)
- Revisiting Estimation Bias in Policy Gradients for Deep Reinforcement Learning [0.0]
We revisit the estimation bias in policy gradients for the discounted episodic Markov decision process (MDP) from the Deep Reinforcement Learning perspective.
One of the major sources of policy gradient bias is the state distribution shift.
We show that, despite such state distribution shift, the policy gradient estimation bias can be reduced in the following three ways.
arXiv Detail & Related papers (2023-01-20T06:46:43Z)
- The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the \emph{state value baseline} allows on-policy \emph{natural policy gradient} (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate.
We find that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance.
arXiv Detail & Related papers (2023-01-16T06:28:00Z)
- A Temporal-Difference Approach to Policy Gradient Estimation [27.749993205038148]
We propose a new approach to reconstructing the policy gradient from the start state without requiring a particular sampling strategy.
By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that sidesteps the distribution shift issue in a model-free way.
arXiv Detail & Related papers (2022-02-04T21:23:33Z)
- Improper Learning with Gradient-based Policy Optimization [62.50997487685586]
We consider an improper reinforcement learning setting where the learner is given M base controllers for an unknown Markov Decision Process.
We propose a gradient-based approach that operates over a class of improper mixtures of the controllers.
arXiv Detail & Related papers (2021-02-16T14:53:55Z)
- Implicit Distributional Reinforcement Learning [61.166030238490634]
We propose an implicit distributional actor-critic (IDAC) built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe that IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)