Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach
- URL: http://arxiv.org/abs/2208.00755v4
- Date: Mon, 25 Sep 2023 19:21:25 GMT
- Title: Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach
- Authors: Baturay Saglam, Dogan C. Cicek, Furkan B. Mutlu, Suleyman S. Kozat
- Abstract summary: We introduce a novel policy similarity measure to mitigate the effects of such a discrepancy in continuous control.
Our method offers an adequate single-step off-policy correction that is applicable to deterministic policy networks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compared to its on-policy counterparts, off-policy model-free deep
reinforcement learning can improve data efficiency by reusing previously
gathered data. However, off-policy learning becomes challenging when the
discrepancy between the underlying distributions of the agent's policy and the
collected data increases. Although the well-studied importance sampling and
off-policy policy gradient techniques compensate for this discrepancy, they
usually require collections of long trajectories and introduce additional
problems such as vanishing or exploding gradients, or they discard many useful
experiences, ultimately increasing computational complexity. Moreover, their
generalization to continuous action domains or to policies approximated by
deterministic deep neural networks is strictly limited. To overcome these
limitations, we introduce a novel policy similarity measure that mitigates the
effects of this discrepancy in continuous control. Our method offers an
adequate single-step off-policy correction applicable to deterministic policy
networks. Theoretical and empirical studies demonstrate that it achieves
"safe" off-policy learning and substantially improves on the state of the art,
attaining higher returns in fewer steps than competing methods through an
effective schedule of the learning rates in Q-learning and policy
optimization.
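The abstract does not spell out the functional form of the policy similarity measure, so the following is only a minimal sketch of the general idea in Python: down-weight replay transitions whose stored behavior action differs from what the current deterministic policy would produce. The Gaussian kernel, the sigma parameter, and all function names here are illustrative assumptions, not the paper's actual formulation.

```python
import torch

def similarity_weights(policy, states, behavior_actions, sigma=0.5):
    """Hypothetical single-step correction weight: a Gaussian kernel on the
    gap between the current deterministic policy's action and the action
    stored in the replay buffer. NOT the paper's actual measure."""
    with torch.no_grad():
        current_actions = policy(states)                    # a = pi(s)
        gap = torch.norm(current_actions - behavior_actions, dim=-1)
        return torch.exp(-gap ** 2 / (2 * sigma ** 2))      # in (0, 1]

def weighted_critic_loss(critic, targets, states, behavior_actions, weights):
    """Per-sample TD errors scaled by the similarity weights, so transitions
    generated by a very different policy contribute less to Q-learning."""
    q = critic(states, behavior_actions).squeeze(-1)
    return (weights * (q - targets) ** 2).mean()
```

In a DDPG/TD3-style loop, such weights would plausibly scale both the critic's TD loss and the actor's objective; the abstract only establishes that the correction is single-step and works with deterministic policy networks.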
Related papers
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
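The POP-QL summary above names two ingredients, sample reweighting and a policy constraint, without giving their exact form; below is a hedged Python sketch of how those two ingredients typically combine in an actor-critic loss, with the reweighting coefficients assumed to be given rather than derived from POP-QL's projection.

```python
import torch

def pop_ql_style_losses(critic, policy, batch, weights, gamma=0.99, beta=1.0):
    """Hypothetical rendering of the two ingredients named in the POP-QL
    summary: a per-sample reweighted Bellman error, plus a penalty keeping
    the policy close to the actions in the data. Not the actual algorithm."""
    s, a, r, s2, done = batch
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * critic(s2, policy(s2)).squeeze(-1)
    td = critic(s, a).squeeze(-1) - target
    critic_loss = (weights * td ** 2).mean()              # reweighted samples
    # Constrain the policy toward the data to prevent divergence.
    policy_loss = -critic(s, policy(s)).mean() + beta * ((policy(s) - a) ** 2).mean()
    return critic_loss, policy_loss
```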
- Time-Efficient Reinforcement Learning with Stochastic Stateful Policies [20.545058017790428]
We present a novel approach for training stateful policies by decomposing the latter into a stochastic internal state kernel and a stateless policy.
We introduce different versions of the stateful policy gradient theorem, enabling us to easily instantiate stateful variants of popular reinforcement learning algorithms.
arXiv Detail & Related papers (2023-11-07T15:48:07Z)
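As a rough illustration of the decomposition described above, a stateful policy can be written as a stochastic internal state kernel followed by a stateless action head; the Gaussian parameterization and layer sizes below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class StatefulPolicy(nn.Module):
    """Sketch: internal state z' ~ p(. | z, s) (stochastic kernel), then an
    action from a stateless head conditioned on (s, z'). Illustrative only."""
    def __init__(self, obs_dim, act_dim, z_dim=16):
        super().__init__()
        self.kernel = nn.Linear(obs_dim + z_dim, 2 * z_dim)  # mean, log-std of z'
        self.head = nn.Linear(obs_dim + z_dim, act_dim)      # stateless policy

    def forward(self, obs, z):
        mu, log_std = self.kernel(torch.cat([obs, z], -1)).chunk(2, -1)
        z_next = mu + log_std.exp() * torch.randn_like(mu)   # reparameterized sample
        action = self.head(torch.cat([obs, z_next], -1))
        return action, z_next
```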
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- State Augmented Constrained Reinforcement Learning: Overcoming the Limitations of Learning with Rewards [88.30521204048551]
A common formulation of constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds.
We show a simple example in which the desired optimal policy cannot be induced by any weighted linear combination of rewards.
This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods.
arXiv Detail & Related papers (2021-02-23T21:07:35Z)
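The mechanism above is concrete enough to sketch: the Lagrange multipliers are appended to the observation, and they are themselves updated by dual ascent on the constraint violations. The learning rate, thresholds, and function names are illustrative assumptions.

```python
import numpy as np

def dual_ascent_step(lmbda, accumulated_rewards, thresholds, lr=0.01):
    """Dual ascent on the Lagrange multipliers: grow a multiplier while its
    constraint (accumulated reward >= threshold) is violated, shrink it
    otherwise, keeping multipliers non-negative."""
    violation = thresholds - accumulated_rewards
    return np.maximum(0.0, lmbda + lr * violation)

def augmented_observation(obs, lmbda):
    """The policy conditions on (s, lambda), so a single policy can adapt as
    the multipliers move instead of being retrained per constraint weight."""
    return np.concatenate([obs, lmbda])
```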
- Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient [34.16700176918835]
Off-policy Reinforcement Learning holds the promise of better data efficiency.
Current off-policy policy gradient methods suffer from either high bias or high variance, often delivering unreliable estimates.
We propose a nonparametric Bellman equation, which can be solved in closed form.
arXiv Detail & Related papers (2020-10-27T13:40:06Z)
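A closed-form Bellman solve is easy to illustrate: if kernel weights over the sampled states stand in for the transition matrix, the fixed point of V = r + gamma * K V is a single linear system. The Gaussian kernel and bandwidth below are assumptions; the paper's actual nonparametric construction may differ.

```python
import numpy as np

def closed_form_values(states, next_states, rewards, gamma=0.99, bandwidth=1.0):
    """Sketch of a nonparametric Bellman equation: row-stochastic kernel
    weights K approximate the transition matrix over the sampled states, so
    V = (I - gamma * K)^-1 r is obtained in closed form."""
    gaps = np.linalg.norm(next_states[:, None, :] - states[None, :, :], axis=-1)
    K = np.exp(-gaps ** 2 / (2 * bandwidth ** 2))
    K /= K.sum(axis=1, keepdims=True)                 # normalize rows
    n = len(states)
    return np.linalg.solve(np.eye(n) - gamma * K, rewards)
```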
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning [95.60782037764928]
We show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we trace training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas from the propensity estimation literature can be used to importance-sample transitions from the replay buffer and update the policy so as to prevent performance deterioration.
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
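The third point above, using propensity estimation to importance-sample replay transitions, might look roughly like the following; the Gaussian likelihood model, the truncation constant, and the assumption that behavior propensities are stored alongside transitions are all illustrative.

```python
import torch

def propensity_weighted_policy_loss(critic, policy, states, behavior_actions,
                                    behavior_propensity, sigma=0.2):
    """Hedged sketch: score each replay transition by how likely the current
    noisy deterministic policy is to reproduce its action, relative to an
    estimated behavior propensity, then weight the policy objective."""
    with torch.no_grad():
        mu = policy(states)
        # Unnormalized Gaussian likelihood of the stored action under
        # exploration noise around the current policy's action.
        cur = torch.exp(-((behavior_actions - mu) ** 2).sum(-1) / (2 * sigma ** 2))
        w = (cur / behavior_propensity.clamp(min=1e-3)).clamp(max=10.0)  # truncated IS
    return -(w * critic(states, policy(states)).squeeze(-1)).mean()
```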
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
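In loss form, the proximity idea above amounts to subtracting a divergence penalty from the usual surrogate objective; how the visitation divergence is actually estimated is the paper's contribution and is treated as a given scalar in this sketch.

```python
import torch

def proximal_visitation_loss(log_prob_new, log_prob_old, advantages,
                             visitation_divergence, beta=1.0):
    """Generic rendering: importance-weighted surrogate advantage penalized
    by an (assumed given) estimate of the divergence between consecutive
    policies' discounted state-action visitation distributions."""
    ratio = torch.exp(log_prob_new - log_prob_old)
    surrogate = (ratio * advantages).mean()
    return -(surrogate - beta * visitation_divergence)
```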
- Population-Guided Parallel Policy Search for Reinforcement Learning [17.360163137926]
A new population-guided parallel learning scheme is proposed to enhance the performance of off-policy reinforcement learning (RL).
In the proposed scheme, multiple identical learners, each with its own value function and policy, share a common experience replay buffer and search for a good policy collaboratively, guided by information from the best policy.
arXiv Detail & Related papers (2020-01-09T10:13:57Z)
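One plausible reading of the guidance mechanism above, sketched with an assumed pull-toward-the-best-policy term added to each learner's objective (the coefficient and the distance measure are illustrative).

```python
import torch

def guided_policy_loss(critic, policy, best_policy, states, beta=0.1):
    """Sketch: each learner maximizes its own Q-values but is pulled toward
    the actions of the population's best-performing policy; all learners
    share one experience replay buffer."""
    with torch.no_grad():
        best_actions = best_policy(states)
    q_term = -critic(states, policy(states)).mean()
    guide_term = ((policy(states) - best_actions) ** 2).mean()
    return q_term + beta * guide_term
```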