Exploiting Estimation Bias in Deep Double Q-Learning for Actor-Critic
Methods
- URL: http://arxiv.org/abs/2402.09078v1
- Date: Wed, 14 Feb 2024 10:44:03 GMT
- Title: Exploiting Estimation Bias in Deep Double Q-Learning for Actor-Critic
Methods
- Authors: Alberto Sinigaglia, Niccol\`o Turcato, Alberto Dalla Libera, Ruggero
Carli, Gian Antonio Susto
- Abstract summary: We propose two novel algorithms: Expectile Delayed Deep Deterministic Policy Gradient (ExpD3) and Bias Exploiting - Twin Delayed Deep Deterministic Policy Gradient (BE-TD3)
ExpD3 aims to reduce overestimation bias with a single $Q$ estimate, while BE-TD3 is designed to dynamically select the most advantageous estimation bias during training.
We show that these algorithms can either match or surpass existing methods like TD3, particularly in environments where estimation biases significantly impact learning.
- Score: 6.403512866289237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces innovative methods in Reinforcement Learning (RL),
focusing on addressing and exploiting estimation biases in Actor-Critic methods
for continuous control tasks, using Deep Double Q-Learning. We propose two
novel algorithms: Expectile Delayed Deep Deterministic Policy Gradient (ExpD3)
and Bias Exploiting - Twin Delayed Deep Deterministic Policy Gradient (BE-TD3).
ExpD3 aims to reduce overestimation bias with a single $Q$ estimate, offering a
balance between computational efficiency and performance, while BE-TD3 is
designed to dynamically select the most advantageous estimation bias during
training. Our extensive experiments across various continuous control tasks
demonstrate the effectiveness of our approaches. We show that these algorithms
can either match or surpass existing methods like TD3, particularly in
environments where estimation biases significantly impact learning. The results
underline the importance of bias exploitation in improving policy learning in
RL.
Related papers
- On the Theory of Risk-Aware Agents: Bridging Actor-Critic and Economics [0.7655800373514546]
Risk-aware Reinforcement Learning algorithms were shown to outperform risk-neutral counterparts in a variety of continuous-action tasks.
The theoretical basis for the pessimistic objectives these algorithms employ remains unestablished.
We propose Dual Actor-Critic (DAC) as a risk-aware, model-free algorithm that features two distinct actor networks.
arXiv Detail & Related papers (2023-10-30T13:28:06Z) - Statistically Efficient Variance Reduction with Double Policy Estimation
for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z) - Simultaneous Double Q-learning with Conservative Advantage Learning for
Actor-Critic Methods [133.85604983925282]
We propose Simultaneous Double Q-learning with Conservative Advantage Learning (SDQ-CAL)
Our algorithm realizes less biased value estimation and achieves state-of-the-art performance in a range of continuous control benchmark tasks.
arXiv Detail & Related papers (2022-05-08T09:17:16Z) - Deterministic and Discriminative Imitation (D2-Imitation): Revisiting
Adversarial Imitation for Sample Efficiency [61.03922379081648]
We propose an off-policy sample efficient approach that requires no adversarial training or min-max optimization.
Our empirical results show that D2-Imitation is effective in achieving good sample efficiency, outperforming several off-policy extension approaches of adversarial imitation.
arXiv Detail & Related papers (2021-12-11T19:36:19Z) - AWD3: Dynamic Reduction of the Estimation Bias [0.0]
We introduce a technique that eliminates the estimation bias in off-policy continuous control algorithms using the experience replay mechanism.
We show through continuous control environments of OpenAI gym that our algorithm matches or outperforms the state-of-the-art off-policy policy gradient learning algorithms.
arXiv Detail & Related papers (2021-11-12T15:46:19Z) - Estimation Error Correction in Deep Reinforcement Learning for
Deterministic Actor-Critic Methods [0.0]
In value-based deep reinforcement learning methods, approximation of value functions induces overestimation bias and leads to suboptimal policies.
We show that in deep actor-critic methods that aim to overcome the overestimation bias, if the reinforcement signals received by the agent have a high variance, a significant underestimation bias arises.
To minimize the underestimation, we introduce a parameter-free, novel deep Q-learning variant.
arXiv Detail & Related papers (2021-09-22T13:49:35Z) - Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
arXiv Detail & Related papers (2021-05-17T20:16:46Z) - Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm, aim at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning, by maintaining a set of parallel models and estimate the Q-value based on a randomly selected network.
arXiv Detail & Related papers (2020-09-29T04:58:17Z) - Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it on correlated actions, and combine these critic estimated action values to control the variance of gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.