Related papers: Exploiting Estimation Bias in Deep Double Q-Learning for Actor-Critic Methods

Exploiting Estimation Bias in Deep Double Q-Learning for Actor-Critic Methods

URL: http://arxiv.org/abs/2402.09078v1
Date: Wed, 14 Feb 2024 10:44:03 GMT
Title: Exploiting Estimation Bias in Deep Double Q-Learning for Actor-Critic Methods
Authors: Alberto Sinigaglia, Niccol\`o Turcato, Alberto Dalla Libera, Ruggero Carli, Gian Antonio Susto
Abstract summary: We propose two novel algorithms: Expectile Delayed Deep Deterministic Policy Gradient (ExpD3) and Bias Exploiting - Twin Delayed Deep Deterministic Policy Gradient (BE-TD3) ExpD3 aims to reduce overestimation bias with a single $Q$ estimate, while BE-TD3 is designed to dynamically select the most advantageous estimation bias during training. We show that these algorithms can either match or surpass existing methods like TD3, particularly in environments where estimation biases significantly impact learning.
Score: 6.403512866289237
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper introduces innovative methods in Reinforcement Learning (RL), focusing on addressing and exploiting estimation biases in Actor-Critic methods for continuous control tasks, using Deep Double Q-Learning. We propose two novel algorithms: Expectile Delayed Deep Deterministic Policy Gradient (ExpD3) and Bias Exploiting - Twin Delayed Deep Deterministic Policy Gradient (BE-TD3). ExpD3 aims to reduce overestimation bias with a single $Q$ estimate, offering a balance between computational efficiency and performance, while BE-TD3 is designed to dynamically select the most advantageous estimation bias during training. Our extensive experiments across various continuous control tasks demonstrate the effectiveness of our approaches. We show that these algorithms can either match or surpass existing methods like TD3, particularly in environments where estimation biases significantly impact learning. The results underline the importance of bias exploitation in improving policy learning in RL.

Related papers

Diffusion Guidance Is a Controllable Policy Improvement Operator [98.11511661904618]
CFGRL is trained with the simplicity of supervised learning, yet can further improve on the policies in the data.<n>On offline RL tasks, we observe a reliable trend -- increased guidance weighting leads to increased performance.
arXiv Detail & Related papers (2025-05-29T14:06:50Z)
A Comprehensive Survey of Reinforcement Learning: From Algorithms to Practical Challenges [2.2448567386846916]
Reinforcement Learning (RL) has emerged as a powerful paradigm in Artificial Intelligence (AI) This paper presents a comprehensive survey of RL, meticulously analyzing a wide range of algorithms. We offer practical insights into the selection and implementation of RL algorithms, addressing common challenges like convergence, stability, and the exploration-exploitation dilemma.
arXiv Detail & Related papers (2024-11-28T03:53:14Z)
Improving Deep Reinforcement Learning by Reducing the Chain Effect of Value and Policy Churn [14.30387204093346]
Deep neural networks provide Reinforcement Learning (RL) powerful function approximators to address large-scale decision-making problems. One source of the challenges in RL is that output predictions can churn, leading to uncontrolled changes after each batch update for states not included in the batch. We propose a method to reduce the chain effect across different settings, called Churn Approximated ReductIoN (CHAIN), which can be easily plugged into most existing DRL algorithms.
arXiv Detail & Related papers (2024-09-07T11:08:20Z)
Hybrid Inverse Reinforcement Learning [34.793570631021005]
inverse reinforcement learning approach to imitation learning is a double-edged sword. We propose using hybrid RL -- training on a mixture of online and expert data -- to curtail unnecessary exploration. We derive both model-free and model-based hybrid inverse RL algorithms with strong policy performance guarantees.
arXiv Detail & Related papers (2024-02-13T23:29:09Z)
Efficient Deep Reinforcement Learning Requires Regulating Overfitting [91.88004732618381]
We show that high temporal-difference (TD) error on the validation set of transitions is the main culprit that severely affects the performance of deep RL algorithms. We show that a simple online model selection method that targets the validation TD error is effective across state-based DMC and Gym tasks.
arXiv Detail & Related papers (2023-04-20T17:11:05Z)
Flexible Attention-Based Multi-Policy Fusion for Efficient Deep Reinforcement Learning [78.31888150539258]
Reinforcement learning (RL) agents have long sought to approach the efficiency of human learning. Prior studies in RL have incorporated external knowledge policies to help agents improve sample efficiency. We present Knowledge-Grounded RL (KGRL), an RL paradigm fusing multiple knowledge policies and aiming for human-like efficiency and flexibility.
arXiv Detail & Related papers (2022-10-07T17:56:57Z)
Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy. In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks. We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration. Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states. We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
arXiv Detail & Related papers (2021-05-17T20:16:46Z)
Deep RL With Information Constrained Policies: Generalization in Continuous Control [21.46148507577606]
We show that a natural constraint on information flow might confer onto artificial agents in continuous control tasks. We implement a novel Capacity-Limited Actor-Critic (CLAC) algorithm. Our experiments show that compared to alternative approaches, CLAC offers improvements in generalization between training and modified test environments.
arXiv Detail & Related papers (2020-10-09T15:42:21Z)
Dynamics Generalization via Information Bottleneck in Deep Reinforcement Learning [90.93035276307239]
We propose an information theoretic regularization objective and an annealing-based optimization method to achieve better generalization ability in RL agents. We demonstrate the extreme generalization benefits of our approach in different domains ranging from maze navigation to robotic tasks. This work provides a principled way to improve generalization in RL by gradually removing information that is redundant for task-solving.
arXiv Detail & Related papers (2020-08-03T02:24:20Z)
Self-Paced Deep Reinforcement Learning [42.467323141301826]
Curriculum reinforcement learning (CRL) improves the learning speed and stability of an agent by exposing it to a tailored series of tasks throughout learning. Despite empirical successes, an open question in CRL is how to automatically generate a curriculum for a given reinforcement learning (RL) agent, avoiding manual design. We propose an answer by interpreting the curriculum generation as an inference problem, where distributions over tasks are progressively learned to approach the target task. This approach leads to an automatic curriculum generation, whose pace is controlled by the agent, with solid theoretical motivation and easily integrated with deep RL algorithms.
arXiv Detail & Related papers (2020-04-24T15:48:07Z)
Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension. We construct a critic to estimate action-value functions, apply it on correlated actions, and combine these critic estimated action values to control the variance of gradient estimation. These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.