Conservative DDPG -- Pessimistic RL without Ensemble
- URL: http://arxiv.org/abs/2403.05732v2
- Date: Sun, 2 Jun 2024 19:40:48 GMT
- Title: Conservative DDPG -- Pessimistic RL without Ensemble
- Authors: Nitsan Soffair, Shie Mannor
- Abstract summary: DDPG is hindered by the overestimation bias problem.
Traditional solutions to this bias involve ensemble-based methods.
We propose a straightforward solution using a $Q$-target and incorporating a behavioral cloning (BC) loss penalty.
- Score: 48.61228614796803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DDPG is hindered by the overestimation bias problem, wherein its $Q$-estimates tend to overstate the actual $Q$-values. Traditional solutions to this bias involve ensemble-based methods, which require significant computational resources, or complex log-policy-based approaches, which are difficult to understand and implement. In contrast, we propose a straightforward solution using a $Q$-target and incorporating a behavioral cloning (BC) loss penalty. This solution, acting as an uncertainty measure, can be easily implemented with minimal code and without the need for an ensemble. Our empirical findings strongly support the superiority of Conservative DDPG over DDPG across various MuJoCo and Bullet tasks. We consistently observe better performance in all evaluated tasks and even competitive or superior performance compared to TD3 and TD7, all achieved with significantly reduced computational requirements.
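The mechanism is simple enough to sketch. Below is a minimal, hypothetical PyTorch rendering of the idea as stated in the abstract: the Bellman target is shifted down by a behavioral-cloning (BC) penalty measuring how far the current policy strays from replay-buffer actions, so poorly covered state-action pairs receive pessimistic targets. All names (`actor`, `critic`, the target copies, `bc_coef`) are placeholders, and the exact placement and weighting of the BC term is our assumption, not the paper's published pseudocode.

```python
import torch
import torch.nn.functional as F

def conservative_ddpg_update(batch, actor, critic, actor_t, critic_t,
                             actor_opt, critic_opt, gamma=0.99, bc_coef=0.1):
    """One plausible Conservative DDPG step: the Bellman target is
    pessimistically shifted by a BC penalty that grows when the policy
    strays from actions seen in the replay buffer."""
    s, a, r, s2, done = batch  # tensors of shape [B, ...], r/done of shape [B, 1]

    with torch.no_grad():
        # BC loss as an uncertainty proxy (assumption: squared L2 distance
        # between the policy's action and the buffer action)
        bc_penalty = ((actor(s) - a) ** 2).sum(dim=-1, keepdim=True)
        target_q = critic_t(s2, actor_t(s2))
        y = r + gamma * (1.0 - done) * target_q - bc_coef * bc_penalty

    # Standard DDPG critic regression onto the (penalized) target
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Standard DDPG actor update: maximize Q under the current policy
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```

Note that, unlike ensemble-based pessimism (e.g., TD3's twin critics), this requires only a single critic and a few extra lines in the target computation, which is the computational saving the abstract highlights.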
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z)
- Cost Minimization for Space-Air-Ground Integrated Multi-Access Edge Computing Systems [60.586531406445744]
Space-air-ground integrated multi-access edge computing (SAGIN-MEC) provides a promising solution for the rapidly developing low-altitude economy. We present a SAGIN-MEC architecture that enables coordination between user devices (UDs), uncrewed aerial vehicles (UAVs), and satellites.
arXiv Detail & Related papers (2025-10-24T15:03:07Z)
- Scalable Multi-Objective Robot Reinforcement Learning through Gradient Conflict Resolution [2.359524447776588]
We show how to resolve conflicts between task-based rewards and terms that regularise the policy towards realistic behaviour. We propose GCR-PPO, a modification to actor-critic optimisation that decomposes the actor update into objective-wise gradients. GCR-PPO improves on large-scale proximal policy optimisation with an average improvement of 9.5%, with high-conflict tasks observing a greater improvement.
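The abstract does not spell out the resolution rule, so the sketch below shows a PCGrad-style projection, a standard way to resolve conflicts between objective-wise gradients, purely as an assumed illustration (function and variable names are ours):

```python
import torch

def resolve_conflict(g_task: torch.Tensor, g_reg: torch.Tensor) -> torch.Tensor:
    """PCGrad-style conflict resolution between two flattened parameter
    gradients: if the regularisation gradient points against the task
    gradient, remove its conflicting component before summing, so the
    task objective keeps priority."""
    dot = torch.dot(g_task, g_reg)
    if dot < 0:  # gradients conflict (angle > 90 degrees)
        g_reg = g_reg - (dot / g_task.norm().pow(2)) * g_task
    return g_task + g_reg
```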
arXiv Detail & Related papers (2025-09-18T10:18:07Z)
- GDBA Revisited: Unleashing the Power of Guided Local Search for Distributed Constraint Optimization [23.069147641568467]
Local search is an important class of incomplete algorithms for solving Distributed Constraint Optimization Problems (DCOPs). We propose Distributed Guided Local Search (DGLS), a novel GLS framework for DCOPs that incorporates an adaptive violation condition to selectively penalize constraints with high cost. Our empirical results on various standard benchmarks demonstrate the clear superiority of DGLS over state-of-the-art baselines.
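For intuition, classic guided local search escapes local optima by bumping penalties on selected constraints; a rough sketch of that mechanism follows (the `Constraint` interface and the concrete violation condition are hypothetical, not the paper's exact rule):

```python
def update_penalties(constraints, penalties, assignment, threshold):
    """GLS-style penalty bump: constraints whose current cost under the
    assignment is high get penalised, reshaping the search landscape so
    local search is pushed out of the current local optimum."""
    for c in constraints:
        cost = c.cost(assignment)          # hypothetical Constraint.cost API
        if cost >= threshold:              # assumed form of the violation condition
            penalties[c] = penalties.get(c, 0) + 1
    return penalties
```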
arXiv Detail & Related papers (2025-08-09T09:12:06Z)
- Provably Sample-Efficient Robust Reinforcement Learning with Average Reward [4.530028899565083]
We propose a new algorithm designed for robust Markov Decision Processes (MDPs) with transition uncertainty characterized by $\ell_p$-norm and contamination models. Our algorithm operates without requiring any prior knowledge of the robust MDP. Our work provides essential theoretical understanding of the sample efficiency of robust average-reward RL.
arXiv Detail & Related papers (2025-05-18T15:34:45Z)
- $\beta$-DQN: Improving Deep Q-Learning By Evolving the Behavior [41.13282452752521]
$\beta$-DQN is a simple and efficient exploration method that augments the standard DQN with a behavior function. An adaptive meta-controller is designed to select an effective policy for each episode, enabling flexible and explainable exploration. Experiments on both simple and challenging exploration domains show that $\beta$-DQN outperforms existing baseline methods.
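A bandit rule is one common way to implement such an episode-level meta-controller; the sketch below is an assumed UCB instantiation over a set of candidate policies (ranging from greedy on $Q$ to exploration-heavy), not the paper's exact mechanism:

```python
import numpy as np

def select_policy(returns_sum, counts, t):
    """UCB-style meta-controller: pick, for the next episode, the
    candidate policy with the best average return plus an exploration
    bonus for rarely selected policies."""
    safe_counts = np.maximum(counts, 1)
    ucb = returns_sum / safe_counts + np.sqrt(2 * np.log(t + 1) / safe_counts)
    return int(np.argmax(ucb))  # index of the policy to run this episode
```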
arXiv Detail & Related papers (2025-01-01T18:12:18Z)
- Scalable 3D Registration via Truncated Entry-wise Absolute Residuals [65.04922801371363]
A $3$D registration approach can process more than ten million ($10^7$) point pairs with over $99\%$ random outliers.
We call our method TEAR, as it involves minimizing an outlier-robust loss that computes Truncated Entry-wise Absolute Residuals.
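The loss named in the title is easy to state in a few lines; a minimal sketch follows (the scalable solver, which is the paper's actual contribution, is not shown):

```python
import numpy as np

def tear_loss(residuals: np.ndarray, c: float) -> float:
    """Truncated entry-wise absolute residual: each coordinate of the
    residual vector is clipped at threshold c, so a gross outlier
    contributes at most a bounded constant instead of an unbounded
    error, making the objective robust to extreme outlier rates."""
    return np.minimum(np.abs(residuals), c).sum()
```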
arXiv Detail & Related papers (2024-04-01T04:43:39Z)
- Alternating Objectives Generates Stronger PGD-Based Adversarial Attacks [78.2700757742992]
Projected Gradient Descent (PGD) is one of the most effective and conceptually simple algorithms for generating adversarial examples. We experimentally verify this assertion on a synthetic-data example and by evaluating our proposed method across 25 different $\ell_\infty$-robust models and 3 datasets. Our strongest adversarial attack outperforms all of the white-box components of the AutoAttack ensemble.
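For context, one textbook $\ell_\infty$ PGD step looks as follows; the paper's contribution is alternating the surrogate objective between such steps, which this sketch deliberately omits:

```python
import torch

def pgd_step(x_adv, grad, x_clean, alpha, eps):
    """One standard l_inf PGD step: ascend the sign of the loss gradient,
    then project back into the eps-ball around the clean input and the
    valid pixel range."""
    x_adv = x_adv + alpha * grad.sign()
    x_adv = torch.min(torch.max(x_adv, x_clean - eps), x_clean + eps)
    return torch.clamp(x_adv, 0.0, 1.0)
```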
arXiv Detail & Related papers (2022-12-15T17:44:31Z)
- Dealing with Sparse Rewards in Continuous Control Robotics via Heavy-Tailed Policies [64.2210390071609]
We present a novel Heavy-Tailed Policy Gradient (HT-PSG) algorithm to deal with the challenges of sparse rewards in continuous control problems.
We show consistent performance improvement across all tasks in terms of high average cumulative reward.
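As an illustration of the core idea, a heavy-tailed action distribution assigns far more probability to large exploratory moves than a Gaussian, which helps when rewards are sparse; the concrete family below (Cauchy) is our assumption, not necessarily the one HT-PSG uses:

```python
import torch
from torch.distributions import Cauchy

def sample_heavy_tailed_action(mean, scale, low, high):
    """Sample an exploratory action from a Cauchy policy head.
    The heavy tails make occasional large jumps likely, so the agent
    can stumble onto sparse rewards a Gaussian would rarely reach."""
    action = Cauchy(mean, scale).rsample()   # reparameterized sample
    return torch.clamp(action, low, high)    # respect action bounds
```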
arXiv Detail & Related papers (2022-06-12T04:09:39Z)
- Efficient Policy Iteration for Robust Markov Decision Processes via Regularization [49.05403412954533]
Robust Markov decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between $s$-rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
arXiv Detail & Related papers (2022-05-28T04:05:20Z)
- Revisiting Gaussian mixture critics in off-policy reinforcement learning: a sample-based approach [28.199348547856175]
This paper revisits a natural alternative that removes the requirement of prior knowledge about the minimum and maximum values a policy can attain.
It achieves state-of-the-art performance on a variety of challenging tasks.
arXiv Detail & Related papers (2022-04-21T16:44:47Z)
- Implicitly Regularized RL with Implicit Q-Values [42.87920755961722]
The $Q$-function is a central quantity in many Reinforcement Learning (RL) algorithms, in which agents act by following a (soft)-greedy policy with respect to it.
We propose to parametrize the $Q$-function implicitly, as the sum of a log-policy and of a value function.
We derive a practical off-policy deep RL algorithm that is suitable for large action spaces and enforces the softmax relation between the policy and the $Q$-values.
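The parametrization itself fits in one line: writing $Q(s,a) = V(s) + \tau \log \pi(a|s)$ means $\log \pi = (Q - V)/\tau$, so $\pi$ is the softmax of $Q/\tau$ by construction. A trivial sketch (the temperature name `tau` is ours):

```python
def implicit_q(log_pi, v, tau):
    """Q(s, a) parametrized implicitly as V(s) + tau * log pi(a|s).
    Works elementwise on floats or tensors; the softmax relation
    between pi and Q/tau then holds by construction rather than
    needing to be enforced by an extra loss."""
    return v + tau * log_pi
```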
arXiv Detail & Related papers (2021-08-16T12:20:47Z)
- Multiagent Rollout and Policy Iteration for POMDP with Application to Multi-Robot Repair Problems [1.6939372704265414]
We consider infinite horizon discounted dynamic programming problems with finite state and control spaces, partial state observations, and a multiagent structure.
Our methods specifically address the computational challenges of partially observable multiagent problems.
arXiv Detail & Related papers (2020-11-09T06:51:50Z)
- Queueing Network Controls via Deep Reinforcement Learning [0.0]
We develop a Proximal Policy Optimization (PPO) algorithm for queueing networks.
The algorithm consistently generates control policies that outperform the state of the art in the literature.
A key to the successes of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function.
arXiv Detail & Related papers (2020-07-31T01:02:57Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
- Zeroth-order Deterministic Policy Gradient [116.87117204825105]
We introduce Zeroth-order Deterministic Policy Gradient (ZDPG).
ZDPG approximates policy-reward gradients via two-point evaluations of the $Q$-function.
New finite sample complexity bounds for ZDPG improve upon existing results by up to two orders of magnitude.
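Two-point evaluation means probing $Q$ at two symmetric perturbations of the action and differencing the results; a sketch of that estimator family follows (smoothing constants and dimension factors differ in the paper, and `q_fn` is a hypothetical callable):

```python
import numpy as np

def two_point_grad(q_fn, state, action, mu=1e-2):
    """Two-point zeroth-order estimate of the action-gradient of Q:
    evaluate Q at action +/- mu*u along a random unit direction u and
    take the finite difference. No backpropagation through Q needed."""
    u = np.random.randn(*np.shape(action))
    u /= np.linalg.norm(u)                       # random unit direction
    q_plus = q_fn(state, action + mu * u)
    q_minus = q_fn(state, action - mu * u)
    return (q_plus - q_minus) / (2.0 * mu) * u   # directional-derivative estimate
```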
arXiv Detail & Related papers (2020-06-12T16:52:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.