Revisiting Gaussian mixture critics in off-policy reinforcement
learning: a sample-based approach
- URL: http://arxiv.org/abs/2204.10256v2
- Date: Fri, 22 Apr 2022 06:39:03 GMT
- Title: Revisiting Gaussian mixture critics in off-policy reinforcement
learning: a sample-based approach
- Authors: Bobak Shahriari, Abbas Abdolmaleki, Arunkumar Byravan, Abe Friesen,
Siqi Liu, Jost Tobias Springenberg, Nicolas Heess, Matt Hoffman, Martin
Riedmiller
- Abstract summary: This paper revisits a natural alternative that removes the requirement of prior knowledge about the minimum and maximum values a policy can attain.
It achieves state-of-the-art performance on a variety of challenging tasks.
- Score: 28.199348547856175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Actor-critic algorithms that make use of distributional policy evaluation
have frequently been shown to outperform their non-distributional counterparts
on many challenging control tasks. Examples of this behavior include the D4PG
and DMPO algorithms as compared to DDPG and MPO, respectively [Barth-Maron et
al., 2018; Hoffman et al., 2020]. However, both agents rely on the C51 critic
for value estimation. One major drawback of the C51 approach is its requirement
of prior knowledge about the minimum and maximum values a policy can attain, as
well as the number of bins used, which fixes the resolution of the
distributional estimate. While the DeepMind control suite of tasks utilizes
standardized rewards and episode lengths, thus enabling the entire suite to be
solved with a single setting of these hyperparameters, this is often not the
case. This paper revisits a natural alternative that removes this requirement,
namely a mixture of Gaussians, and a simple sample-based loss function to train
it in an off-policy regime. We empirically evaluate its performance on a broad
range of continuous control tasks and demonstrate that it eliminates the need
for these distributional hyperparameters and achieves state-of-the-art
performance on a variety of challenging tasks (e.g. the humanoid, dog,
quadruped, and manipulator domains). Finally, we provide an implementation in the
Acme agent repository.
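As an illustration of the approach described above (a minimal PyTorch-style sketch, not the authors' Acme implementation), a Gaussian mixture critic can be trained with a sample-based loss by maximizing the log-likelihood of bootstrapped return samples drawn from a target critic. Names such as MixtureCritic, critic_loss, and num_components are illustrative assumptions.

```python
# Hedged sketch: Gaussian mixture critic with a sample-based TD loss.
import torch
import torch.nn as nn
import torch.distributions as D


class MixtureCritic(nn.Module):
    """Maps a (state, action) pair to a K-component Gaussian mixture over returns."""

    def __init__(self, obs_dim, act_dim, hidden=256, num_components=5):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Separate heads for mixture weights, component means, and log-scales.
        self.logits = nn.Linear(hidden, num_components)
        self.means = nn.Linear(hidden, num_components)
        self.log_scales = nn.Linear(hidden, num_components)

    def forward(self, obs, act):
        h = self.torso(torch.cat([obs, act], dim=-1))
        mixture = D.Categorical(logits=self.logits(h))
        components = D.Normal(self.means(h), self.log_scales(h).exp())
        return D.MixtureSameFamily(mixture, components)


def critic_loss(critic, target_critic, policy, batch, gamma=0.99, num_samples=16):
    """Sample-based distributional TD loss: maximize the likelihood of
    bootstrapped return samples r + gamma * z', with z' ~ Z_target(s', a')."""
    obs, act, rew, discount, next_obs = batch
    with torch.no_grad():
        next_act = policy(next_obs)                    # a' ~ pi(. | s')
        target_dist = target_critic(next_obs, next_act)
        z_next = target_dist.sample((num_samples,))    # [num_samples, batch]
        targets = rew + gamma * discount * z_next      # bootstrapped return samples
    online_dist = critic(obs, act)
    return -online_dist.log_prob(targets).mean()       # negative log-likelihood
```

Note that, unlike C51, nothing in this sketch requires specifying minimum/maximum return bounds or a bin count; the mixture means and scales adapt to the returns encountered during training.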
Related papers
- Zeroth-Order Policy Gradient for Reinforcement Learning from Human
Feedback without Reward Inference [17.76565371753346]
This paper develops two RLHF algorithms without reward inference.
The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator.
Our results show there exist provably efficient methods to solve general RLHF problems without reward inference.
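A rough, hedged sketch of the zeroth-order ingredient mentioned above follows; the preference-based value-difference estimator is abstracted away as a placeholder callable and is not the authors' construction.

```python
# Generic two-point zeroth-order gradient estimator (illustrative only).
# `value_difference` stands in for an estimate of J(theta + mu*u) - J(theta).
import numpy as np


def zeroth_order_policy_gradient(theta, value_difference, mu=0.05,
                                 num_directions=8, rng=None):
    """Estimate grad J(theta) without differentiating (or inferring) a reward model."""
    if rng is None:
        rng = np.random.default_rng(0)
    grad = np.zeros_like(theta)
    for _ in range(num_directions):
        u = rng.standard_normal(theta.shape)        # random probe direction
        delta = value_difference(theta, mu * u)     # ~ J(theta + mu*u) - J(theta)
        grad += (delta / mu) * u
    return grad / num_directions

# The policy parameters would then be updated as theta <- theta + lr * grad.
```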
arXiv Detail & Related papers (2024-09-25T22:20:11Z)
- Soft Actor-Critic with Beta Policy via Implicit Reparameterization Gradients [0.0]
Soft actor-critic (SAC) mitigates poor sample efficiency by combining policy optimization and off-policy learning.
It is limited to distributions whose gradients can be computed through the reparameterization trick.
We extend this technique to train SAC with the beta policy on simulated robot locomotion environments.
Experimental results show that the beta policy is a viable alternative, as it outperforms the normal policy and is on par with the squashed normal policy.
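For context, PyTorch's Beta distribution exposes rsample with pathwise gradients, which serves here as a stand-in for the implicit reparameterization gradients the paper discusses. The following is a hedged illustration, not the paper's exact architecture, and the constant Jacobian of the affine action rescaling is omitted.

```python
# Illustrative beta policy head with reparameterized sampling.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributions as D


class BetaPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * act_dim))
        self.act_dim = act_dim

    def forward(self, obs):
        params = F.softplus(self.net(obs)) + 1.0           # alpha, beta > 1 keeps density unimodal
        alpha, beta = params.split(self.act_dim, dim=-1)
        dist = D.Beta(alpha, beta)
        u = dist.rsample()                                  # gradients flow through the sample
        action = 2.0 * u - 1.0                              # map (0, 1) support to [-1, 1] actions
        log_prob = dist.log_prob(u).sum(-1)                 # Jacobian of the rescaling omitted
        return action, log_prob
```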
arXiv Detail & Related papers (2024-09-08T04:30:51Z)
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Dealing with Sparse Rewards in Continuous Control Robotics via
Heavy-Tailed Policies [64.2210390071609]
We present a novel Heavy-Tailed Policy Gradient (HT-PSG) algorithm to deal with the challenges of sparse rewards in continuous control problems.
We show consistent performance improvement across all tasks in terms of high average cumulative reward.
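The sketch below is a hedged illustration of what a heavy-tailed policy parameterization can look like, not the HT-PSG algorithm itself; the network shape and the Cauchy choice are assumptions for concreteness.

```python
# Illustrative heavy-tailed (Cauchy) policy: occasional large actions keep
# the agent exploring when the reward signal is sparse.
import torch
import torch.nn as nn
import torch.distributions as D


class CauchyPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 2 * act_dim))
        self.act_dim = act_dim

    def forward(self, obs):
        loc, log_scale = self.net(obs).split(self.act_dim, dim=-1)
        dist = D.Cauchy(loc, log_scale.exp())        # heavier tails than a Gaussian
        action = dist.sample()
        log_prob = dist.log_prob(action).sum(-1)     # used in a score-function update
        return action, log_prob
```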
arXiv Detail & Related papers (2022-06-12T04:09:39Z)
- On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces [23.186300629667134]
We study the convergence of policy gradient algorithms under heavy-tailed parameterizations.
Our main theoretical contribution is establishing that this scheme converges with constant step and batch sizes.
arXiv Detail & Related papers (2022-01-28T18:54:30Z)
- Robust and Adaptive Temporal-Difference Learning Using An Ensemble of
Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
arXiv Detail & Related papers (2021-12-01T23:15:09Z)
- Controlling Conditional Language Models with Distributional Policy
Gradients [2.9176992922046923]
General-purpose pretrained generative models often fail to meet some of the downstream requirements.
This raises an important question of how to adapt pre-trained generative models to a new task without destroying their capabilities.
Recent work has suggested to solve this problem by representing task-specific requirements through energy-based models.
In this paper, we extend this approach to conditional tasks by proposing Conditional DPG (CDPG).
arXiv Detail & Related papers (2021-12-01T19:24:05Z)
- Enhanced Scene Specificity with Sparse Dynamic Value Estimation [22.889059874754242]
Multi-scene reinforcement learning has become essential for many applications.
One strategy for variance reduction is to consider each scene as a distinct Markov decision process (MDP).
In this paper, we argue that the error between the true scene-specific value function and the predicted dynamic estimate can be further reduced by progressively enforcing sparse cluster assignments.
arXiv Detail & Related papers (2020-11-25T08:35:16Z)
- MLE-guided parameter search for task loss minimization in neural
sequence modeling [83.83249536279239]
Neural autoregressive sequence models are used to generate sequences in a variety of natural language processing (NLP) tasks.
We propose maximum likelihood guided parameter search (MGS), which samples from a distribution over update directions that is a mixture of random search around the current parameters and around the maximum likelihood gradient.
Our experiments show that MGS is capable of optimizing sequence-level losses, with substantial reductions in repetition and non-termination in sequence completion, and similar improvements to those of minimum risk training in machine translation.
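A hedged sketch of the sampling step described above follows; the scoring and acceptance of candidates are omitted, and the function name and hyperparameters are assumptions rather than the paper's exact procedure.

```python
# Candidate parameter updates drawn from a mixture of random search around the
# current parameters and search guided by the maximum-likelihood gradient.
import numpy as np


def sample_update_directions(mle_grad, num_candidates=8, sigma=0.01,
                             mix_prob=0.5, rng=None):
    """Return candidate updates: noise around zero or around -mle_grad."""
    if rng is None:
        rng = np.random.default_rng(0)
    candidates = []
    for _ in range(num_candidates):
        noise = sigma * rng.standard_normal(mle_grad.shape)
        if rng.random() < mix_prob:
            candidates.append(noise)               # pure random-search component
        else:
            candidates.append(noise - mle_grad)    # component guided by the MLE gradient
    return candidates

# Each candidate would then be scored with the sequence-level task loss, and the
# best-scoring update (or a weighted combination) applied to the parameters.
```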
arXiv Detail & Related papers (2020-06-04T22:21:22Z)
- Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method, called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.