How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization
- URL: http://arxiv.org/abs/2004.14309v2
- Date: Thu, 22 Oct 2020 14:24:34 GMT
- Title: How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization
- Authors: Pierluca D'Oro, Wojciech Jaśkowski
- Abstract summary: We propose MAGE, a model-based actor-critic algorithm, grounded in the theory of policy gradients.
MAGE backpropagates through the learned dynamics to compute gradient targets in temporal difference learning.
We demonstrate the efficiency of the algorithm in comparison to model-free and model-based state-of-the-art baselines.
- Score: 10.424426548124696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deterministic-policy actor-critic algorithms for continuous control improve
the actor by plugging its actions into the critic and ascending the
action-value gradient, which is obtained by chaining the actor's Jacobian
matrix with the gradient of the critic with respect to input actions. However,
instead of gradients, the critic is, typically, only trained to accurately
predict expected returns, which, on their own, are useless for policy
optimization. In this paper, we propose MAGE, a model-based actor-critic
algorithm, grounded in the theory of policy gradients, which explicitly learns
the action-value gradient. MAGE backpropagates through the learned dynamics to
compute gradient targets in temporal difference learning, leading to a critic
tailored for policy improvement. On a set of MuJoCo continuous-control tasks,
we demonstrate the efficiency of the algorithm in comparison to model-free and
model-based state-of-the-art baselines.
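To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of (i) the standard deterministic-policy-gradient actor update, in which autograd chains the actor's Jacobian with the critic's gradient with respect to the input action, and (ii) a MAGE-style critic loss that penalizes the action-gradient of a model-based temporal-difference error, obtained by backpropagating through a learned dynamics model. The module names (`actor`, `critic`, `target_critic`, `model`) and the weighting term `lam` are illustrative assumptions, not the authors' reference implementation.
```python
# Hedged sketch of the updates described in the abstract (assumed interfaces:
# `actor(s)`, `critic(s, a)`, `target_critic(s, a)` are torch.nn.Modules and
# `model(s, a)` is a learned, differentiable dynamics model returning
# (next_state, reward)). Illustration only, not the paper's code.
import torch


def actor_loss(critic, actor, s):
    # Deterministic policy gradient: descend -Q(s, pi(s)); autograd chains the
    # actor's Jacobian with the critic's gradient w.r.t. the input action.
    return -critic(s, actor(s)).mean()


def mage_critic_loss(critic, target_critic, actor, model, s, gamma=0.99, lam=0.05):
    # Re-attach the action as a leaf tensor so the TD error can be
    # differentiated with respect to it.
    a = actor(s).detach().requires_grad_(True)

    # Model-based TD target: backpropagating through the learned dynamics
    # makes the target differentiable in the action a.
    s_next, r = model(s, a)
    td_target = r + gamma * target_critic(s_next, actor(s_next))
    td_error = critic(s, a) - td_target

    # Action-gradient of the TD error, kept in the graph (create_graph=True)
    # so that minimizing its norm trains the critic's parameters.
    grad_a = torch.autograd.grad(td_error.sum(), a, create_graph=True)[0]

    # Penalize the gradient norm, plus a small standard TD term for stability
    # (the relative weighting `lam` is an assumption of this sketch).
    return grad_a.norm(dim=-1).mean() + lam * td_error.pow(2).mean()
```
In this sketch the actor step is the usual deterministic-policy actor-critic update; the MAGE-specific change is only in how the critic is trained, so that the action-gradient the actor relies on is itself an explicit learning target.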
Related papers
- Compatible Gradient Approximations for Actor-Critic Algorithms [0.0]
We introduce an actor-critic algorithm that bypasses the need for precise action-value gradient estimates by employing a zeroth-order approximation of the action-value gradient.
Empirical results demonstrate that our algorithm not only matches but frequently exceeds the performance of current state-of-the-art methods.
arXiv Detail & Related papers (2024-09-02T22:00:50Z)
- Learning a Diffusion Model Policy from Rewards via Q-Score Matching [93.0191910132874]
We present a theoretical framework linking the structure of diffusion model policies to a learned Q-function.
We propose a new policy update method from this theory, which we denote Q-score matching.
arXiv Detail & Related papers (2023-12-18T23:31:01Z)
- Decision-Aware Actor-Critic with Function Approximation and Theoretical Guarantees [12.259191000019033]
Actor-critic (AC) methods are widely used in reinforcement learning (RL).
We design a joint objective for training the actor and critic in a decision-aware fashion.
We empirically demonstrate the benefit of our decision-aware actor-critic framework on simple RL problems.
arXiv Detail & Related papers (2023-05-24T15:34:21Z)
- Actor-Critic learning for mean-field control in continuous time [0.0]
We study policy gradient for mean-field control in continuous time in a reinforcement learning setting.
By considering randomised policies with entropy regularisation, we derive a gradient expectation representation of the value function.
In the linear-quadratic mean-field framework, we obtain an exact parametrisation of the actor and critic functions defined on the Wasserstein space.
arXiv Detail & Related papers (2023-03-13T10:49:25Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
- Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning [85.50033812217254]
Actor-critic methods are widely used in offline reinforcement learning practice, but are not so well-understood theoretically.
We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle.
arXiv Detail & Related papers (2021-08-19T17:27:29Z)
- Learning Value Functions in Deep Policy Gradients using Residual Variance [22.414430270991005]
Policy gradient algorithms have proven to be successful in diverse decision making and control tasks.
Traditional actor-critic algorithms do not succeed in fitting the true value function.
We provide a new state-value (resp. state-action-value) function approximation that learns the value of the states relative to their mean value.
arXiv Detail & Related papers (2020-10-09T08:57:06Z)
- Model-Augmented Actor-Critic: Backpropagating through Paths [81.86992776864729]
Current model-based reinforcement learning approaches use the model simply as a learned black-box simulator.
We show how to make more effective use of the model by exploiting its differentiability.
arXiv Detail & Related papers (2020-05-16T19:18:10Z)
- Online Meta-Critic Learning for Off-Policy Actor-Critic Methods [107.98781730288897]
Off-Policy Actor-Critic (Off-PAC) methods have proven successful in a variety of continuous control tasks.
We introduce a novel and flexible meta-critic that observes the learning process and meta-learns an additional loss for the actor.
arXiv Detail & Related papers (2020-03-11T14:39:49Z)
- Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it on correlated actions, and combine these critic estimated action values to control the variance of gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)