How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization
- URL: http://arxiv.org/abs/2004.14309v2
- Date: Thu, 22 Oct 2020 14:24:34 GMT
- Title: How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization
- Authors: Pierluca D'Oro, Wojciech Jaśkowski
- Abstract summary: We propose MAGE, a model-based actor-critic algorithm, grounded in the theory of policy gradients.
MAGE backpropagates through the learned dynamics to compute gradient targets in temporal difference learning.
We demonstrate the efficiency of the algorithm in comparison to model-free and model-based state-of-the-art baselines.
- Score: 10.424426548124696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deterministic-policy actor-critic algorithms for continuous control improve
the actor by plugging its actions into the critic and ascending the
action-value gradient, which is obtained by chaining the actor's Jacobian
matrix with the gradient of the critic with respect to input actions. However,
instead of gradients, the critic is, typically, only trained to accurately
predict expected returns, which, on their own, are useless for policy
optimization. In this paper, we propose MAGE, a model-based actor-critic
algorithm, grounded in the theory of policy gradients, which explicitly learns
the action-value gradient. MAGE backpropagates through the learned dynamics to
compute gradient targets in temporal difference learning, leading to a critic
tailored for policy improvement. On a set of MuJoCo continuous-control tasks,
we demonstrate the efficiency of the algorithm in comparison to model-free and
model-based state-of-the-art baselines.
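To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of (i) the standard deterministic-policy-gradient actor update, in which autograd chains the actor's Jacobian with the critic's gradient with respect to the input action, and (ii) a MAGE-style critic loss that penalizes the action-gradient of a model-based temporal-difference error, obtained by backpropagating through a learned dynamics model. The module names (`actor`, `critic`, `target_critic`, `model`) and the weighting term `lam` are illustrative assumptions, not the authors' reference implementation.
```python
# Hedged sketch of the updates described in the abstract (assumed interfaces:
# `actor(s)`, `critic(s, a)`, `target_critic(s, a)` are torch.nn.Modules and
# `model(s, a)` is a learned, differentiable dynamics model returning
# (next_state, reward)). Illustration only, not the paper's code.
import torch


def actor_loss(critic, actor, s):
    # Deterministic policy gradient: descend -Q(s, pi(s)); autograd chains the
    # actor's Jacobian with the critic's gradient w.r.t. the input action.
    return -critic(s, actor(s)).mean()


def mage_critic_loss(critic, target_critic, actor, model, s, gamma=0.99, lam=0.05):
    # Re-attach the action as a leaf tensor so the TD error can be
    # differentiated with respect to it.
    a = actor(s).detach().requires_grad_(True)

    # Model-based TD target: backpropagating through the learned dynamics
    # makes the target differentiable in the action a.
    s_next, r = model(s, a)
    td_target = r + gamma * target_critic(s_next, actor(s_next))
    td_error = critic(s, a) - td_target

    # Action-gradient of the TD error, kept in the graph (create_graph=True)
    # so that minimizing its norm trains the critic's parameters.
    grad_a = torch.autograd.grad(td_error.sum(), a, create_graph=True)[0]

    # Penalize the gradient norm, plus a small standard TD term for stability
    # (the relative weighting `lam` is an assumption of this sketch).
    return grad_a.norm(dim=-1).mean() + lam * td_error.pow(2).mean()
```
In this sketch the actor step is the usual deterministic-policy actor-critic update; the MAGE-specific change is only in how the critic is trained, so that the action-gradient the actor relies on is itself an explicit learning target.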
Related papers
- Compatible Gradient Approximations for Actor-Critic Algorithms [0.0]
We introduce an actor-critic algorithm that bypasses the need for precise action-value gradient estimates by employing a zeroth-order approximation of the action-value gradient.
Empirical results demonstrate that our algorithm not only matches but frequently exceeds the performance of current state-of-the-art methods.
arXiv Detail & Related papers (2024-09-02T22:00:50Z)
- Learning a Diffusion Model Policy from Rewards via Q-Score Matching [93.0191910132874]
We present a theoretical framework linking the structure of diffusion model policies to a learned Q-function.
We propose a new policy update method from this theory, which we denote Q-score matching.
arXiv Detail & Related papers (2023-12-18T23:31:01Z)
- Decision-Aware Actor-Critic with Function Approximation and Theoretical Guarantees [12.259191000019033]
Actor-critic (AC) methods are widely used in reinforcement learning (RL).
We design a joint objective for training the actor and critic in a decision-aware fashion.
We empirically demonstrate the benefit of our decision-aware actor-critic framework on simple RL problems.
arXiv Detail & Related papers (2023-05-24T15:34:21Z)
- Actor-Critic learning for mean-field control in continuous time [0.0]
We study policy gradient for mean-field control in continuous time in a reinforcement learning setting.
By considering randomised policies with entropy regularisation, we derive a gradient expectation representation of the value function.
In the linear-quadratic mean-field framework, we obtain an exact parametrisation of the actor and critic functions defined on the Wasserstein space.
arXiv Detail & Related papers (2023-03-13T10:49:25Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
- Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning [85.50033812217254]
Actor-critic methods are widely used in offline reinforcement learning practice, but are not so well-understood theoretically.
We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle.
arXiv Detail & Related papers (2021-08-19T17:27:29Z)
- Learning Value Functions in Deep Policy Gradients using Residual Variance [22.414430270991005]
Policy gradient algorithms have proven to be successful in diverse decision making and control tasks.
Traditional actor-critic algorithms do not succeed in fitting the true value function.
We provide a new state-value (resp. state-action-value) function approximation that learns the value of the states relative to their mean value.
arXiv Detail & Related papers (2020-10-09T08:57:06Z)
- Model-Augmented Actor-Critic: Backpropagating through Paths [81.86992776864729]
Current model-based reinforcement learning approaches use the model simply as a learned black-box simulator.
We show how to make more effective use of the model by exploiting its differentiability.
arXiv Detail & Related papers (2020-05-16T19:18:10Z)
- Online Meta-Critic Learning for Off-Policy Actor-Critic Methods [107.98781730288897]
Off-Policy Actor-Critic (Off-PAC) methods have proven successful in a variety of continuous control tasks.
We introduce a novel and flexible meta-critic that observes the learning process and meta-learns an additional loss for the actor.
arXiv Detail & Related papers (2020-03-11T14:39:49Z)
- Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it on correlated actions, and combine these critic estimated action values to control the variance of gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)