Actor-Critic learning for mean-field control in continuous time
- URL: http://arxiv.org/abs/2303.06993v1
- Date: Mon, 13 Mar 2023 10:49:25 GMT
- Title: Actor-Critic learning for mean-field control in continuous time
- Authors: Noufel Frikha (UP1 UFR27), Maximilien Germain (LPSM (UMR_8001)),
Mathieu Laurière, Huyên Pham (LPSM (UMR_8001)), Xuanye Song (LPSM (UMR_8001))
- Abstract summary: We study policy gradient for mean-field control in continuous time in a reinforcement learning setting.
By considering randomised policies with entropy regularisation, we derive a gradient expectation representation of the value function.
In the linear-quadratic mean-field framework, we obtain an exact parametrisation of the actor and critic functions defined on the Wasserstein space.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study policy gradient for mean-field control in continuous time in a
reinforcement learning setting. By considering randomised policies with entropy
regularisation, we derive a gradient expectation representation of the value
function, which is amenable to actor-critic type algorithms, where the value
functions and the policies are learnt alternately based on observation samples
of the state and model-free estimation of the population state distribution,
either by offline or online learning. In the linear-quadratic mean-field
framework, we obtain an exact parametrisation of the actor and critic functions
defined on the Wasserstein space. Finally, we illustrate the results of our
algorithms with some numerical experiments on concrete examples.
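To make the alternating actor-critic scheme concrete, here is a minimal Python sketch on a one-dimensional linear-quadratic mean-field example with an entropy-regularised Gaussian policy. The toy dynamics, cost coefficients, parametrisations and step sizes are assumptions chosen for illustration, not the paper's algorithm: the critic is fitted by a plain Monte Carlo regression on quadratic features of the state and the sampled population mean, and the population mean is re-estimated from the simulated samples at every step, in the spirit of the model-free estimation of the population distribution described above.

import numpy as np

rng = np.random.default_rng(0)
T, dt, N = 1.0, 0.02, 500                      # horizon, time step, simulated population size
n_steps = int(T / dt)
q, qbar, r, sigma, tau = 1.0, 0.5, 0.5, 0.3, 0.1   # assumed cost weights, noise level, entropy temperature
lr = 0.05

theta = np.array([0.0, 0.0, np.log(0.5)])      # actor: a ~ N(theta0*x + theta1*m, exp(theta2)^2)
w = np.zeros(3)                                # critic: V(x, m) ~ w0 + w1*(x - m)^2 + w2*m^2

for it in range(200):
    x = rng.normal(0.0, 1.0, N)                # sample an initial population
    xs, ms, acts, costs = [], [], [], []
    for _ in range(n_steps):
        m = x.mean()                           # model-free estimate of the population mean
        mu, std = theta[0] * x + theta[1] * m, np.exp(theta[2])
        a = mu + std * rng.normal(size=N)      # randomised (exploratory) policy
        run = (q * x**2 + qbar * (x - m)**2 + r * a**2 - tau * np.log(std)) * dt
        xs.append(x); ms.append(m); acts.append(a); costs.append(run)
        x = x + a * dt + sigma * np.sqrt(dt) * rng.normal(size=N)
    X, M, A, C = np.array(xs), np.array(ms), np.array(acts), np.array(costs)
    ctg = np.cumsum(C[::-1], axis=0)[::-1]     # Monte Carlo cost-to-go, shape (n_steps, N)
    # critic step: regress the cost-to-go on quadratic features of (x, m)
    Phi = np.stack([np.ones_like(X), (X - M[:, None])**2,
                    np.broadcast_to(M[:, None]**2, X.shape)], axis=-1)
    w = np.linalg.lstsq(Phi.reshape(-1, 3), ctg.reshape(-1), rcond=None)[0]
    # actor step: score-function gradient with the freshly fitted critic as a baseline
    # (the direct entropy-gradient term and the dependence of m on theta are ignored here)
    std = np.exp(theta[2])
    mu = theta[0] * X + theta[1] * M[:, None]
    eps = (A - mu) / std
    score = np.stack([eps * X / std, eps * M[:, None] / std, eps**2 - 1.0], axis=-1)
    adv = ctg - Phi @ w                        # advantage-like regression residual
    theta -= lr * (score * adv[..., None]).mean(axis=(0, 1))

print("actor parameters:", theta)
print("critic weights:", w)

The sketch deliberately ignores the dependence of the population mean on the policy parameters and the direct gradient of the entropy term, so it is only a rough stand-in for the gradient expectation representation derived in the paper.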
Related papers
- Full error analysis of policy gradient learning algorithms for exploratory linear quadratic mean-field control problem in continuous time with common noise [0.0]
We study policy gradient (PG) learning and first demonstrate convergence in a model-based setting.
We prove the global linear convergence and sample complexity of the PG algorithm with two-point gradient estimates in a model-free setting.
In this setting, the parameterized optimal policies are learned from samples of the states and population distribution.
arXiv Detail & Related papers (2024-08-05T14:11:51Z)
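As an illustration of the two-point gradient estimates mentioned in the entry above, the following Python sketch runs a zeroth-order, model-free policy search on a toy linear-quadratic mean-field problem; the rollout cost, coefficients, smoothing radius and step size are assumptions, not the cited paper's setting or its exact estimator.

import numpy as np

rng = np.random.default_rng(1)

def rollout_cost(theta, n_agents=1000, n_steps=50, dt=0.02):
    # Monte Carlo cost of the linear feedback a = theta[0]*x + theta[1]*mean(x)
    # on a toy 1-D linear-quadratic mean-field problem (all coefficients assumed).
    x = rng.normal(0.0, 1.0, n_agents)
    cost = 0.0
    for _ in range(n_steps):
        m = x.mean()
        a = theta[0] * x + theta[1] * m
        cost += np.mean(x**2 + 0.5 * (x - m)**2 + 0.5 * a**2) * dt
        x = x + a * dt + 0.3 * np.sqrt(dt) * rng.normal(size=n_agents)
    return cost

def two_point_gradient(theta, radius=0.1):
    # Evaluate the cost at theta +/- radius*u for a random unit direction u and
    # difference the two evaluations (standard two-point zeroth-order estimator).
    u = rng.normal(size=theta.size)
    u /= np.linalg.norm(u)
    diff = rollout_cost(theta + radius * u) - rollout_cost(theta - radius * u)
    return theta.size * diff / (2 * radius) * u

theta = np.zeros(2)
for _ in range(100):
    theta -= 0.05 * two_point_gradient(theta)   # plain gradient descent on the estimated cost
print("learned feedback gains:", theta)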
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, (hyper)policies are learned only so that their deterministic version can be deployed at convergence.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Actor critic learning algorithms for mean-field control with moment neural networks [0.0]
We develop a new policy gradient and actor-critic algorithm for solving mean-field control problems.
The learning for both the actor (policy) and critic (value function) is facilitated by a class of moment neural network functions.
arXiv Detail & Related papers (2023-09-08T13:29:57Z)
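The moment-based parametrisation mentioned in the entry above can be sketched as follows: the critic receives the first few empirical moments of the population sample instead of the full distribution. The architecture, the number of moments and the (omitted) training loop are assumptions rather than the cited paper's construction.

import numpy as np

def moment_features(x, n_moments=4):
    # Empirical moments of a population sample x of shape (N,).
    return np.array([np.mean(x ** k) for k in range(1, n_moments + 1)])

def critic(x_i, population, params):
    # One-hidden-layer network V(x_i, mu), with mu represented by empirical moments of
    # the sampled population; the weights would be trained by regression on observed
    # costs-to-go (training loop omitted in this sketch).
    W1, b1, W2, b2 = params
    z = np.concatenate(([x_i], moment_features(population)))
    h = np.tanh(W1 @ z + b1)
    return W2 @ h + b2

rng = np.random.default_rng(2)
pop = rng.normal(1.0, 0.5, 1000)                     # a sampled population
params = (0.1 * rng.normal(size=(16, 5)), np.zeros(16),
          0.1 * rng.normal(size=16), 0.0)
print("value estimate at x = 0.3:", critic(0.3, pop, params))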
- Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms [1.776746672434207]
We study policy gradient (PG) for reinforcement learning in continuous time and space.
We propose two types of actor-critic algorithms for RL, in which value functions and policies are learned and updated simultaneously and alternately.
arXiv Detail & Related papers (2021-11-22T14:27:04Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
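A schematic Python sketch of the variance-based reweighting described in the entry above: one Fitted Q-Iteration step with linear features solved as a weighted least-squares problem, where the per-sample variance estimates are simply taken as given (the actual VA-OPE variance estimation procedure is not reproduced here).

import numpy as np

def weighted_fqi_step(phi, phi_next, rewards, w_prev, var_hat, gamma=0.99, reg=1e-3):
    # phi, phi_next: (n, d) features for (s, a) and (s', pi(s')); var_hat: (n,) estimated
    # target variances.  Larger estimated variance -> smaller weight on that residual.
    targets = rewards + gamma * phi_next @ w_prev        # bootstrapped regression targets
    weights = 1.0 / np.maximum(var_hat, 1e-6)            # inverse-variance reweighting
    A = phi.T @ (weights[:, None] * phi) + reg * np.eye(phi.shape[1])
    b = phi.T @ (weights * targets)
    return np.linalg.solve(A, b)                         # weighted least-squares solution

# toy usage with random data, only to show the shapes involved
rng = np.random.default_rng(3)
n, d = 200, 5
phi, phi_next = rng.normal(size=(n, d)), rng.normal(size=(n, d))
rewards, var_hat = rng.normal(size=n), rng.uniform(0.5, 2.0, size=n)
w = np.zeros(d)
for _ in range(20):
    w = weighted_fqi_step(phi, phi_next, rewards, w, var_hat)
print("fitted weights:", w)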
- Policy Gradient for Continuing Tasks in Non-stationary Markov Decision Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
We compute unbiased navigation gradients of the value function, which we use as ascent directions to update the policy.
arXiv Detail & Related papers (2020-10-16T15:15:42Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
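The kernelisation idea in the entry above can be sketched in a one-step setting (the cited paper treats the sequential case): a Gaussian kernel around the target action replaces the ill-defined density ratio of a deterministic policy. The kernel choice, bandwidth and toy data below are assumptions.

import numpy as np

def kernel_dr_value(states, actions, rewards, behaviour_density, pi, q_hat, bandwidth=0.2):
    # states, actions, rewards: logged data; behaviour_density(s, a): behaviour policy
    # density at the logged action; pi(s): deterministic target action; q_hat: fitted Q model.
    target_actions = pi(states)
    # Gaussian kernel weights concentrate on logged actions close to the target action.
    k = np.exp(-0.5 * ((actions - target_actions) / bandwidth) ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    correction = k / behaviour_density(states, actions) * (rewards - q_hat(states, actions))
    return np.mean(q_hat(states, target_actions) + correction)

# toy usage: behaviour policy N(0.5*s, 1), target policy pi(s) = s, quadratic reward
rng = np.random.default_rng(4)
s = rng.normal(size=5000)
a = 0.5 * s + rng.normal(size=5000)
r = -(a - s) ** 2 + 0.1 * rng.normal(size=5000)
behaviour_density = lambda s, a: np.exp(-0.5 * (a - 0.5 * s) ** 2) / np.sqrt(2 * np.pi)
pi = lambda s: s
q_hat = lambda s, a: -(a - s) ** 2                      # here a well-specified outcome model
print("DR estimate of the target value:", kernel_dr_value(s, a, r, behaviour_density, pi, q_hat))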
- Model-Augmented Actor-Critic: Backpropagating through Paths [81.86992776864729]
Current model-based reinforcement learning approaches use the model simply as a learned black-box simulator.
We show how to make more effective use of the model by exploiting its differentiability.
arXiv Detail & Related papers (2020-05-16T19:18:10Z)
- How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization [10.424426548124696]
We propose MAGE, a model-based actor-critic algorithm, grounded in the theory of policy gradients.
MAGE backpropagates through the learned dynamics to compute gradient targets in temporal difference learning.
We demonstrate the efficiency of the algorithm in comparison to model-free and model-based state-of-the-art baselines.
arXiv Detail & Related papers (2020-04-29T16:30:53Z)
- Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in batch reinforcement learning settings such as education and healthcare.
We develop an approach that estimates bounds on the value of a given policy.
We prove convergence to the sharp bounds as we collect more confounded data.
arXiv Detail & Related papers (2020-02-11T16:18:14Z)