Policy Gradient and Actor-Critic Learning in Continuous Time and Space:
Theory and Algorithms
- URL: http://arxiv.org/abs/2111.11232v1
- Date: Mon, 22 Nov 2021 14:27:04 GMT
- Title: Policy Gradient and Actor-Critic Learning in Continuous Time and Space:
Theory and Algorithms
- Authors: Yanwei Jia and Xun Yu Zhou
- Abstract summary: We study policy gradient (PG) for reinforcement learning in continuous time and space.
We propose two types of actor-critic algorithms for RL, where we learn and update value functions and policies simultaneously and alternatingly.
- Score: 1.776746672434207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study policy gradient (PG) for reinforcement learning in continuous time
and space under the regularized exploratory formulation developed by Wang et
al. (2020). We represent the gradient of the value function with respect to a
given parameterized stochastic policy as the expected integration of an
auxiliary running reward function that can be evaluated using samples and the
current value function. This effectively turns PG into a policy evaluation (PE)
problem, enabling us to apply the martingale approach recently developed by Jia
and Zhou (2021) for PE to solve our PG problem. Based on this analysis, we
propose two types of actor-critic algorithms for RL, where we learn and
update value functions and policies simultaneously and alternatingly. The first
type is based directly on the aforementioned representation which involves
future trajectories and hence is offline. The second type, designed for online
learning, employs the first-order condition of the policy gradient and turns it
into martingale orthogonality conditions. These conditions are then
incorporated using stochastic approximation when updating policies. Finally, we
demonstrate the algorithms by simulations in two concrete examples.
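To make the martingale viewpoint concrete, below is a minimal, time-discretized sketch in the spirit of the second (online) type of actor-critic algorithm. It is an illustrative reconstruction under assumed choices, not the paper's exact algorithm or examples: the quadratic critic V_phi, the linear-Gaussian policy pi_theta, the one-dimensional controlled diffusion dX = a dt + sigma dW with running reward -(x^2 + a^2), the temperature lam, and all step sizes are assumptions made only for this sketch.
```python
import numpy as np

# Minimal time-discretized sketch of an online actor-critic update.
# Everything here (quadratic critic, linear-Gaussian policy, toy dynamics
# dX = a dt + sigma dW with running reward -(x^2 + a^2), step sizes) is an
# illustrative assumption, not the paper's exact algorithm or example.

rng = np.random.default_rng(0)
dt, T, sigma, lam = 0.01, 1.0, 0.5, 0.1   # time step, horizon, noise, temperature
phi = np.zeros(3)                          # critic: V(t, x) = phi0 + phi1*x^2 + phi2*(T - t)
theta = np.array([0.0, np.log(0.5)])       # actor: mean = theta0*x, log-std = theta1

def V(t, x):
    return phi[0] + phi[1] * x**2 + phi[2] * (T - t)

def grad_V(t, x):                          # test process for the critic's orthogonality condition
    return np.array([1.0, x**2, T - t])

alpha_c, alpha_a = 0.05, 0.01              # critic / actor learning rates

for episode in range(1000):
    x, t = rng.normal(), 0.0
    while t < T:
        mean, std = theta[0] * x, np.exp(theta[1])
        a = mean + std * rng.normal()      # sample the stochastic (exploratory) policy
        logp = -0.5 * ((a - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)
        r = -(x**2 + a**2) - lam * logp    # entropy-regularized running reward
        x_next = x + a * dt + sigma * np.sqrt(dt) * rng.normal()
        t_next = t + dt
        # martingale (temporal-difference) increment; zero-mean if V is exact
        delta = V(t_next, x_next) - V(t, x) + r * dt
        # critic: stochastic approximation of the condition E[grad_V dM] = 0
        phi += alpha_c * delta * grad_V(t, x)
        # actor: the same increment paired with the policy's score function
        score = np.array([(a - mean) * x / std**2,
                          ((a - mean) / std) ** 2 - 1.0])
        theta += alpha_a * delta * score
        x, t = x_next, t_next

print("feedback gain:", theta[0], "policy std:", np.exp(theta[1]))
```
Here delta is the discretized analogue of the martingale increment: it has zero conditional mean when V is the exact value function of the current policy, the critic step is a stochastic-approximation update of the policy-evaluation orthogonality condition of Jia and Zhou (2021), and the actor step pairs the same increment with the policy's score function, schematically mirroring how the paper turns the first-order condition of the policy gradient into martingale orthogonality conditions.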
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only in order to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy for an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - Actor-Critic learning for mean-field control in continuous time [0.0]
We study policy gradient for mean-field control in continuous time in a reinforcement learning setting.
By considering randomised policies with entropy regularisation, we derive a gradient expectation representation of the value function.
In the linear-quadratic mean-field framework, we obtain an exact parametrisation of the actor and critic functions defined on the Wasserstein space.
arXiv Detail & Related papers (2023-03-13T10:49:25Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
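The summary above does not spell out how the method avoids querying unseen actions; the cited paper (Implicit Q-Learning, Kostrikov et al., 2021) does so by fitting a state-value function to the Q-values of dataset actions with an expectile loss, so the implicit maximization stays inside the data. A minimal sketch of that loss, with illustrative names and numbers:
```python
import numpy as np

# Sketch of the expectile loss at the core of Implicit Q-Learning.
# Only Q-values of actions that appear in the dataset are used; tau > 0.5
# makes V track an upper expectile of those values, acting as an implicit
# max over actions without ever evaluating out-of-dataset actions.
# Function and variable names here are illustrative, not the authors' code.

def expectile_loss(q_minus_v, tau=0.7):
    weight = np.where(q_minus_v > 0, tau, 1.0 - tau)   # asymmetric squared loss
    return weight * q_minus_v**2

diffs = np.array([-1.0, 0.5, 2.0])   # Q(s, a) - V(s) at dataset actions a
print(expectile_loss(diffs).mean())
```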
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - The Role of Lookahead and Approximate Policy Evaluation in Policy
Iteration with Linear Value Function Approximation [14.528756508275622]
We show that when linear function approximation is used to represent the value function, a certain minimum amount of lookahead and multi-step return is needed.
When this condition is met, we characterize the finite-time performance of policies obtained using such approximate policy iteration.
arXiv Detail & Related papers (2021-09-28T01:20:08Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z) - Approximate Midpoint Policy Iteration for Linear Quadratic Control [1.0312968200748118]
We present a midpoint policy iteration algorithm to solve linear quadratic optimal control problems in both model-based and model-free settings.
We show that in the model-based setting it achieves cubic convergence, which is superior to standard policy iteration and policy gradient algorithms that achieve quadratic and linear convergence, respectively.
arXiv Detail & Related papers (2020-11-28T20:22:10Z) - Policy Gradient for Continuing Tasks in Non-stationary Markov Decision
Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
We compute unbiased navigation gradients of the value function which we use as ascent directions to update the policy.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
arXiv Detail & Related papers (2020-10-16T15:15:42Z) - Optimizing for the Future in Non-Stationary MDPs [52.373873622008944]
We present a policy gradient algorithm that maximizes a forecast of future performance.
We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques.
arXiv Detail & Related papers (2020-05-17T03:41:19Z) - Formal Policy Synthesis for Continuous-Space Systems via Reinforcement
Learning [0.0]
We show how reinforcement learning can be applied to compute policies that are finite-memory and deterministic.
We develop the required assumptions and theories for the convergence of the learned policy to the optimal policy.
We demonstrate the approach on a 4-dimensional cart-pole system and a 6-dimensional boat driving problem.
arXiv Detail & Related papers (2020-05-04T08:36:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.