Policy Gradient for Continuing Tasks in Non-stationary Markov Decision
Processes
- URL: http://arxiv.org/abs/2010.08443v1
- Date: Fri, 16 Oct 2020 15:15:42 GMT
- Title: Policy Gradient for Continuing Tasks in Non-stationary Markov Decision
Processes
- Authors: Santiago Paternain, Juan Andres Bazerque and Alejandro Ribeiro
- Abstract summary: Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
We compute unbiased stochastic gradients of the value function, which we use as ascent directions to update the policy.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning considers the problem of finding policies that
maximize an expected cumulative reward in a Markov decision process with
unknown transition probabilities. In this paper we consider the problem of
finding optimal policies assuming that they belong to a reproducing kernel
Hilbert space (RKHS). To that end we compute unbiased stochastic gradients of
the value function which we use as ascent directions to update the policy. A
major drawback of policy gradient-type algorithms is that they are limited to
episodic tasks unless stationarity assumptions are imposed. This prevents these
algorithms from being implemented fully online, even though online operation is
a desirable property for systems that need to adapt to new tasks and/or
environments during deployment.
The main requirement for a policy gradient algorithm to work is that the
estimate of the gradient at any point in time is an ascent direction for the
initial value function. In this work we establish that this is indeed the case,
which enables us to show convergence of the online algorithm to the critical
points of the initial value function. A numerical example shows the ability of
our online algorithm to learn to solve a navigation and surveillance problem,
in which an agent must loop between two goal locations. This example
corroborates our theoretical findings about the ascent directions of subsequent
stochastic gradients. It also shows how the agent running our online algorithm
succeeds in learning to navigate, following a continuing cyclic trajectory that
does not comply with the standard stationarity assumptions in the literature
for non-episodic training.
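For readers who want a concrete picture of the update described above, the following is a minimal, hypothetical sketch of an online REINFORCE-style policy gradient loop on a toy version of the two-goal navigation task. It is not the paper's algorithm: the paper works with RKHS policies and unbiased stochastic gradients for continuing tasks, whereas this sketch assumes a simple linear-Gaussian policy, truncated returns, and made-up dynamics, rewards, and hyperparameters, purely to illustrate the structure of acting, estimating a stochastic gradient, and taking an ascent step without episode resets.

```python
# Illustrative sketch only: a REINFORCE-style online policy gradient update on a
# toy two-goal navigation task. The paper parameterizes policies in an RKHS; the
# linear-Gaussian policy, dynamics, rewards, and hyperparameters below are
# assumptions made purely to show the overall update structure.
import numpy as np

rng = np.random.default_rng(0)
GOALS = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]   # two surveillance goals
GAMMA, SIGMA, LR, HORIZON = 0.95, 0.3, 1e-2, 50

def features(state, goal_idx):
    """Hand-crafted feature map: position, active goal, their difference, bias."""
    g = GOALS[goal_idx]
    return np.concatenate([state, g, g - state, [1.0]])   # shape (7,)

theta = np.zeros((2, 7))                                  # linear mean of the Gaussian policy

def sample_action(state, goal_idx):
    phi = features(state, goal_idx)
    mean = theta @ phi
    action = mean + SIGMA * rng.standard_normal(2)
    # gradient of log N(action; mean, SIGMA^2 I) with respect to theta
    grad_log = np.outer((action - mean) / SIGMA**2, phi)
    return action, grad_log

state, goal_idx = np.zeros(2), 0
for update in range(200):                                 # would run indefinitely in a truly online setting
    grads, rewards, s, g = [], [], state.copy(), goal_idx
    for t in range(HORIZON):                              # truncated rollout, no episode reset
        a, grad_log = sample_action(s, g)
        s = s + 0.1 * a + 0.01 * rng.standard_normal(2)   # toy point-mass dynamics
        rewards.append(-np.linalg.norm(s - GOALS[g]))     # reward for approaching the active goal
        grads.append(grad_log)
        if np.linalg.norm(s - GOALS[g]) < 0.1:
            g = 1 - g                                     # reaching a goal activates the other one
    # REINFORCE estimate of the discounted value gradient from this rollout
    returns = [sum(GAMMA ** (k - t) * rewards[k] for k in range(t, HORIZON))
               for t in range(HORIZON)]
    grad_hat = sum(GAMMA ** t * returns[t] * grads[t] for t in range(HORIZON))
    theta += LR * grad_hat                                # stochastic gradient ascent step
    state, goal_idx = s, g                                # continue from where the rollout ended
```

The point the sketch tries to mirror is that the agent never resets: each gradient estimate is computed from the state where the previous rollout ended, which is the online, continuing regime the paper analyzes.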
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- The Reinforce Policy Gradient Algorithm Revisited [7.894349646617293]
We revisit the Reinforce policy gradient algorithm from the literature.
We propose a major enhancement to the basic algorithm.
We provide a proof of convergence for this new algorithm.
arXiv Detail & Related papers (2023-10-08T04:05:13Z)
- Constrained Reinforcement Learning via Dissipative Saddle Flow Dynamics [5.270497591225775]
In constrained reinforcement learning (C-RL), an agent seeks to learn from the environment a policy that maximizes the expected cumulative reward.
Several algorithms rooted in sample-based primal-dual methods have recently been proposed to solve this problem in policy space.
We propose a novel algorithm for constrained RL that does not suffer from these limitations.
arXiv Detail & Related papers (2022-12-03T01:54:55Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Chaining Value Functions for Off-Policy Learning [22.54793586116019]
We discuss a novel family of off-policy prediction algorithms which are convergent by construction.
We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix.
Empirically, we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results.
arXiv Detail & Related papers (2022-01-17T15:26:47Z)
- Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms [1.776746672434207]
We study policy gradient (PG) for reinforcement learning in continuous time and space.
We propose two types of actor-critic algorithms for RL, in which we learn and update value functions and policies simultaneously and alternately.
arXiv Detail & Related papers (2021-11-22T14:27:04Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, together with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
- Deep Inverse Q-learning with Constraints [15.582910645906145]
We introduce a novel class of algorithms that only needs to solve the MDP underlying the demonstrated behavior once to recover the expert policy.
We show how to extend this class of algorithms to continuous state-spaces via function approximation and how to estimate a corresponding action-value function.
We evaluate the resulting algorithms called Inverse Action-value Iteration, Inverse Q-learning and Deep Inverse Q-learning on the Objectworld benchmark.
arXiv Detail & Related papers (2020-08-04T17:21:51Z)
- Optimizing for the Future in Non-Stationary MDPs [52.373873622008944]
We present a policy gradient algorithm that maximizes a forecast of future performance.
We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques.
arXiv Detail & Related papers (2020-05-17T03:41:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.