Related papers: Smoothed functional-based gradient algorithms for off-policy reinforcement learning: A non-asymptotic viewpoint

Smoothed functional-based gradient algorithms for off-policy reinforcement learning: A non-asymptotic viewpoint

URL: http://arxiv.org/abs/2101.02137v6
Date: Sun, 23 Jun 2024 18:34:27 GMT
Title: Smoothed functional-based gradient algorithms for off-policy reinforcement learning: A non-asymptotic viewpoint
Authors: Nithia Vijayan, Prashanth L. A,
Abstract summary: We propose two policy gradient algorithms for solving the problem of control in an off-policy reinforcement learning context. Both algorithms incorporate a smoothed functional (SF) based gradient estimation scheme.
Score: 8.087699764574788
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose two policy gradient algorithms for solving the problem of control in an off-policy reinforcement learning (RL) context. Both algorithms incorporate a smoothed functional (SF) based gradient estimation scheme. The first algorithm is a straightforward combination of importance sampling-based off-policy evaluation with SF-based gradient estimation. The second algorithm, inspired by the stochastic variance-reduced gradient (SVRG) algorithm, incorporates variance reduction in the update iteration. For both algorithms, we derive non-asymptotic bounds that establish convergence to an approximate stationary point. From these results, we infer that the first algorithm converges at a rate that is comparable to the well-known REINFORCE algorithm in an off-policy RL context, while the second algorithm exhibits an improved rate of convergence.

Related papers

A Cubic-regularized Policy Newton Algorithm for Reinforcement Learning [9.628032156001073]
We propose two policy Newton algorithms that incorporate cubic regularization. Both algorithms employ the likelihood ratio method to form estimates of the gradient and Hessian of the value function. In particular, the sample complexity of our algorithms to find an $epsilon$-SOSP is $O(epsilon-3.5)$, which is an improvement over the state-of-the-art sample complexity of $O(epsilon-4.5)$.
arXiv Detail & Related papers (2023-04-21T13:43:06Z)
Reinforcement Learning with Unbiased Policy Evaluation and Linear Function Approximation [11.345796608258434]
We provide performance guarantees for a variant of simulation-based policy iteration for controlling Markov decision processes. We analyze two algorithms; the first algorithm involves a least squares approach where a new set of weights associated with feature vectors is obtained via at least squares at each iteration. The second algorithm involves a two-time-scale approximation algorithm taking several steps of gradient descent towards the least squares solution.
arXiv Detail & Related papers (2022-10-13T20:16:19Z)
Random-reshuffled SARAH does not need a full gradient computations [61.85897464405715]
The StochAstic Recursive grAdientritHm (SARAH) algorithm is a variance reduced variant of the Gradient Descent (SGD) algorithm. In this paper, we remove the necessity of a full gradient. The aggregated gradients serve as an estimate of a full gradient in the SARAH algorithm.
arXiv Detail & Related papers (2021-11-26T06:00:44Z)
Lower Bounds and Optimal Algorithms for Smooth and Strongly Convex Decentralized Optimization Over Time-Varying Networks [79.16773494166644]
We consider the task of minimizing the sum of smooth and strongly convex functions stored in a decentralized manner across the nodes of a communication network. We design two optimal algorithms that attain these lower bounds. We corroborate the theoretical efficiency of these algorithms by performing an experimental comparison with existing state-of-the-art methods.
arXiv Detail & Related papers (2021-06-08T15:54:44Z)
Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement learning based ZO algorithm (ZO-RL) with learning the sampling policy for generating the perturbations in ZO optimization instead of using random sampling. Our results show that our ZO-RL algorithm can effectively reduce the variances of ZO gradient by learning a sampling policy, and converge faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs. bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad. We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step. Our results are expressed in a form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
Differentially Private Accelerated Optimization Algorithms [0.7874708385247353]
We present two classes of differentially private optimization algorithms. The first algorithm is inspired by Polyak's heavy ball method. The second class of algorithms are based on Nesterov's accelerated gradient method.
arXiv Detail & Related papers (2020-08-05T08:23:01Z)
Finite-sample Analysis of Greedy-GQ with Linear Function Approximation under Markovian Noise [23.62008807533706]
This paper develops the first finite-sample analysis for the Greedy-GQ algorithm. Our paper extends the finite-sample analyses of two timescale reinforcement learning learning algorithms.
arXiv Detail & Related papers (2020-05-20T16:35:19Z)
Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks. In this paper we analyze a variant of OptimisticOA algorithm for nonconcave minmax problems. Our experiments show that adaptive GAN non-adaptive gradient algorithms can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.