Learning Sampling Policy for Faster Derivative Free Optimization
- URL: http://arxiv.org/abs/2104.04405v1
- Date: Fri, 9 Apr 2021 14:50:59 GMT
- Title: Learning Sampling Policy for Faster Derivative Free Optimization
- Authors: Zhou Zhai, Bin Gu, and Heng Huang
- Abstract summary: We propose a new reinforcement learning based ZO algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that our ZO-RL algorithm can effectively reduce the variance of ZO gradients by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
- Score: 100.27518340593284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zeroth-order (ZO, also known as derivative-free) methods, which estimate the
gradient using only two function evaluations, have recently attracted much attention
because of their broad applications in the machine learning community. The
two function evaluations are normally generated with random perturbations from
standard Gaussian distribution. To speed up ZO methods, many methods, such as
variance reduced stochastic ZO gradients and learning an adaptive Gaussian
distribution, have recently been proposed to reduce the variances of ZO
gradients. However, it remains an open problem whether there is room to
further improve the convergence of ZO methods. To explore this problem, in this
paper, we propose a new reinforcement learning based ZO algorithm (ZO-RL) that
learns the sampling policy for generating the perturbations in ZO
optimization instead of using random sampling. To find the optimal policy, an
actor-critic RL algorithm called deep deterministic policy gradient (DDPG) with
two neural network function approximators is adopted. The learned sampling
policy guides the perturbed points in the parameter space to estimate a more
accurate ZO gradient. To the best of our knowledge, our ZO-RL is the first
algorithm to learn the sampling policy using reinforcement learning for ZO
optimization, which is complementary to existing methods. In particular, our ZO-RL
can be combined with existing ZO algorithms to further accelerate them.
Experimental results for different ZO optimization problems show
that our ZO-RL algorithm can effectively reduce the variance of ZO gradients by
learning a sampling policy, and converges faster than existing ZO algorithms in
different scenarios.
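To make the mechanism above concrete, here is a minimal Python sketch of the standard two-point ZO gradient estimator with Gaussian perturbations; the objective quadratic_loss, the smoothing parameter mu, and the direction count q are illustrative assumptions rather than values from the paper, and the optional sample_fn argument only marks where a learned sampling policy such as ZO-RL's would replace the random Gaussian draw.

```python
import numpy as np

def zo_gradient(f, x, mu=1e-3, q=10, sample_fn=None):
    """Two-point zeroth-order gradient estimate of f at x.

    Each perturbation direction u is drawn from a standard Gaussian by
    default; a learned sampling policy (as in ZO-RL) could instead supply
    the direction through `sample_fn`.
    """
    d = x.shape[0]
    grad = np.zeros(d)
    for _ in range(q):
        u = sample_fn(x) if sample_fn is not None else np.random.randn(d)
        # Two function evaluations per direction: forward difference.
        grad += (f(x + mu * u) - f(x)) / mu * u
    return grad / q

# Illustrative objective (not from the paper): a simple quadratic.
def quadratic_loss(x):
    return 0.5 * np.sum(x ** 2)

x = np.ones(5)
g_hat = zo_gradient(quadratic_loss, x)
print("ZO gradient estimate:", g_hat)
print("True gradient:       ", x)
```

Averaging over q random directions reduces the variance of the estimate; that variance is exactly the quantity the learned sampling policy in ZO-RL is meant to shrink further.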
Related papers
- Learning rate adaptive stochastic gradient descent optimization methods: numerical simulations for deep learning methods for partial differential equations and convergence analyses [5.052293146674794]
It is known that the standard stochastic gradient descent (SGD) optimization method, as well as accelerated and adaptive SGD optimization methods such as Adam, fail to converge if the learning rates do not converge to zero.
In this work we propose and study a learning-rate-adaptive approach for SGD optimization methods in which the learning rate is adjusted based on empirical estimates.
arXiv Detail & Related papers (2024-06-20T14:07:39Z) - BiSLS/SPS: Auto-tune Step Sizes for Stable Bi-level Optimization [33.082961718280245]
Existing algorithms involve two coupled learning rates that can be affected by approximation errors when computing hypergradients.
We investigate the use of adaptive step-size methods, namely line search (SLS) and Polyak step size (SPS), for computing both the upper and lower-level learning rates.
New algorithms, which are available in both SGD and Adam versions, can find large learning rates with minimal tuning and converge faster than corresponding vanilla BO algorithms.
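For context on the Polyak-type step size mentioned above, the following is a minimal sketch of the generic stochastic Polyak step size (SPS) inside a plain SGD loop, not the bilevel BiSLS/SPS variants from the paper; the cap eta_max, the constant c, and the interpolation assumption f_i^* = 0 are illustrative choices.

```python
import numpy as np

def sps_step_size(loss_i, grad_i, c=0.5, eta_max=1.0, f_star_i=0.0):
    """Stochastic Polyak step size for one sampled loss f_i:
    eta = (f_i(x) - f_i^*) / (c * ||grad f_i(x)||^2), capped at eta_max.

    Assumes the per-sample optimal value f_i^* is known (often taken as 0
    for interpolating models); this is an illustrative sketch only.
    """
    denom = c * np.dot(grad_i, grad_i) + 1e-12  # avoid division by zero
    return min(eta_max, (loss_i - f_star_i) / denom)

# Illustrative use inside a plain SGD loop on a least-squares problem.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
x_true = rng.normal(size=5)
b = A @ x_true  # interpolation holds, so each f_i^* = 0
x = np.zeros(5)
for _ in range(200):
    i = rng.integers(100)
    residual = A[i] @ x - b[i]
    loss_i = 0.5 * residual ** 2
    grad_i = residual * A[i]
    x -= sps_step_size(loss_i, grad_i) * grad_i
print("parameter error:", np.linalg.norm(x - x_true))
```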
arXiv Detail & Related papers (2023-05-30T00:37:50Z) - Offline Policy Optimization in RL with Variance Regularizaton [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z) - Constrained Reinforcement Learning via Dissipative Saddle Flow Dynamics [5.270497591225775]
In constrained reinforcement learning (C-RL), an agent seeks to learn from the environment a policy that maximizes the expected cumulative reward while satisfying constraints on expected cumulative costs.
Several algorithms rooted in sampled-based primal-dual methods have been recently proposed to solve this problem in policy space.
We propose a novel algorithm for constrained RL that does not suffer from these limitations.
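As a reference point for the sampled primal-dual methods this entry contrasts against, below is a generic Lagrangian primal-dual update sketch; it is not the paper's dissipative saddle-flow algorithm, and the constraint threshold d, the step sizes, and the toy gradient estimates are placeholders.

```python
import numpy as np

def primal_dual_step(theta, lam, grad_reward, grad_cost, cost_value,
                     d=0.1, eta_theta=1e-2, eta_lam=1e-2):
    """One generic sampled primal-dual update for
    max_theta J_r(theta) subject to J_c(theta) <= d, using the Lagrangian
    L(theta, lam) = J_r(theta) - lam * (J_c(theta) - d).

    theta: gradient ascent on L; lam: projected ascent on the constraint.
    """
    theta = theta + eta_theta * (grad_reward - lam * grad_cost)
    lam = max(0.0, lam + eta_lam * (cost_value - d))
    return theta, lam

# Toy usage with synthetic (placeholder) gradient and cost estimates.
theta, lam = np.zeros(3), 0.0
for _ in range(100):
    grad_reward = -theta + 1.0           # stand-in for a policy-gradient estimate
    grad_cost = 2.0 * theta              # stand-in for a constraint gradient
    cost_value = float(np.sum(theta ** 2))
    theta, lam = primal_dual_step(theta, lam, grad_reward, grad_cost, cost_value)
print("theta:", theta, "lambda:", lam)
```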
arXiv Detail & Related papers (2022-12-03T01:54:55Z) - A Policy Efficient Reduction Approach to Convex Constrained Deep Reinforcement Learning [2.811714058940267]
We propose a new variant of the conditional gradient (CG) type algorithm, which generalizes the minimum norm point (MNP) method.
Our method reduces the memory costs by an order of magnitude, and achieves better performance, demonstrating both its effectiveness and efficiency.
arXiv Detail & Related papers (2021-08-29T20:51:32Z) - Bregman Gradient Policy Optimization [97.73041344738117]
We design a Bregman gradient policy optimization for reinforcement learning based on Bregman divergences and momentum techniques.
VR-BGPO reaches the best complexity $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, requiring only one trajectory at each iteration.
arXiv Detail & Related papers (2021-06-23T01:08:54Z) - Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z) - Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z) - Learning to be Global Optimizer [28.88646928299302]
We learn an optimal network and an escaping capability for some benchmark functions.
We show that the learned algorithm significantly outperforms some well-known classical optimization algorithms.
arXiv Detail & Related papers (2020-03-10T03:46:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.