Second-order Neural Network Training Using Complex-step Directional
Derivative
- URL: http://arxiv.org/abs/2009.07098v1
- Date: Tue, 15 Sep 2020 13:46:57 GMT
- Title: Second-order Neural Network Training Using Complex-step Directional
Derivative
- Authors: Siyuan Shen, Tianjia Shao, Kun Zhou, Chenfanfu Jiang, Feng Luo, Yin
Yang
- Abstract summary: We introduce a numerical algorithm for second-order neural network training.
We tackle the practical obstacle of Hessian calculation by using the complex-step finite difference.
We believe our method will inspire a wide range of new algorithms for deep learning and numerical optimization.
- Score: 41.4333906662624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While the superior performance of second-order optimization methods such as
Newton's method is well known, they are hardly used in practice for deep
learning because neither assembling the Hessian matrix nor calculating its
inverse is feasible for large-scale problems. Existing second-order methods
resort to various diagonal or low-rank approximations of the Hessian, which
often fail to capture necessary curvature information to generate a substantial
improvement. On the other hand, when training becomes batch-based (i.e.,
stochastic), noisy second-order information easily contaminates the training
procedure unless expensive safeguards are employed. In this paper, we adopt a
numerical algorithm for second-order neural network training. We tackle the
practical obstacle of Hessian calculation by using the complex-step finite
difference (CSFD) -- a numerical procedure adding an imaginary perturbation to
the function for derivative computation. CSFD is highly robust, efficient, and
accurate (as accurate as the analytic result). This method allows us to
directly apply any known second-order optimization method to deep learning
training. Based on it, we design an effective Newton-Krylov procedure. The key
mechanism is to terminate the stochastic Krylov iteration as soon as a
disturbing direction is found so that unnecessary computation can be avoided.
During the optimization, we monitor the approximation error in the Taylor
expansion to adjust the step size. This strategy combines the advantages of
line search and trust region methods, allowing our method to preserve both good
local and global convergence. We have tested our method on various deep
learning tasks. The experiments show that our method outperforms existing
methods, and it often converges an order of magnitude faster. We believe our
method will inspire a wide range of new algorithms for deep learning and numerical
optimization.
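
As a concrete illustration of the mechanism the abstract describes, the following is a minimal sketch on a toy least-squares problem, not the authors' implementation: the complex-step directional derivative of the gradient yields Hessian-vector products for a plain conjugate-gradient Newton-Krylov step, and a crude Taylor-model backtracking stands in for the paper's step-size monitoring. All names (loss, grad, csfd_hvp, newton_krylov_step) and the CG/backtracking details are illustrative assumptions; CSFD also requires the objective to be implemented with operations that remain analytic for complex inputs.

```python
# Minimal CSFD sketch on a toy least-squares problem (illustrative only).
import numpy as np

def loss(w, X, y):
    r = X @ w - y
    return 0.5 * np.sum(r * r)      # r*r (not |r|^2) keeps the expression analytic in w

def grad(w, X, y):
    return X.T @ (X @ w - y)        # analytic gradient, also valid for complex w

def csfd_derivative(f, w, v, h=1e-30):
    # First-order CSFD: d/dt f(w + t*v)|_{t=0} ~ Im(f(w + i*h*v)) / h.
    # No subtractive cancellation, so h can be tiny and the result matches
    # the analytic directional derivative to machine precision.
    return np.imag(f(w + 1j * h * v)) / h

def csfd_hvp(w, v, X, y, h=1e-30):
    # Hessian-vector product without assembling the Hessian:
    # H(w) v ~ Im(grad(w + i*h*v)) / h  (complex-step directional derivative of the gradient).
    return np.imag(grad(w + 1j * h * v, X, y)) / h

def newton_krylov_step(w, X, y, cg_iters=20, tol=1e-8):
    # Approximately solve H p = -g with conjugate gradients, using only CSFD
    # Hessian-vector products. The non-positive-curvature break is a crude
    # stand-in for the paper's early termination on disturbing directions.
    g = grad(w, X, y)
    p = np.zeros_like(w)
    r = -g.copy()
    d = r.copy()
    rs = r @ r
    for _ in range(cg_iters):
        Hd = csfd_hvp(w, d, X, y)
        dHd = d @ Hd
        if dHd <= 0:
            break
        alpha = rs / dHd
        p += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p, g

# Usage: one damped Newton-Krylov update, shrinking the step when the actual
# decrease disagrees with the second-order Taylor prediction (a rough analogue
# of the step-size monitoring described in the abstract).
rng = np.random.default_rng(0)
X, y = rng.standard_normal((50, 10)), rng.standard_normal(50)
w = np.zeros(10)
p, g = newton_krylov_step(w, X, y)
print("CSFD directional derivative:", csfd_derivative(lambda u: loss(u, X, y), w, p),
      "vs analytic:", g @ p)
t = 1.0
pred = -(g @ p + 0.5 * p @ csfd_hvp(w, p, X, y))   # predicted decrease from the quadratic model
while loss(w, X, y) - loss(w + t * p, X, y) < 0.25 * t * pred and t > 1e-4:
    t *= 0.5
w = w + t * p
print("loss after one step:", loss(w, X, y))
```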
Related papers
- Newton Losses: Using Curvature Information for Learning with Differentiable Algorithms [80.37846867546517]
We show how to train eight different neural networks with custom objectives.
We exploit their second-order information via their empirical Fisher and Hessian matrices.
We apply Newton Losses to achieve significant improvements for less differentiable algorithms.
arXiv Detail & Related papers (2024-10-24T18:02:11Z) - A Computationally Efficient Sparsified Online Newton Method [48.78646010774149]
Sparsified Online Newton (SONew) is a memory-efficient second-order algorithm that yields a sparsified yet effective preconditioner.
We achieve up to 30% faster convergence, 3.4% relative improvement in validation performance, and 80% relative improvement in training loss.
arXiv Detail & Related papers (2023-11-16T18:44:22Z) - Efficient Methods for Non-stationary Online Learning [67.3300478545554]
We present efficient methods for optimizing dynamic regret and adaptive regret, which reduce the number of projections per round from $\mathcal{O}(\log T)$ to $1$.
Our technique hinges on the reduction mechanism developed in parameter-free online learning and requires non-trivial twists on non-stationary online methods.
arXiv Detail & Related papers (2023-09-16T07:30:12Z) - BOME! Bilevel Optimization Made Easy: A Simple First-Order Approach [46.457298683984924]
Bilevel optimization (BO) is useful for solving a variety of important machine learning problems.
Conventional methods need to differentiate through the lower-level optimization process with implicit differentiation.
First-order BO depends only on first-order information and requires no implicit differentiation.
arXiv Detail & Related papers (2022-09-19T01:51:12Z) - On the efficiency of Stochastic Quasi-Newton Methods for Deep Learning [0.0]
We study the behaviour of quasi-Newton training algorithms for deep neural networks.
We show that quasi-Newton methods are efficient and, in some instances, able to outperform the well-known first-order Adam optimizer.
arXiv Detail & Related papers (2022-05-18T20:53:58Z) - A Boosting Approach to Reinforcement Learning [59.46285581748018]
We study efficient algorithms for reinforcement learning in decision processes whose complexity is independent of the number of states.
We give an efficient algorithm capable of improving the accuracy of weak learning methods.
arXiv Detail & Related papers (2021-08-22T16:00:45Z) - Exact Stochastic Second Order Deep Learning [0.0]
Deep Learning is mainly dominated by first-order methods which are built around the central concept of backpropagation.
Second-order methods, which take second-order derivatives into account, are less used than first-order methods.
arXiv Detail & Related papers (2021-04-08T14:29:31Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only a single data point per step.
Our results are expressed in the form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)