Second-order Neural Network Training Using Complex-step Directional
Derivative
- URL: http://arxiv.org/abs/2009.07098v1
- Date: Tue, 15 Sep 2020 13:46:57 GMT
- Title: Second-order Neural Network Training Using Complex-step Directional
Derivative
- Authors: Siyuan Shen, Tianjia Shao, Kun Zhou, Chenfanfu Jiang, Feng Luo, Yin
Yang
- Abstract summary: We introduce a numerical algorithm for second-order neural network training.
We tackle the practical obstacle of Hessian calculation by using the complex-step finite difference.
We believe our method will inspire a wide range of new algorithms for deep learning and numerical optimization.
- Score: 41.4333906662624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While the superior performance of second-order optimization methods such as
Newton's method is well known, they are hardly used in practice for deep
learning because neither assembling the Hessian matrix nor calculating its
inverse is feasible for large-scale problems. Existing second-order methods
resort to various diagonal or low-rank approximations of the Hessian, which
often fail to capture necessary curvature information to generate a substantial
improvement. On the other hand, when training becomes batch-based (i.e.,
stochastic), noisy second-order information easily contaminates the training
procedure unless expensive safeguards are employed. In this paper, we adopt a
numerical algorithm for second-order neural network training. We tackle the
practical obstacle of Hessian calculation by using the complex-step finite
difference (CSFD) -- a numerical procedure adding an imaginary perturbation to
the function for derivative computation. CSFD is highly robust, efficient, and
accurate (as accurate as the analytic result). This method allows us to
directly apply any known second-order optimization method to deep learning
training. Based on it, we design an effective Newton-Krylov procedure. The key
mechanism is to terminate the stochastic Krylov iteration as soon as a
disturbing direction is found so that unnecessary computation can be avoided.
During the optimization, we monitor the approximation error in the Taylor
expansion to adjust the step size. This strategy combines the advantages of
line search and trust region methods, allowing our method to preserve both good
local and global convergence. We have tested our method on various deep
learning tasks. The experiments show that our method outperforms existing
methods, and it often converges an order of magnitude faster. We believe our
method will inspire a wide range of new algorithms for deep learning and numerical
optimization.
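
As a concrete illustration of the mechanism the abstract describes, the following is a minimal sketch on a toy least-squares problem, not the authors' implementation: the complex-step directional derivative of the gradient yields Hessian-vector products for a plain conjugate-gradient Newton-Krylov step, and a crude Taylor-model backtracking stands in for the paper's step-size monitoring. All names (loss, grad, csfd_hvp, newton_krylov_step) and the CG/backtracking details are illustrative assumptions; CSFD also requires the objective to be implemented with operations that remain analytic for complex inputs.

```python
# Minimal CSFD sketch on a toy least-squares problem (illustrative only).
import numpy as np

def loss(w, X, y):
    r = X @ w - y
    return 0.5 * np.sum(r * r)      # r*r (not |r|^2) keeps the expression analytic in w

def grad(w, X, y):
    return X.T @ (X @ w - y)        # analytic gradient, also valid for complex w

def csfd_derivative(f, w, v, h=1e-30):
    # First-order CSFD: d/dt f(w + t*v)|_{t=0} ~ Im(f(w + i*h*v)) / h.
    # No subtractive cancellation, so h can be tiny and the result matches
    # the analytic directional derivative to machine precision.
    return np.imag(f(w + 1j * h * v)) / h

def csfd_hvp(w, v, X, y, h=1e-30):
    # Hessian-vector product without assembling the Hessian:
    # H(w) v ~ Im(grad(w + i*h*v)) / h  (complex-step directional derivative of the gradient).
    return np.imag(grad(w + 1j * h * v, X, y)) / h

def newton_krylov_step(w, X, y, cg_iters=20, tol=1e-8):
    # Approximately solve H p = -g with conjugate gradients, using only CSFD
    # Hessian-vector products. The non-positive-curvature break is a crude
    # stand-in for the paper's early termination on disturbing directions.
    g = grad(w, X, y)
    p = np.zeros_like(w)
    r = -g.copy()
    d = r.copy()
    rs = r @ r
    for _ in range(cg_iters):
        Hd = csfd_hvp(w, d, X, y)
        dHd = d @ Hd
        if dHd <= 0:
            break
        alpha = rs / dHd
        p += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p, g

# Usage: one damped Newton-Krylov update, shrinking the step when the actual
# decrease disagrees with the second-order Taylor prediction (a rough analogue
# of the step-size monitoring described in the abstract).
rng = np.random.default_rng(0)
X, y = rng.standard_normal((50, 10)), rng.standard_normal(50)
w = np.zeros(10)
p, g = newton_krylov_step(w, X, y)
print("CSFD directional derivative:", csfd_derivative(lambda u: loss(u, X, y), w, p),
      "vs analytic:", g @ p)
t = 1.0
pred = -(g @ p + 0.5 * p @ csfd_hvp(w, p, X, y))   # predicted decrease from the quadratic model
while loss(w, X, y) - loss(w + t * p, X, y) < 0.25 * t * pred and t > 1e-4:
    t *= 0.5
w = w + t * p
print("loss after one step:", loss(w, X, y))
```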
Related papers
- Newton Losses: Using Curvature Information for Learning with Differentiable Algorithms [80.37846867546517]
We show how to train eight different neural networks with custom objectives.
We exploit their second-order information via their empirical Fisher and Hessian matrices.
We apply Newton Losses to achieve significant improvements for less differentiable algorithms.
arXiv Detail & Related papers (2024-10-24T18:02:11Z) - A Computationally Efficient Sparsified Online Newton Method [48.78646010774149]
Sparsified Online Newton (SONew) is a memory-efficient second-order algorithm that yields a sparsified yet effective preconditioner.
We achieve up to 30% faster convergence, 3.4% relative improvement in validation performance, and 80% relative improvement in training loss.
arXiv Detail & Related papers (2023-11-16T18:44:22Z) - Efficient Methods for Non-stationary Online Learning [67.3300478545554]
We present efficient methods for optimizing dynamic regret and adaptive regret, which reduce the number of projections per round from $\mathcal{O}(\log T)$ to $1$.
Our technique hinges on the reduction mechanism developed in parameter-free online learning and requires non-trivial twists on non-stationary online methods.
arXiv Detail & Related papers (2023-09-16T07:30:12Z) - BOME! Bilevel Optimization Made Easy: A Simple First-Order Approach [46.457298683984924]
Bilevel optimization (BO) is useful for solving a variety of important machine learning problems.
Conventional methods need to differentiate through the lower-level optimization process with implicit differentiation.
First-order BO depends only on first-order information and requires no implicit differentiation.
arXiv Detail & Related papers (2022-09-19T01:51:12Z) - On the efficiency of Stochastic Quasi-Newton Methods for Deep Learning [0.0]
We study the behaviour of quasi-Newton training algorithms for deep neural networks.
We show that quasi-Newton methods are efficient and, in some instances, able to outperform the well-known first-order Adam optimizer.
arXiv Detail & Related papers (2022-05-18T20:53:58Z) - A Boosting Approach to Reinforcement Learning [59.46285581748018]
We study efficient algorithms for reinforcement learning in decision processes whose complexity is independent of the number of states.
We give an efficient algorithm capable of improving the accuracy of weak learning methods.
arXiv Detail & Related papers (2021-08-22T16:00:45Z) - Exact Stochastic Second Order Deep Learning [0.0]
Deep Learning is mainly dominated by first-order methods which are built around the central concept of backpropagation.
Second-order methods, which take second-order derivatives into account, are less used than first-order methods.
arXiv Detail & Related papers (2021-04-08T14:29:31Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only a single data point per step.
Our results are expressed in the form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)