Adapting Newton's Method to Neural Networks through a Summary of
Higher-Order Derivatives
- URL: http://arxiv.org/abs/2312.03885v2
- Date: Sat, 3 Feb 2024 09:00:08 GMT
- Title: Adapting Newton's Method to Neural Networks through a Summary of
Higher-Order Derivatives
- Authors: Pierre Wolinski
- Abstract summary: We consider a gradient-based optimization method applied to a function $\mathcal{L}$ of a vector of variables $\boldsymbol{\theta}$, represented as a tuple of tensors.
This framework encompasses many common use-cases, such as training neural networks by gradient descent.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider a gradient-based optimization method applied to a function
$\mathcal{L}$ of a vector of variables $\boldsymbol{\theta}$, in the case where
$\boldsymbol{\theta}$ is represented as a tuple of tensors $(\mathbf{T}_1,
\cdots, \mathbf{T}_S)$. This framework encompasses many common use-cases, such
as training neural networks by gradient descent. First, we propose a
computationally inexpensive technique providing higher-order information on
$\mathcal{L}$, especially about the interactions between the tensors
$\mathbf{T}_s$, based on automatic differentiation and computational tricks.
Second, we use this technique at order 2 to build a second-order optimization
method which is suitable, among other things, for training deep neural networks
of various architectures. This second-order method leverages the partition
structure of $\boldsymbol{\theta}$ into tensors $(\mathbf{T}_1, \cdots,
\mathbf{T}_S)$ in such a way that it requires neither the computation of the
Hessian of $\mathcal{L}$ with respect to $\boldsymbol{\theta}$ nor any
approximation of it. The key step is the computation of a smaller matrix,
interpretable as a "Hessian according to the partition", which can be obtained
exactly and efficiently. In contrast to many existing practical second-order
methods used in neural networks, which perform a diagonal or block-diagonal
approximation of the Hessian or its inverse, the method we propose does not
neglect interactions between layers. Finally, we can tune the coarseness of the
partition to recover well-known optimization methods: the coarsest case
corresponds to Cauchy's steepest descent method, while the finest case
corresponds to the usual Newton's method.
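As a rough, hypothetical illustration of the idea (not the paper's actual algorithm: the function name partition_newton_step, the damping term, and every implementation detail below are assumptions), the following PyTorch sketch builds a small $S \times S$ matrix playing the role of a "Hessian according to the partition" from Hessian-vector products along per-tensor gradient directions, then solves it for one step size per tensor. With $S = 1$ the step reduces to Cauchy's steepest-descent step $\eta = \mathbf{g}^\top \mathbf{g} / (\mathbf{g}^\top \mathbf{H} \mathbf{g})$.
```python
import torch

def partition_newton_step(loss_fn, params, damping=1e-5):
    # Hypothetical sketch: one learning rate per tensor T_s, obtained from a
    # small S x S matrix H_bar[s, t] ~ u_s^T H u_t, where u_t is the loss
    # gradient restricted to tensor T_t. The full Hessian H is never formed;
    # each column of H_bar comes from one Hessian-vector product (double backprop).
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    S = len(params)

    g_bar = torch.stack([(g.detach() ** 2).sum() for g in grads])
    H_bar = torch.zeros(S, S)
    for t in range(S):
        # Hessian-vector product H @ u_t, returned blockwise over all tensors.
        hvp = torch.autograd.grad(grads[t], params,
                                  grad_outputs=grads[t].detach(),
                                  retain_graph=True, allow_unused=True)
        for s in range(S):
            if hvp[s] is not None:
                H_bar[s, t] = (hvp[s] * grads[s].detach()).sum()

    # Solve the small damped system for per-tensor step sizes; with S == 1
    # this is Cauchy's steepest-descent step g^T g / (g^T H g).
    eta = torch.linalg.solve(H_bar + damping * torch.eye(S), g_bar)

    with torch.no_grad():
        for p, g, e in zip(params, grads, eta):
            p.add_(g.detach(), alpha=-float(e))
    return float(loss)

# Toy usage (illustrative only): a tiny regression problem.
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Tanh(), torch.nn.Linear(8, 1))
x, y = torch.randn(64, 4), torch.randn(64, 1)
params = list(model.parameters())
for _ in range(20):
    print(partition_newton_step(lambda: ((model(x) - y) ** 2).mean(), params))
```
In this sketch, tuning the coarseness of the partition amounts to choosing how params is split: flattening everything into one tensor recovers steepest descent, while splitting down to individual scalars would recover a Newton-like step at Newton-like cost.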
Related papers
- A Quasilinear Algorithm for Computing Higher-Order Derivatives of Deep Feed-Forward Neural Networks [0.0]
$n$-TangentProp computes the exact derivative $d^n f(x)/dx^n$ in quasilinear instead of exponential time (a hypothetical sketch of the naive nested-autodiff baseline appears after this list).
We demonstrate that our method is particularly beneficial in the context of physics-informed neural networks.
arXiv Detail & Related papers (2024-12-12T22:57:28Z)
- FORML: A Riemannian Hessian-free Method for Meta-learning on Stiefel Manifolds [4.757859522106933]
This paper introduces a Hessian-free approach that uses a first-order approximation of derivatives on the Stiefel manifold.
Our method significantly reduces the computational load and memory footprint.
arXiv Detail & Related papers (2024-02-28T10:57:30Z)
- Self-concordant Smoothing for Large-Scale Convex Composite Optimization [0.0]
We introduce a notion of self-concordant smoothing for minimizing the sum of two convex functions, one smooth and the other possibly nonsmooth.
We prove the convergence of two resulting algorithms: Prox-N-SCORE, a proximal Newton algorithm, and Prox-GGN-SCORE, a proximal generalized Gauss-Newton algorithm.
arXiv Detail & Related papers (2023-09-04T19:47:04Z)
- Stochastic Marginal Likelihood Gradients using Neural Tangent Kernels [78.6096486885658]
We introduce lower bounds to the linearized Laplace approximation of the marginal likelihood.
These bounds are amenable to gradient-based optimization and allow trading off estimation accuracy against computational complexity.
arXiv Detail & Related papers (2023-06-06T19:02:57Z)
- Optimization using Parallel Gradient Evaluations on Multiple Parameters [51.64614793990665]
We propose a first-order method for convex optimization, where gradients from multiple parameters can be used during each step of gradient descent.
Our method uses gradients from multiple parameters in synergy to update these parameters together towards the optimum.
arXiv Detail & Related papers (2023-02-06T23:39:13Z)
- Combinatorial optimization for low bit-width neural networks [23.466606660363016]
Low bit-width neural networks have been extensively explored for deployment on edge devices to reduce computational resources.
Existing approaches have focused on gradient-based optimization in a two-stage train-and-compress setting.
We show that a combination of greedy coordinate descent and the proposed combinatorial approach can attain competitive accuracy on binary classification tasks.
arXiv Detail & Related papers (2022-06-04T15:02:36Z)
- Reducing the Variance of Gaussian Process Hyperparameter Optimization with Preconditioning [54.01682318834995]
Preconditioning is a highly effective step for any iterative method involving matrix-vector multiplication.
We prove that preconditioning has an additional, previously unexplored benefit: it can simultaneously reduce variance at essentially negligible cost.
arXiv Detail & Related papers (2021-07-01T06:43:11Z)
- SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
arXiv Detail & Related papers (2021-06-01T15:07:34Z)
- Implicit differentiation for fast hyperparameter selection in non-smooth convex learning [87.60600646105696]
We study first-order methods when the inner optimization problem is convex but non-smooth.
We show that the forward-mode differentiation of proximal gradient descent and proximal coordinate descent yields sequences of Jacobians converging toward the exact Jacobian.
arXiv Detail & Related papers (2021-05-04T17:31:28Z)
- DyCo3D: Robust Instance Segmentation of 3D Point Clouds through Dynamic Convolution [136.7261709896713]
We propose a data-driven approach that generates the appropriate convolution kernels to apply in response to the nature of the instances.
The proposed method achieves promising results on both ScanNetV2 and S3DIS.
It also improves inference speed by more than 25% over the current state-of-the-art.
arXiv Detail & Related papers (2020-11-26T14:56:57Z)
- Supervised Learning for Non-Sequential Data: A Canonical Polyadic Decomposition Approach [85.12934750565971]
Efficient modelling of feature interactions underpins supervised learning for non-sequential tasks.
Since explicitly modelling all feature interactions scales poorly, it has been proposed to implicitly represent the model parameters as a tensor.
For enhanced expressiveness, we generalize the framework to allow feature mapping to arbitrarily high-dimensional feature vectors.
arXiv Detail & Related papers (2020-01-27T22:38:40Z)
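For the $n$-TangentProp entry above, the sketch below shows only the naive nested-autodiff baseline that such methods aim to avoid (the quasilinear algorithm itself is not reproduced, and the helper name nth_derivative is an assumption): each extra derivative differentiates the previous graph again, which is what makes exact computation of $d^n f(x)/dx^n$ expensive for deep networks.
```python
import torch

def nth_derivative(f, x, n):
    # Naive baseline: compute d^n f / dx^n by differentiating the graph n times.
    # Each torch.autograd.grad call differentiates the previous result, so the
    # cost grows quickly with n; this is the behaviour quasilinear methods avoid.
    y = f(x)
    for _ in range(n):
        (y,) = torch.autograd.grad(y, x, create_graph=True)
    return y

x = torch.tensor(0.5, requires_grad=True)
print(nth_derivative(torch.sin, x, 3))  # third derivative of sin(x) is -cos(x)
```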
This list is automatically generated from the titles and abstracts of the papers in this site.