Adapting Newton's Method to Neural Networks through a Summary of
Higher-Order Derivatives
- URL: http://arxiv.org/abs/2312.03885v2
- Date: Sat, 3 Feb 2024 09:00:08 GMT
- Title: Adapting Newton's Method to Neural Networks through a Summary of
Higher-Order Derivatives
- Authors: Pierre Wolinski
- Abstract summary: We consider a gradient-based optimization method applied to a function $\mathcal{L}$ of a vector of variables $\boldsymbol{\theta}$.
This framework encompasses many common use-cases, such as training neural networks by gradient descent.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider a gradient-based optimization method applied to a function
$\mathcal{L}$ of a vector of variables $\boldsymbol{\theta}$, in the case where
$\boldsymbol{\theta}$ is represented as a tuple of tensors $(\mathbf{T}_1,
\cdots, \mathbf{T}_S)$. This framework encompasses many common use-cases, such
as training neural networks by gradient descent. First, we propose a
computationally inexpensive technique providing higher-order information on
$\mathcal{L}$, especially about the interactions between the tensors
$\mathbf{T}_s$, based on automatic differentiation and computational tricks.
Second, we use this technique at order 2 to build a second-order optimization
method which is suitable, among other things, for training deep neural networks
of various architectures. This second-order method leverages the partition
structure of $\boldsymbol{\theta}$ into tensors $(\mathbf{T}_1, \cdots,
\mathbf{T}_S)$, in such a way that it requires neither the computation of the
Hessian of $\mathcal{L}$ according to $\boldsymbol{\theta}$, nor any
approximation of it. The key part consists in computing a smaller matrix
interpretable as a "Hessian according to the partition", which can be computed
exactly and efficiently. In contrast to many existing practical second-order
methods used in neural networks, which perform a diagonal or block-diagonal
approximation of the Hessian or its inverse, the method we propose does not
neglect interactions between layers. Finally, we can tune the coarseness of the
partition to recover well-known optimization methods: the coarsest case
corresponds to Cauchy's steepest descent method, the finest case corresponds to
the usual Newton's method.
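The abstract's closing claim about partition coarseness can be checked on a toy quadratic. The sketch below is a hedged reconstruction, not code from the paper: for a partition of the coordinates into blocks, it forms the small $S \times S$ matrix $\bar{H}[s,t] = g_s^\top H_{st}\, g_t$ from per-block gradients, solves for per-block step sizes, and moves each block along its own gradient. The function name and the quadratic test problem are illustrative assumptions.

```python
import numpy as np

def partition_step(g, H, blocks):
    """Per-block second-order step: solve the small S x S system
    Hbar @ alpha = gbar, with Hbar[s, t] = g_s' H_{st} g_t and
    gbar[s] = ||g_s||^2, then move each block by alpha[s] * g_s."""
    S = len(blocks)
    Hbar = np.zeros((S, S))
    gbar = np.zeros(S)
    for s, bs in enumerate(blocks):
        gbar[s] = g[bs] @ g[bs]
        for t, bt in enumerate(blocks):
            Hbar[s, t] = g[bs] @ H[np.ix_(bs, bt)] @ g[bt]
    alpha = np.linalg.solve(Hbar, gbar)  # per-block step sizes
    step = np.zeros_like(g)
    for s, bs in enumerate(blocks):
        step[bs] = alpha[s] * g[bs]
    return step

# Toy quadratic L(x) = 0.5 x'Ax - b'x, so gradient = Ax - b and Hessian = A.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4.0 * np.eye(4)   # well-conditioned SPD Hessian
b = rng.standard_normal(4)
x = rng.standard_normal(4)
g = A @ x - b

# Coarsest partition (one block): Cauchy's steepest-descent step.
cauchy = partition_step(g, A, [np.arange(4)])
# Finest partition (singleton blocks): the full Newton step A^{-1} g.
newton = partition_step(g, A, [np.array([i]) for i in range(4)])
```

On a quadratic, the coarsest partition reproduces the Cauchy step length $g^\top g / (g^\top H g)$, while singleton blocks reproduce $H^{-1} g$ exactly, matching the two limits named in the abstract.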
Related papers
- Inverting the Leverage Score Gradient: An Efficient Approximate Newton Method [10.742859956268655]
This paper aims to recover the intrinsic model parameters given the leverage scores gradient.
We specifically scrutinize the inversion of the leverage score gradient, denoted as $g(x)$.
arXiv Detail & Related papers (2024-08-21T01:39:42Z)
- Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*(\langle \boldsymbol{x}, \boldsymbol{\theta} \rangle)$ under isotropic Gaussian data.
We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ of arbitrary link function with a sample and runtime complexity of $n \asymp T \asymp C(q) \cdot d$
arXiv Detail & Related papers (2024-06-03T17:56:58Z)
- Randomized Block-Coordinate Optimistic Gradient Algorithms for
Root-Finding Problems [8.0153031008486]
We develop two new algorithms to approximate a solution of nonlinear equations in large-scale settings.
We apply our methods to a class of large-scale finite-sum inclusions, which covers prominent applications in machine learning.
arXiv Detail & Related papers (2023-01-08T21:46:27Z)
- Optimal Gradient Sliding and its Application to Distributed Optimization
Under Similarity [121.83085611327654]
We study structured convex optimization problems with an additive objective $r := p + q$, where $r$ is $\mu$-strongly convex.
We propose a method that solves such problems using master-to-agents communication and local oracle calls.
The proposed method is much sharper than the $\mathcal{O}(\sqrt{L_q/\mu})$ method.
arXiv Detail & Related papers (2022-05-30T14:28:02Z)
- High-dimensional Asymptotics of Feature Learning: How One Gradient Step
Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
arXiv Detail & Related papers (2022-05-03T12:09:59Z)
- Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on over-parametrized objective could go beyond the lazy training regime and utilize certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z)
- Training (Overparametrized) Neural Networks in Near-Linear Time [21.616949485102342]
We show how to speed up the algorithm of [CGH+1] for training (mildly overparametrized) ReLU networks.
The centerpiece of our algorithm is to reformulate the Gauss-Newton iteration as an $\ell_2$-regression problem.
arXiv Detail & Related papers (2020-06-20T20:26:14Z)
- Linear Time Sinkhorn Divergences using Positive Features [51.50788603386766]
Solving optimal transport with an entropic regularization requires computing an $n \times n$ kernel matrix that is repeatedly applied to a vector.
We propose to use instead ground costs of the form $c(x,y) = -\log \langle \varphi(x), \varphi(y) \rangle$ where $\varphi$ is a map from the ground space onto the positive orthant $\mathbb{R}^r_+$, with $r \ll n$.
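The factorization trick in this summary can be illustrated in a few lines. The sketch below is an assumption-laden illustration (the function name and test data are made up): with $c(x,y) = -\log \langle \varphi(x), \varphi(y) \rangle$ and unit entropic regularization, the Gibbs kernel is $K = \exp(-c) = \Phi_x \Phi_y^\top$, a rank-$r$ matrix, so each Sinkhorn iteration applies $K$ in $O(nr)$ time without ever forming the $n \times n$ matrix.

```python
import numpy as np

def sinkhorn_low_rank(Phi_x, Phi_y, a, b, n_iter=200):
    """Sinkhorn scalings (u, v) for marginals a, b with the factored
    kernel K = Phi_x @ Phi_y.T, never materializing K itself."""
    u = np.ones(len(a))
    v = np.ones(len(b))
    for _ in range(n_iter):
        u = a / (Phi_x @ (Phi_y.T @ v))   # K @ v in O(n r)
        v = b / (Phi_y @ (Phi_x.T @ u))   # K.T @ u in O(n r)
    return u, v

rng = np.random.default_rng(1)
n, r = 50, 4
Phi_x = rng.uniform(0.5, 1.0, (n, r))   # maps into the positive orthant
Phi_y = rng.uniform(0.5, 1.0, (n, r))
a = np.full(n, 1.0 / n)
b = np.full(n, 1.0 / n)
u, v = sinkhorn_low_rank(Phi_x, Phi_y, a, b)
# Dense transport plan, built here only to check the marginals.
P = u[:, None] * (Phi_x @ Phi_y.T) * v[None, :]
```

Because $\varphi$ maps into the positive orthant, the factored kernel has strictly positive entries, which is what keeps the Sinkhorn iterations well defined and convergent.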
arXiv Detail & Related papers (2020-06-12T10:21:40Z)
- On the Modularity of Hypernetworks [103.1147622394852]
We show that for a structured target function, the overall number of trainable parameters in a hypernetwork is smaller by orders of magnitude than the number of trainable parameters of a standard neural network and an embedding method.
arXiv Detail & Related papers (2020-02-23T22:51:52Z)
- A Corrective View of Neural Networks: Representation, Memorization and
Learning [26.87238691716307]
We develop a corrective mechanism for neural network approximation.
We show that two-layer neural networks in the random features regime (RF) can memorize arbitrary labels.
We also consider three-layer neural networks and show that the corrective mechanism yields faster representation rates for smooth radial functions.
arXiv Detail & Related papers (2020-02-01T20:51:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.