Sketchy Empirical Natural Gradient Methods for Deep Learning
- URL: http://arxiv.org/abs/2006.05924v3
- Date: Thu, 25 Mar 2021 07:39:30 GMT
- Title: Sketchy Empirical Natural Gradient Methods for Deep Learning
- Authors: Minghan Yang, Dong Xu, Zaiwen Wen, Mengyun Chen and Pengxiang Xu
- Abstract summary: We develop an efficient sketchy empirical natural gradient method (SENG) for large-scale deep learning problems.
A distributed version of SENG is also developed for extremely large-scale applications.
On ResNet-50 with ImageNet-1k, SENG achieves 75.9% Top-1 testing accuracy within 41 epochs.
- Score: 20.517823521066234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we develop an efficient sketchy empirical natural gradient
method (SENG) for large-scale deep learning problems. The empirical Fisher
information matrix is usually low-rank since the sampling is only practical on
a small amount of data at each iteration. Although the corresponding natural
gradient direction lies in a small subspace, both the computational cost and
memory requirement are still not tractable due to the high dimensionality. We
design randomized techniques for different neural network structures to resolve
these challenges. For layers with a reasonable dimension, sketching can be
performed on a regularized least squares subproblem. Otherwise, since the
gradient is a vectorization of the product between two matrices, we apply
sketching on the low-rank approximations of these matrices to compute the most
expensive parts. A distributed version of SENG is also developed for extremely
large-scale applications. Global convergence to stationary points is
established under some mild assumptions and a fast linear convergence is
analyzed under the neural tangent kernel (NTK) case. Extensive experiments on
convolutional neural networks show the competitiveness of SENG compared with
the state-of-the-art methods. On ResNet-50 with ImageNet-1k, SENG achieves
75.9% Top-1 testing accuracy within 41 epochs. Experiments on the
distributed large-batch training show that the scaling efficiency is quite
reasonable.
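To make the low-rank structure concrete, the sketch below illustrates one SENG-style update in NumPy: the empirical Fisher is F = GᵀG/m for a per-sample gradient matrix G (m samples, n parameters, m ≪ n), the regularized system (F + λI)d = g is solved through an m×m Gram matrix via the Sherman-Morrison-Woodbury identity, and an optional Gaussian sketch reduces the cost of forming that Gram matrix. The function and variable names, the flattened-parameter view, and the Gaussian sketching operator are illustrative assumptions; the paper's per-layer rules and distributed variant are not reproduced here.

```python
# Minimal, illustrative sketch of a SENG-style natural gradient step (NumPy).
# Assumptions (not from the paper): a single flattened parameter vector,
# a Gaussian sketching matrix, and the hypothetical names below.
import numpy as np


def seng_direction(G, g, lam=1e-2, sketch_dim=None, seed=0):
    """Approximate d solving (F + lam*I) d = g, where F = G.T @ G / m is the
    low-rank empirical Fisher built from per-sample gradients G (m x n, m << n)."""
    m, n = G.shape
    rng = np.random.default_rng(seed)
    if sketch_dim is not None and sketch_dim < n:
        # Sketch the columns of G to cheapen the m x m Gram matrix
        # (illustrative choice; SENG tailors the sketch to the layer type).
        S = rng.standard_normal((n, sketch_dim)) / np.sqrt(sketch_dim)
        Gs = G @ S
        gram = Gs @ Gs.T / m            # approximates G @ G.T / m
    else:
        gram = G @ G.T / m
    # Sherman-Morrison-Woodbury: solve an m x m system instead of an n x n one.
    coeff = np.linalg.solve(gram + lam * np.eye(m), G @ g / m)
    return (g - G.T @ coeff) / lam      # natural gradient direction


# Toy usage: 32 per-sample gradients of a 10k-dimensional parameter vector.
G = np.random.randn(32, 10_000)
g = G.mean(axis=0)                      # mini-batch gradient
d = seng_direction(G, g, lam=1e-2, sketch_dim=256)
```

The point of the Woodbury step is that only an m×m linear system is solved, which is what keeps the natural gradient direction affordable when m is the mini-batch size and n is the number of parameters.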
Related papers
- Convergence of Implicit Gradient Descent for Training Two-Layer Physics-Informed Neural Networks [3.680127959836384]
Implicit gradient descent (IGD) outperforms common gradient descent (GD) in handling certain multi-scale problems.
We show that IGD converges to a globally optimal solution at a linear convergence rate.
arXiv Detail & Related papers (2024-07-03T06:10:41Z)
- NeuralGF: Unsupervised Point Normal Estimation by Learning Neural Gradient Function [55.86697795177619]
Normal estimation for 3D point clouds is a fundamental task in 3D geometry processing.
We introduce a new paradigm for learning neural gradient functions, which encourages the neural network to fit the input point clouds.
Our excellent results on widely used benchmarks demonstrate that our method can learn more accurate normals for both unoriented and oriented normal estimation tasks.
arXiv Detail & Related papers (2023-11-01T09:25:29Z)
- Neural Gradient Learning and Optimization for Oriented Point Normal Estimation [53.611206368815125]
We propose a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation.
We learn an angular distance field based on local plane geometry to refine the coarse gradient vectors.
Our method efficiently conducts global gradient approximation while achieving better accuracy and generalization ability of local feature description.
arXiv Detail & Related papers (2023-09-17T08:35:11Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Error-Correcting Neural Networks for Two-Dimensional Curvature Computation in the Level-Set Method [0.0]
We present an error-neural-modeling-based strategy for approximating two-dimensional curvature in the level-set method.
Our main contribution is a redesigned hybrid solver that relies on numerical schemes to enable machine-learning operations on demand.
arXiv Detail & Related papers (2022-01-22T05:14:40Z)
- DiGS: Divergence guided shape implicit neural representation for unoriented point clouds [36.60407995156801]
Shape implicit neural representations (INRs) have recently shown to be effective in shape analysis and reconstruction tasks.
We propose a divergence guided shape representation learning approach that does not require normal vectors as input.
arXiv Detail & Related papers (2021-06-21T02:10:03Z)
- Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature map construction for the Neural Tangent Kernel (NTK) of a fully-connected ReLU network.
We show that the dimension of the resulting features is much smaller than that of other baseline feature map constructions while achieving comparable error bounds both in theory and practice.
arXiv Detail & Related papers (2021-04-03T09:08:12Z)
- Learning Rates as a Function of Batch Size: A Random Matrix Theory Approach to Neural Network Training [2.9649783577150837]
We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory.
We derive analytical expressions for the maximal learning rates, informing both gradient descent and adaptive training regimens for smooth, non-convex deep neural networks.
We validate our claims on VGG and ResNet architectures trained on the ImageNet dataset.
arXiv Detail & Related papers (2020-06-16T11:55:45Z)
- Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality Regularization and Singular Value Sparsification [53.50708351813565]
We propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD on every step.
We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve a greater reduction in computational load under the same accuracy.
arXiv Detail & Related papers (2020-04-20T02:40:43Z)
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper, we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that adaptive gradient algorithms outperform their non-adaptive counterparts in GAN training, as observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.