Scalable Second Order Optimization for Deep Learning
- URL: http://arxiv.org/abs/2002.09018v2
- Date: Fri, 5 Mar 2021 06:29:48 GMT
- Title: Scalable Second Order Optimization for Deep Learning
- Authors: Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan and Yoram Singer
- Abstract summary: We present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad).
Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units.
We demonstrate superior performance compared to state-of-the-art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.
- Score: 34.12384996822749
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optimization in machine learning, both theoretical and applied, is presently
dominated by first-order gradient methods such as stochastic gradient descent.
Second-order optimization methods, that involve second derivatives and/or
second order statistics of the data, are far less prevalent despite strong
theoretical properties, due to their prohibitive computation, memory and
communication costs. In an attempt to bridge this gap between theoretical and
practical optimization, we present a scalable implementation of a second-order
preconditioned method (concretely, a variant of full-matrix Adagrad), that
along with several critical algorithmic and numerical improvements, provides
significant convergence and wall-clock time improvements compared to
conventional first-order methods on state-of-the-art deep models. Our novel
design effectively utilizes the prevalent heterogeneous hardware architecture
for training deep models, consisting of a multicore CPU coupled with multiple
accelerator units. We demonstrate superior performance compared to
state-of-the-art on very large learning tasks such as machine translation with
Transformers, language modeling with BERT, click-through rate prediction on
Criteo, and image classification on ImageNet with ResNet-50.
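For orientation, the preconditioner family named in the abstract can be illustrated with textbook full-matrix Adagrad. The sketch below is not the authors' scalable variant: it keeps one dense statistics matrix for a flat parameter vector, which is exactly the cost that becomes prohibitive at scale. The function name, the toy quadratic, and the values of `lr` and `eps` are assumptions chosen for illustration.

```python
import numpy as np

def full_matrix_adagrad_step(w, grad, G, lr=0.1, eps=1e-6):
    """One textbook full-matrix Adagrad step on a flat parameter vector."""
    # Accumulate second-order statistics: the sum of gradient outer products.
    G = G + np.outer(grad, grad)
    # Inverse square root of the regularized statistics via eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(G + eps * np.eye(G.shape[0]))
    inv_sqrt = (eigvecs / np.sqrt(eigvals)) @ eigvecs.T
    # Precondition the gradient before applying the update.
    return w - lr * (inv_sqrt @ grad), G

# Hypothetical toy problem: ill-conditioned quadratic f(w) = 0.5 * w^T A w.
A = np.diag([100.0, 1.0])

def loss(w):
    return 0.5 * w @ A @ w

w = np.array([1.0, 1.0])
G = np.zeros((2, 2))
initial_loss = loss(w)
for _ in range(200):
    w, G = full_matrix_adagrad_step(w, A @ w, G)
final_loss = loss(w)
```

Storing `G` takes memory quadratic in the number of parameters and the eigendecomposition takes cubic time, which is why a direct implementation like this one does not extend to deep models without further approximation.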
Related papers
- Towards Differentiable Multilevel Optimization: A Gradient-Based Approach [1.6114012813668932]
This paper introduces a novel gradient-based approach for multilevel optimization.
Our method significantly reduces computational complexity while improving both solution accuracy and convergence speed.
To the best of our knowledge, this is one of the first algorithms to provide a general version of implicit differentiation.
arXiv Detail & Related papers (2024-10-15T06:17:59Z) - Efficient Second-Order Neural Network Optimization via Adaptive Trust Region Methods [0.0]
SecondOrderAdaptive (SOAA) is a novel optimization algorithm designed to overcome limitations of traditional second-order techniques.
We empirically demonstrate that SOAA achieves faster and more stable convergence compared to first-order approximations.
arXiv Detail & Related papers (2024-10-03T08:23:06Z) - Improving Depression estimation from facial videos with face alignment,
training optimization and scheduling [0.3441021278275805]
We propose two models based on ResNet-50 that use only static spatial information by using two specific face alignment methods.
Our experiments on benchmark datasets obtain results similar to those of sophisticated spatio-temporal models for single streams, while the score-level fusion of two different streams outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-12-13T06:46:38Z) - A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive
Coding Networks [65.34977803841007]
Predictive coding networks are neuroscience-inspired models with roots in both Bayesian statistics and neuroscience.
We show that simply changing the temporal scheduling of the update rule for the synaptic weights leads to an algorithm that is much more efficient and stable than the original one.
arXiv Detail & Related papers (2022-11-16T00:11:04Z) - Towards Theoretically Inspired Neural Initialization Optimization [66.04735385415427]
We propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network.
We show that both the training and test performance of a network can be improved by maximizing GradCosine under norm constraint.
Generalized from the sample-wise analysis to the real batch setting, Neural Initialization Optimization (NIO) can automatically find a better initialization at negligible cost.
arXiv Detail & Related papers (2022-10-12T06:49:16Z) - Joint inference and input optimization in equilibrium networks [68.63726855991052]
Deep equilibrium models are a class of models that forgo traditional network depth and instead compute the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
arXiv Detail & Related papers (2021-11-25T19:59:33Z) - Second-Order Neural ODE Optimizer [11.92713188431164]
We show that a specific continuous-time optimal control (OC) methodology, called Differential Programming, can be adopted to derive backward ODEs for higher-order derivatives at the same O(1) memory cost.
The resulting method converges much faster than first-order baselines in wall-clock time.
Our framework also enables direct architecture optimization, such as the integration time of Neural ODEs, with second-order feedback policies.
arXiv Detail & Related papers (2021-09-29T02:58:18Z) - SHINE: SHaring the INverse Estimate from the forward pass for bi-level
optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
arXiv Detail & Related papers (2021-06-01T15:07:34Z) - Bilevel Optimization: Convergence Analysis and Enhanced Design [63.64636047748605]
Bilevel optimization is a tool for many machine learning problems.
We propose a novel stochastic algorithm named stocBiO, which features a sample-efficient hypergradient estimator.
arXiv Detail & Related papers (2020-10-15T18:09:48Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model.
Our algorithm requires far fewer communication rounds in theory.
Our experiments on several datasets show the effectiveness of the algorithm and confirm the theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z) - Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)