Second-Order Neural ODE Optimizer
- URL: http://arxiv.org/abs/2109.14158v1
- Date: Wed, 29 Sep 2021 02:58:18 GMT
- Title: Second-Order Neural ODE Optimizer
- Authors: Guan-Horng Liu, Tianrong Chen, Evangelos A. Theodorou
- Abstract summary: We show that a specific continuous-time OC methodology, called Differential Programming, can be adopted to derive backward ODEs for higher-order derivatives at the same O(1) memory cost.
The resulting method converges much faster than first-order baselines in wall-clock time.
Our framework also enables direct architecture optimization, such as the integration time of Neural ODEs, with second-order feedback policies.
- Score: 11.92713188431164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel second-order optimization framework for training the
emerging deep continuous-time models, specifically the Neural Ordinary
Differential Equations (Neural ODEs). Since their training already involves
expensive gradient computation by solving a backward ODE, deriving efficient
second-order methods becomes highly nontrivial. Nevertheless, inspired by the
recent Optimal Control (OC) interpretation of training deep networks, we show
that a specific continuous-time OC methodology, called Differential
Programming, can be adopted to derive backward ODEs for higher-order
derivatives at the same O(1) memory cost. We further explore a low-rank
representation of the second-order derivatives and show that it leads to
efficient preconditioned updates with the aid of Kronecker-based factorization.
The resulting method converges much faster than first-order baselines in
wall-clock time, and the improvement remains consistent across various
applications, e.g. image classification, generative flow, and time-series
prediction. Our framework also enables direct architecture optimization, such
as the integration time of Neural ODEs, with second-order feedback policies,
strengthening the OC perspective as a principled tool of analyzing optimization
in deep learning.
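For context on the backward-ODE machinery the abstract refers to, below is a minimal sketch of first-order Neural ODE training with the standard adjoint method, which already recovers gradients by solving a backward ODE at O(1) memory in the number of solver steps. It assumes the third-party torch and torchdiffeq packages are available; the vector field, data, and loss are illustrative placeholders, and the paper's actual contribution (backward ODEs for second-order derivatives plus Kronecker-factored preconditioning) is not reproduced here.

```python
# Minimal sketch: first-order Neural ODE training via the adjoint method,
# i.e. gradients are obtained by solving a backward ODE at O(1) memory in
# the number of integration steps.  The paper extends this idea to backward
# ODEs for second-order derivatives; that extension is NOT shown here.
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint  # assumes torchdiffeq is installed

class ODEFunc(nn.Module):
    """Vector field f(z, t; theta) defining dz/dt."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, t, z):
        return self.net(z)

func = ODEFunc(dim=2)
opt = torch.optim.SGD(func.parameters(), lr=1e-2)

z0 = torch.randn(128, 2)                 # a batch of initial states (toy data)
t_span = torch.tensor([0.0, 1.0])        # integration interval [t0, t1]

for step in range(10):
    zT = odeint(func, z0, t_span)[-1]    # forward ODE solve, keep the state at t1
    loss = zT.pow(2).mean()              # placeholder loss
    opt.zero_grad()
    loss.backward()                      # backward (adjoint) ODE solve for dL/dtheta
    opt.step()
```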
Related papers
- Fast Two-Time-Scale Stochastic Gradient Method with Applications in Reinforcement Learning [5.325297567945828]
We propose a new method for two-time-scale optimization that achieves significantly faster convergence than prior art.
We characterize the proposed algorithm under various conditions and show how it specializes to online sample-based methods.
arXiv Detail & Related papers (2024-05-15T19:03:08Z)
- Tensor-Valued Time and Inference Path Optimization in Differential Equation-Based Generative Modeling [16.874769609089764]
This work introduces, for the first time, a tensor-valued time that expands the conventional scalar-valued time into multiple dimensions.
We also propose a novel path optimization problem designed to adaptively determine multidimensional inference trajectories.
arXiv Detail & Related papers (2024-04-22T13:20:01Z)
- Multiplicative update rules for accelerating deep learning training and increasing robustness [69.90473612073767]
We propose an optimization framework that fits a wide range of machine learning algorithms and enables one to apply alternative update rules.
We claim that the proposed framework accelerates training while leading to more robust models than the traditionally used additive update rules.
arXiv Detail & Related papers (2023-07-14T06:44:43Z)
- Towards Theoretically Inspired Neural Initialization Optimization [66.04735385415427]
We propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network.
We show that both the training and test performance of a network can be improved by maximizing GradCosine under a norm constraint.
Generalized from the sample-wise analysis to the real batch setting, the resulting Neural Initialization Optimization (NIO) algorithm automatically finds a better initialization at negligible cost.
arXiv Detail & Related papers (2022-10-12T06:49:16Z)
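To make the quantity above concrete, here is a toy sketch that computes a GradCosine-like score, taken here (as an assumption, not the paper's exact definition) to be the average pairwise cosine similarity between per-sample gradients; the model is plain linear regression so those gradients have a closed form.

```python
# Sketch: average pairwise cosine similarity between per-sample gradients,
# in the spirit of the GradCosine quantity described above.
import numpy as np

rng = np.random.default_rng(0)
n_samples, dim = 16, 10
X = rng.standard_normal((n_samples, dim))
y = rng.standard_normal(n_samples)
w = rng.standard_normal(dim)              # candidate initialization

residual = X @ w - y                      # per-sample prediction error
per_sample_grads = residual[:, None] * X  # grad of 0.5*(x_i.w - y_i)^2 w.r.t. w

# Normalize each gradient, then average all pairwise cosine similarities.
norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
unit = per_sample_grads / np.maximum(norms, 1e-12)
cos_matrix = unit @ unit.T
off_diag = cos_matrix[~np.eye(n_samples, dtype=bool)]
grad_cosine = off_diag.mean()
print(f"GradCosine-style score of this initialization: {grad_cosine:.3f}")
```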
- First-Order Optimization Inspired from Finite-Time Convergent Flows [26.931390502212825]
We propose an Euler discretization for first-order finite-time flows and provide convergence guarantees in both the deterministic and the stochastic settings.
We then apply the proposed algorithms to academic examples as well as deep neural network training, where we empirically test their performance on the SVHN dataset.
Our results show that these schemes converge faster than standard optimization alternatives.
arXiv Detail & Related papers (2020-10-06T19:28:00Z)
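As an illustration of the Euler-discretization idea above, the sketch below discretizes one classical finite-time convergent flow, the normalized gradient flow dx/dt = -grad f(x) / ||grad f(x)||, on a toy quadratic; the specific flows, step-size rules, and stochastic analysis of the paper are not reproduced.

```python
# Sketch: forward-Euler discretization of a normalized gradient flow,
# a standard example of a finite-time convergent flow on strongly convex
# objectives.  The objective and step size are illustrative choices only.
import numpy as np

rng = np.random.default_rng(0)
dim = 5
A = np.diag(rng.uniform(1.0, 4.0, size=dim))     # strongly convex quadratic f(x) = 0.5 x^T A x
x = rng.standard_normal(dim)

h = 1e-2                                         # Euler step size
steps = 500
for _ in range(steps):
    g = A @ x                                    # gradient of f
    x = x - h * g / (np.linalg.norm(g) + 1e-12)  # Euler step on dx/dt = -grad f / ||grad f||

# The continuous flow reaches the minimizer in finite time; the fixed-step
# discretization lands in an O(h) neighborhood of it after finitely many steps.
print(f"distance to minimizer after {steps} steps: {np.linalg.norm(x):.3f}")
```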
- An Ode to an ODE [78.97367880223254]
We present a new paradigm for Neural ODE algorithms, called ODEtoODE, where the time-dependent parameters of the main flow evolve according to a matrix flow on the orthogonal group O(d).
This nested system of two flows improves the stability and effectiveness of training and provably solves the gradient vanishing/explosion problem.
arXiv Detail & Related papers (2020-06-19T22:05:19Z)
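To give a concrete picture of a matrix flow on O(d), the sketch below evolves a matrix by repeatedly multiplying with exponentials of skew-symmetric matrices, which keeps it orthogonal by construction; the drift term is a random placeholder and the actual ODEtoODE coupling is not reproduced. It assumes numpy and scipy are available.

```python
# Sketch: evolving a parameter matrix on the orthogonal group O(d).
# Each step multiplies by the exponential of a skew-symmetric matrix,
# which is itself orthogonal, so W stays orthogonal up to round-off.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d = 6
W = np.linalg.qr(rng.standard_normal((d, d)))[0]   # start from a random orthogonal matrix

h = 0.05                                           # flow step size
for _ in range(100):
    M = rng.standard_normal((d, d))                # placeholder "drift" of the matrix flow
    S = 0.5 * (M - M.T)                            # project onto skew-symmetric matrices
    W = W @ expm(h * S)                            # update stays on O(d)

# ||W^T W - I|| should remain at machine-precision level.
print(np.linalg.norm(W.T @ W - np.eye(d)))
```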
- Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under a sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z)
- On Second Order Behaviour in Augmented Neural ODEs [69.8070643951126]
We consider Second Order Neural ODEs (SONODEs) and show how the adjoint sensitivity method can be extended to them.
We also extend the theoretical understanding of the broader class of Augmented NODEs (ANODEs).
arXiv Detail & Related papers (2020-06-12T14:25:31Z)
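The standard reduction behind second-order Neural ODEs is to rewrite z'' = f(z, z') as a first-order system on the augmented state (z, v) with v = z'. The sketch below does exactly that and differentiates through it with the adjoint method, assuming torch and torchdiffeq are available; the network, data, and loss are placeholders, not the SONODE architecture from the paper.

```python
# Sketch: a second-order Neural ODE z'' = f(z, z') written as a first-order
# system on the augmented state (z, v), solvable by an adjoint-capable solver.
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint  # assumes torchdiffeq is installed

class SecondOrderFunc(nn.Module):
    def __init__(self, dim, hidden=32):
        super().__init__()
        self.accel = nn.Sequential(nn.Linear(2 * dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))
        self.dim = dim

    def forward(self, t, state):
        z, v = state[..., :self.dim], state[..., self.dim:]
        dz = v                                       # z' = v
        dv = self.accel(torch.cat([z, v], dim=-1))   # v' = f(z, v)
        return torch.cat([dz, dv], dim=-1)

dim = 2
func = SecondOrderFunc(dim)
z0 = torch.randn(64, dim)
v0 = torch.zeros(64, dim)                            # initial velocity
state0 = torch.cat([z0, v0], dim=-1)
t_span = torch.tensor([0.0, 1.0])

zT = odeint(func, state0, t_span)[-1][..., :dim]     # position component at t1
zT.pow(2).mean().backward()                          # gradients via the adjoint ODE
```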
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study a distributed algorithm for large-scale AUC maximization with a deep neural network as the predictive model.
Our method requires far fewer communication rounds while still enjoying strong theoretical guarantees.
Our experiments on several datasets demonstrate the effectiveness of the method and corroborate the theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
- Scalable Second Order Optimization for Deep Learning [34.12384996822749]
We present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad).
Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units.
We demonstrate superior performance compared to state-of-the-art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.
arXiv Detail & Related papers (2020-02-20T20:51:33Z)
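For a concrete sense of what a Kronecker-factored, full-matrix-Adagrad-style preconditioner looks like, here is a sketch in the spirit of such methods applied to a single matrix-shaped parameter on a toy objective; the hyperparameters, damping, and objective are illustrative assumptions, and this is not the cited implementation.

```python
# Sketch: a Kronecker-factored second-moment preconditioner for a matrix
# parameter W.  Left/right statistics L and R are accumulated from gradients,
# and the step is preconditioned by their inverse fourth roots.
import numpy as np

def inv_fourth_root(M, eps=1e-6):
    """M^{-1/4} for a symmetric PSD matrix, via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    w = np.maximum(w, eps)
    return (V * w ** -0.25) @ V.T

rng = np.random.default_rng(0)
m, n = 8, 5
W = rng.standard_normal((m, n))          # parameter matrix
target = rng.standard_normal((m, n))     # toy regression target

L = 1e-4 * np.eye(m)                     # left Kronecker statistic
R = 1e-4 * np.eye(n)                     # right Kronecker statistic
lr = 0.5

for step in range(100):
    G = W - target                       # gradient of 0.5 * ||W - target||_F^2
    L += G @ G.T                         # accumulate second-moment statistics
    R += G.T @ G
    W -= lr * inv_fourth_root(L) @ G @ inv_fourth_root(R)   # preconditioned step
```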
- DDPNOpt: Differential Dynamic Programming Neural Optimizer [29.82841891919951]
We show that most widely used algorithms for training deep networks can be linked to Differential Dynamic Programming (DDP).
In this vein, we propose DDPNOpt, a new class of optimizer for training feedforward and convolutional networks.
arXiv Detail & Related papers (2020-02-20T15:42:15Z)
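Both the DDP-based optimizer above and the second-order feedback policies mentioned in the main abstract rest on a dynamic-programming backward pass. The sketch below shows the simplest instance, a finite-horizon LQR Riccati recursion that returns time-varying feedback gains; the dynamics and cost matrices are random placeholders, and none of this is the cited papers' code.

```python
# Sketch: the backward pass that DDP-style optimizers build on, here in its
# linear-quadratic (LQR) form.  The Riccati recursion produces time-varying
# feedback gains K_t such that u_t = -K_t x_t.
import numpy as np

rng = np.random.default_rng(1)
n, m, T = 4, 2, 20                                    # state dim, control dim, horizon

A = np.eye(n) + 0.05 * rng.standard_normal((n, n))    # linear(ized) dynamics
B = 0.1 * rng.standard_normal((n, m))
Q = np.eye(n)                                         # state cost
R = 0.1 * np.eye(m)                                   # control cost

P = Q.copy()                                          # value-function Hessian at the horizon
gains = []
for _ in range(T):                                    # sweep backward in time
    S = R + B.T @ P @ B                               # local control Hessian
    K = np.linalg.solve(S, B.T @ P @ A)               # feedback gain for this stage
    P = Q + A.T @ P @ (A - B @ K)                     # Riccati update of the value Hessian
    gains.append(K)
gains.reverse()                                       # gains[t] applies at time t
```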