Orthogonalising gradients to speed up neural network optimisation
- URL: http://arxiv.org/abs/2202.07052v1
- Date: Mon, 14 Feb 2022 21:46:07 GMT
- Title: Orthogonalising gradients to speed up neural network optimisation
- Authors: Mark Tuddenham, Adam Prügel-Bennett, Jonathan Hare
- Abstract summary: The optimisation of neural networks can be sped up by orthogonalising the gradients before the optimisation step, ensuring the diversification of the learned representations.
We tested this method on ImageNet and CIFAR-10, obtaining a large decrease in learning time, and also obtained a speed-up on the semi-supervised learning method BarlowTwins.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The optimisation of neural networks can be sped up by orthogonalising the
gradients before the optimisation step, ensuring the diversification of the
learned representations. We orthogonalise the gradients of the layer's
components/filters with respect to each other to separate out the intermediate
representations. Our method of orthogonalisation allows the weights to be used
more flexibly, in contrast to restricting the weights to an orthogonalised
sub-space. We tested this method on ImageNet and CIFAR-10 resulting in a large
decrease in learning time, and also obtain a speed-up on the semi-supervised
learning BarlowTwins. We obtain similar accuracy to SGD without fine-tuning and
better accuracy for naïvely chosen hyper-parameters.
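The idea in the abstract can be illustrated with a minimal sketch. It assumes the gradient of a layer is reshaped so that each filter/component is one row, and that "orthogonalising" means replacing that matrix by its nearest (semi-)orthogonal matrix, computed here from a singular value decomposition. The function names `orthogonalise_gradient` and `sgd_step`, the learning rate, and the tensor shapes are hypothetical and chosen only for illustration; this is not the authors' released code.

```python
import numpy as np


def orthogonalise_gradient(grad: np.ndarray) -> np.ndarray:
    """Replace a layer's gradient by the nearest (semi-)orthogonal matrix.

    Assumption: the first axis indexes the layer's filters/components.
    """
    # Flatten every filter into one row: shape (n_filters, fan_in).
    g = grad.reshape(grad.shape[0], -1)
    # Thin SVD: g = u @ diag(s) @ vt. Dropping s sets all singular values to 1.
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return (u @ vt).reshape(grad.shape)


def sgd_step(weights: np.ndarray, grad: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """Plain SGD update that uses the orthogonalised gradient (hypothetical helper)."""
    return weights - lr * orthogonalise_gradient(grad)


# Toy example: 4 convolutional filters of shape 3x3x3 with random gradients.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3, 3, 3))
g = rng.normal(size=w.shape)

o = orthogonalise_gradient(g).reshape(4, -1)
gram = o @ o.T  # inner products between the per-filter gradient rows
w_new = sgd_step(w, g)
```

In this sketch only the update direction is orthogonalised; the weights themselves are left unconstrained, which matches the abstract's contrast with methods that restrict the weights to an orthogonal sub-space.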
Related papers
- Gradient-Variation Online Learning under Generalized Smoothness [56.38427425920781]
Gradient-variation online learning aims to achieve regret guarantees that scale with variations in gradients of online functions.
Recent efforts in neural network optimization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms.
We provide the applications for fast-rate convergence in games and extended adversarial optimization.
arXiv Detail & Related papers (2024-08-17T02:22:08Z)
- Approximation and Gradient Descent Training with Neural Networks [0.0]
Recent work extends a neural tangent kernel (NTK) optimization argument to an under-parametrized regime.
This paper establishes analogous results for networks trained by gradient descent.
arXiv Detail & Related papers (2024-05-19T23:04:09Z)
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T) / T^{1 - \frac{1}{\alpha}})$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z)
- Layer-wise Adaptive Step-Sizes for Stochastic First-Order Methods for Deep Learning [8.173034693197351]
We propose a new per-layer adaptive step-size procedure for first-order optimization methods in deep learning.
The proposed approach exploits the layer-wise curvature information contained in the diagonal blocks of the Hessian in deep neural networks (DNNs) to compute adaptive step-sizes (i.e., LRs) for each layer.
Numerical experiments show that SGD with momentum and AdamW combined with the proposed per-layer step-sizes are able to choose effective LR schedules.
arXiv Detail & Related papers (2023-05-23T04:12:55Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
However, PINNs can fail to train when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose employing the implicit stochastic gradient descent (ISGD) method to train PINNs in order to improve the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- POGD: Gradient Descent with New Stochastic Rules [0.0]
The experiments in this paper mainly focus on the training speed needed to reach the target value and on the ability to avoid local minima.
The experiments use convolutional neural network (CNN) image classification on the MNIST and CIFAR-10 datasets.
arXiv Detail & Related papers (2022-10-15T12:31:02Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Improved Analysis of Clipping Algorithms for Non-convex Optimization [19.507750439784605]
Recently, Zhang et al. (2019) showed that clipped (stochastic) Gradient Descent (GD) converges faster than vanilla GD/SGD.
Experiments confirm the superiority of clipping-based methods in deep learning tasks.
arXiv Detail & Related papers (2020-10-05T14:36:59Z)
- Channel-Directed Gradients for Optimization of Convolutional Neural Networks [50.34913837546743]
We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error.
We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental.
arXiv Detail & Related papers (2020-08-25T00:44:09Z)
- Controllable Orthogonalization in Training DNNs [96.1365404059924]
Orthogonality is widely used for training deep neural networks (DNNs) due to its ability to maintain all singular values of the Jacobian close to 1.
This paper proposes a computationally efficient and numerically stable orthogonalization method using Newton's iteration (ONI).
We show that our method improves the performance of image classification networks by effectively controlling the orthogonality to provide an optimal tradeoff between optimization benefits and representational capacity reduction.
We also show that ONI stabilizes the training of generative adversarial networks (GANs) by maintaining the Lipschitz continuity of a network, similar to spectral normalization.
arXiv Detail & Related papers (2020-04-02T10:14:27Z)
- Stochastic gradient descent with random learning rate [0.0]
We propose to optimize neural networks with a uniformly-distributed random learning rate.
By comparing the random learning rate protocol with cyclic and constant protocols, we suggest that the random choice is generically the best strategy in the small learning rate regime.
We provide supporting evidence through experiments on both shallow, fully-connected and deep, convolutional neural networks for image classification on the MNIST and CIFAR10 datasets.
arXiv Detail & Related papers (2020-03-15T21:36:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.