Input Normalized Stochastic Gradient Descent Training of Deep Neural
Networks
- URL: http://arxiv.org/abs/2212.09921v2
- Date: Mon, 26 Jun 2023 19:43:31 GMT
- Title: Input Normalized Stochastic Gradient Descent Training of Deep Neural
Networks
- Authors: Salih Atici, Hongyi Pan, Ahmet Enis Cetin
- Abstract summary: In this paper, we propose a novel optimization algorithm for training machine learning models called Input Normalized Stochastic Gradient Descent (INSGD).
Our algorithm updates the network weights using stochastic gradient descent with $\ell_1$ and $\ell_2$-based normalizations applied to the learning rate, similar to NLMS.
We evaluate the efficiency of our training algorithm on benchmark datasets using ResNet-18, WResNet-20, ResNet-50, and a toy neural network.
- Score: 2.1485350418225244
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a novel optimization algorithm for training machine
learning models called Input Normalized Stochastic Gradient Descent (INSGD),
inspired by the Normalized Least Mean Squares (NLMS) algorithm used in adaptive
filtering. When training complex models on large datasets, the choice of
optimizer parameters, particularly the learning rate, is crucial to avoid
divergence. Our algorithm updates the network weights using stochastic gradient
descent with $\ell_1$ and $\ell_2$-based normalizations applied to the learning
rate, similar to NLMS. However, unlike existing normalization methods, we
exclude the error term from the normalization process and instead normalize the
update term using the input vector to the neuron. Our experiments demonstrate
that our optimization algorithm achieves higher accuracy levels compared to
different initialization settings. We evaluate the efficiency of our training
algorithm on benchmark datasets using ResNet-18, WResNet-20, ResNet-50, and a
toy neural network. Our INSGD algorithm improves the accuracy of ResNet-18 on
CIFAR-10 from 92.42\% to 92.71\%, WResNet-20 on CIFAR-100 from 76.20\% to
77.39\%, and ResNet-50 on ImageNet-1K from 75.52\% to 75.67\%.
Related papers
- Towards Generalized Entropic Sparsification for Convolutional Neural Networks [0.0]
Convolutional neural networks (CNNs) are reported to be overparametrized.
Here, we introduce a layer-by-layer, data-driven pruning method based on a computationally scalable entropic relaxation of the pruning problem.
The sparse subnetwork is found from the pre-trained (full) CNN using the network entropy minimization as a sparsity constraint.
arXiv Detail & Related papers (2024-04-06T21:33:39Z) - Genetically Modified Wolf Optimization with Stochastic Gradient Descent
for Optimising Deep Neural Networks [0.0]
This research aims to analyze an alternative approach to optimizing neural network (NN) weights, with the use of population-based metaheuristic algorithms.
A hybrid between Grey Wolf Optimization (GWO) and Genetic Algorithms (GA) is explored, in conjunction with Stochastic Gradient Descent (SGD).
This algorithm allows for a combination of exploitation and exploration, whilst also tackling the issue of high dimensionality.
arXiv Detail & Related papers (2023-01-21T13:22:09Z) - Towards Theoretically Inspired Neural Initialization Optimization [66.04735385415427]
We propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network.
We show that both the training and test performance of a network can be improved by maximizing GradCosine under norm constraint.
Generalized from the sample-wise analysis to the real batch setting, the proposed Neural Initialization Optimization (NIO) algorithm is able to automatically look for a better initialization with negligible cost.
arXiv Detail & Related papers (2022-10-12T06:49:16Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Exploiting Adam-like Optimization Algorithms to Improve the Performance
of Convolutional Neural Networks [82.61182037130405]
Stochastic gradient descent (SGD) is the main approach for training deep networks.
In this work, we compare Adam-based variants that exploit the difference between the present and the past gradients.
We have tested ensembles of networks and their fusion with ResNet50 trained with SGD.
arXiv Detail & Related papers (2021-03-26T18:55:08Z) - GradInit: Learning to Initialize Neural Networks for Stable and
Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the norm of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z) - Non-Parametric Adaptive Network Pruning [125.4414216272874]
We introduce non-parametric modeling to simplify the algorithm design.
Inspired by the face recognition community, we use a message passing algorithm to obtain an adaptive number of exemplars.
EPruner breaks the dependency on the training data in determining the "important" filters.
arXiv Detail & Related papers (2021-01-20T06:18:38Z) - Training Sparse Neural Networks using Compressed Sensing [13.84396596420605]
We develop and test a novel method based on compressed sensing which combines the pruning and training into a single step.
Specifically, we utilize an adaptively weighted $\ell_1$ penalty on the weights during training, which we combine with a generalization of the regularized dual averaging (RDA) algorithm in order to train sparse neural networks.
arXiv Detail & Related papers (2020-08-21T19:35:54Z) - Filter Sketch for Network Pruning [184.41079868885265]
We propose a novel network pruning approach that preserves the information of pre-trained network weights (filters).
Our approach, referred to as FilterSketch, encodes the second-order information of pre-trained weights.
Experiments on CIFAR-10 show that FilterSketch reduces 63.3% of FLOPs and prunes 59.9% of network parameters with negligible accuracy cost.
arXiv Detail & Related papers (2020-01-23T13:57:08Z)
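Among the related papers above, the "Scaling Forward Gradient With Local Losses" entry describes estimating gradients by perturbing activations rather than weights. The sketch below shows the basic activity-perturbation forward-gradient estimator; a finite-difference directional derivative stands in for the forward-mode JVP used in that paper, and the function name and single-layer setup are assumptions for illustration.

```python
import numpy as np

def forward_gradient_estimate(loss_fn, z, sigma=1e-4):
    """Activity-perturbation forward-gradient estimate (illustrative sketch).

    loss_fn : maps an activation vector z to a scalar loss
    z       : current activations of one layer
    Returns g_hat = (dL/dz . v) * v for a single random direction v, an
    estimate of dL/dz that is unbiased in expectation over v when the exact
    directional derivative is used.
    """
    v = np.random.randn(*z.shape)                                # random direction
    directional = (loss_fn(z + sigma * v) - loss_fn(z)) / sigma  # ~= dL/dz . v
    return directional * v                                       # forward gradient
```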