Softmax Gradient Tampering: Decoupling the Backward Pass for Improved
Fitting
- URL: http://arxiv.org/abs/2111.12495v1
- Date: Wed, 24 Nov 2021 13:47:36 GMT
- Title: Softmax Gradient Tampering: Decoupling the Backward Pass for Improved
Fitting
- Authors: Bishshoy Das, Milton Mondal, Brejesh Lall, Shiv Dutt Joshi, Sumantra
Dutta Roy
- Abstract summary: We introduce Softmax Gradient Tampering, a technique for modifying the gradients in the backward pass of neural networks.
We demonstrate that modifying the softmax gradients in ConvNets may result in increased training accuracy.
- Score: 8.072117741487046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Softmax Gradient Tampering, a technique for modifying the
gradients in the backward pass of neural networks in order to enhance their
accuracy. Our approach transforms the predicted probability values using a
power-based probability transformation and then recomputes the gradients in the
backward pass. This modification results in a smoother gradient profile, which
we demonstrate empirically and theoretically. We do a grid search for the
transform parameters on residual networks. We demonstrate that modifying the
softmax gradients in ConvNets may result in increased training accuracy, thus
increasing the fit across the training data and maximally utilizing the
learning capacity of neural networks. We get better test metrics and lower
generalization gaps when combined with regularization techniques such as label
smoothing. Softmax gradient tampering improves ResNet-50's test accuracy by
$0.52\%$ over the baseline on the ImageNet dataset. Our approach is very
generic and may be used across a wide range of different network architectures
and datasets.
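Below is a minimal PyTorch-style sketch of the idea described in the abstract: the forward pass computes ordinary softmax cross-entropy, while the backward pass recomputes the logit gradient from a power-transformed distribution q_i = p_i^alpha / sum_j p_j^alpha. The class name `TamperedSoftmaxCE`, the default `alpha`, and the loss plumbing are illustrative assumptions, not the authors' released implementation; `alpha = 1` recovers the standard softmax gradient.

```python
import torch
import torch.nn.functional as F


class TamperedSoftmaxCE(torch.autograd.Function):
    """Softmax cross-entropy with a tampered backward pass (sketch).

    Forward: ordinary softmax cross-entropy on the logits.
    Backward: the logit gradient is recomputed from the power-transformed
    distribution q_i = p_i**alpha / sum_j p_j**alpha instead of p itself.
    alpha = 1 recovers the standard softmax gradient (p - y) / N.
    """

    @staticmethod
    def forward(ctx, logits, targets, alpha=0.5):
        probs = F.softmax(logits, dim=1)
        ctx.save_for_backward(probs, targets)
        ctx.alpha = alpha
        # Standard cross-entropy loss; the forward value is untouched.
        return F.nll_loss(torch.log(probs + 1e-12), targets)

    @staticmethod
    def backward(ctx, grad_output):
        probs, targets = ctx.saved_tensors
        # Power-based probability transform, renormalized to sum to 1.
        q = probs.pow(ctx.alpha)
        q = q / q.sum(dim=1, keepdim=True)
        y = F.one_hot(targets, num_classes=probs.size(1)).to(probs.dtype)
        # Tampered gradient w.r.t. the logits (mean reduction over the batch).
        grad_logits = grad_output * (q - y) / probs.size(0)
        # One gradient per forward input: logits, targets, alpha.
        return grad_logits, None, None
```

A typical call is `loss = TamperedSoftmaxCE.apply(logits, targets, 0.25)` followed by `loss.backward()`. The abstract states that the transform parameter is chosen by grid search on residual networks, so the `alpha` values here are placeholders rather than the paper's settings.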
Related papers
- Mitigating Gradient Overlap in Deep Residual Networks with Gradient Normalization for Improved Non-Convex Optimization [0.0]
In deep learning, Residual Networks (ResNets) have proven effective in addressing the vanishing gradient problem.
However, skip connections in ResNets can lead to gradient overlap, where gradients from the learned transformation and the skip connection combine.
We examine Z-score Normalization (ZNorm) as a technique to manage this overlap.
arXiv Detail & Related papers (2024-10-28T21:54:44Z)
- Gradient-Variation Online Learning under Generalized Smoothness [56.38427425920781]
Gradient-variation online learning aims to achieve regret guarantees that scale with the variations in the gradients of online functions.
Recent efforts in neural network optimization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms.
We provide applications to fast-rate convergence in games and to extended adversarial optimization.
arXiv Detail & Related papers (2024-08-17T02:22:08Z)
- Forward Gradient-Based Frank-Wolfe Optimization for Memory Efficient Deep Neural Network Training [0.0]
This paper focuses on analyzing the performance of the well-known Frank-Wolfe algorithm.
We show that the proposed algorithm converges to the optimal solution at a sub-linear rate.
In contrast, the standard Frank-Wolfe algorithm, when provided with access to the Projected Forward Gradient, fails to converge to the optimal solution.
arXiv Detail & Related papers (2024-03-19T07:25:36Z)
- How to guess a gradient [68.98681202222664]
We show that gradients are more structured than previously thought.
Exploiting this structure can significantly improve gradient-free optimization schemes.
We highlight new challenges in overcoming the large gap between optimizing with exact gradients and guessing the gradients.
arXiv Detail & Related papers (2023-12-07T21:40:44Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks [82.61182037130405]
Stochastic gradient descent (SGD) is the main approach for training deep networks.
In this work, we compare Adam-based variants that adapt the step size using the difference between the present and past gradients.
We test ensembles of networks and their fusion with a ResNet-50 trained with SGD.
arXiv Detail & Related papers (2021-03-26T18:55:08Z)
- Channel-Directed Gradients for Optimization of Convolutional Neural Networks [50.34913837546743]
We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error.
We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental.
arXiv Detail & Related papers (2020-08-25T00:44:09Z)