Softmax Gradient Tampering: Decoupling the Backward Pass for Improved
Fitting
- URL: http://arxiv.org/abs/2111.12495v1
- Date: Wed, 24 Nov 2021 13:47:36 GMT
- Title: Softmax Gradient Tampering: Decoupling the Backward Pass for Improved
Fitting
- Authors: Bishshoy Das, Milton Mondal, Brejesh Lall, Shiv Dutt Joshi, Sumantra
Dutta Roy
- Abstract summary: We introduce Softmax Gradient Tampering, a technique for modifying the gradients in the backward pass of neural networks.
We demonstrate that modifying the softmax gradients in ConvNets may result in increased training accuracy.
- Score: 8.072117741487046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Softmax Gradient Tampering, a technique for modifying the
gradients in the backward pass of neural networks in order to enhance their
accuracy. Our approach transforms the predicted probability values using a
power-based probability transformation and then recomputes the gradients in the
backward pass. This modification results in a smoother gradient profile, which
we demonstrate empirically and theoretically. We do a grid search for the
transform parameters on residual networks. We demonstrate that modifying the
softmax gradients in ConvNets may result in increased training accuracy, thus
increasing the fit across the training data and maximally utilizing the
learning capacity of neural networks. We get better test metrics and lower
generalization gaps when combined with regularization techniques such as label
smoothing. Softmax gradient tampering improves ResNet-50's test accuracy by
$0.52\%$ over the baseline on the ImageNet dataset. Our approach is very
generic and may be used across a wide range of different network architectures
and datasets.
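Below is a minimal PyTorch-style sketch of the idea described in the abstract: the forward pass computes ordinary softmax cross-entropy, while the backward pass recomputes the logit gradient from a power-transformed distribution q_i = p_i^alpha / sum_j p_j^alpha. The class name `TamperedSoftmaxCE`, the default `alpha`, and the loss plumbing are illustrative assumptions, not the authors' released implementation; `alpha = 1` recovers the standard softmax gradient.

```python
import torch
import torch.nn.functional as F


class TamperedSoftmaxCE(torch.autograd.Function):
    """Softmax cross-entropy with a tampered backward pass (sketch).

    Forward: ordinary softmax cross-entropy on the logits.
    Backward: the logit gradient is recomputed from the power-transformed
    distribution q_i = p_i**alpha / sum_j p_j**alpha instead of p itself.
    alpha = 1 recovers the standard softmax gradient (p - y) / N.
    """

    @staticmethod
    def forward(ctx, logits, targets, alpha=0.5):
        probs = F.softmax(logits, dim=1)
        ctx.save_for_backward(probs, targets)
        ctx.alpha = alpha
        # Standard cross-entropy loss; the forward value is untouched.
        return F.nll_loss(torch.log(probs + 1e-12), targets)

    @staticmethod
    def backward(ctx, grad_output):
        probs, targets = ctx.saved_tensors
        # Power-based probability transform, renormalized to sum to 1.
        q = probs.pow(ctx.alpha)
        q = q / q.sum(dim=1, keepdim=True)
        y = F.one_hot(targets, num_classes=probs.size(1)).to(probs.dtype)
        # Tampered gradient w.r.t. the logits (mean reduction over the batch).
        grad_logits = grad_output * (q - y) / probs.size(0)
        # One gradient per forward input: logits, targets, alpha.
        return grad_logits, None, None
```

A typical call is `loss = TamperedSoftmaxCE.apply(logits, targets, 0.25)` followed by `loss.backward()`. The abstract states that the transform parameter is chosen by grid search on residual networks, so the `alpha` values here are placeholders rather than the paper's settings.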
Related papers
- Mitigating Gradient Overlap in Deep Residual Networks with Gradient Normalization for Improved Non-Convex Optimization [0.0]
In deep learning, Residual Networks (ResNets) have proven effective in addressing the vanishing gradient problem.
However, skip connections in ResNets can lead to gradient overlap, where gradients from the learned transformation and the skip connection combine.
We examine Z-score Normalization (ZNorm) as a technique to manage this overlap.
arXiv Detail & Related papers (2024-10-28T21:54:44Z)
- Gradient-Variation Online Learning under Generalized Smoothness [56.38427425920781]
Gradient-variation online learning aims to achieve regret guarantees that scale with the variations in the gradients of online functions.
Recent efforts in neural network optimization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms.
We provide applications to fast-rate convergence in games and to extended adversarial optimization.
arXiv Detail & Related papers (2024-08-17T02:22:08Z)
- Forward Gradient-Based Frank-Wolfe Optimization for Memory Efficient Deep Neural Network Training [0.0]
This paper focuses on analyzing the performance of the well-known Frank-Wolfe algorithm.
We show that the proposed algorithm converges to the optimal solution at a sub-linear rate.
In contrast, the standard Frank-Wolfe algorithm, when provided with access to the Projected Forward Gradient, fails to converge to the optimal solution.
arXiv Detail & Related papers (2024-03-19T07:25:36Z)
- How to guess a gradient [68.98681202222664]
We show that gradients are more structured than previously thought.
Exploiting this structure can significantly improve gradient-free optimization schemes.
We highlight new challenges in overcoming the large gap between optimizing with exact gradients and guessing the gradients.
arXiv Detail & Related papers (2023-12-07T21:40:44Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks [82.61182037130405]
Stochastic gradient descent (SGD) is the main approach for training deep networks.
In this work, we compare Adam-based variants that adapt the step size using the difference between the present and past gradients.
We test ensembles of networks and their fusion with a ResNet-50 trained with SGD.
arXiv Detail & Related papers (2021-03-26T18:55:08Z)
- Channel-Directed Gradients for Optimization of Convolutional Neural Networks [50.34913837546743]
We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error.
We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental.
arXiv Detail & Related papers (2020-08-25T00:44:09Z)