Scaling Private Deep Learning with Low-Rank and Sparse Gradients
- URL: http://arxiv.org/abs/2207.02699v1
- Date: Wed, 6 Jul 2022 14:09:47 GMT
- Title: Scaling Private Deep Learning with Low-Rank and Sparse Gradients
- Authors: Ryuichi Ito, Seng Pei Liew, Tsubasa Takahashi, Yuya Sasaki, Makoto Onizuka
- Abstract summary: We propose a framework that exploits the low-rank and sparse structure of neural networks to reduce the dimension of gradient updates.
A novel strategy is utilized to sparsify the gradients, resulting in low-dimensional, less noisy updates.
Empirical evaluation on natural language processing and computer vision tasks shows that our method outperforms other state-of-the-art baselines.
- Score: 5.14780936727027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Applying Differentially Private Stochastic Gradient Descent (DPSGD) to
training modern, large-scale neural networks such as transformer-based models
is a challenging task, as the magnitude of noise added to the gradients at each
iteration scales with model dimension, hindering the learning capability
significantly. We propose a unified framework, $\textsf{LSG}$, that fully
exploits the low-rank and sparse structure of neural networks to reduce the
dimension of gradient updates, and hence alleviate the negative impacts of
DPSGD. The gradient updates are first approximated with a pair of low-rank
matrices. Then, a novel strategy is utilized to sparsify the gradients,
resulting in low-dimensional, less noisy updates that are yet capable of
retaining the performance of neural networks. Empirical evaluation on natural
language processing and computer vision tasks shows that our method outperforms
other state-of-the-art baselines.
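The pipeline the abstract describes (low-rank factorization of the gradient, then sparsification, then clipping and noise addition on the reduced coordinates) can be illustrated with a minimal numpy sketch. All function and parameter names below are invented for illustration; the paper's actual LSG algorithm will differ in its factorization, sparsification strategy, and privacy accounting.

```python
import numpy as np

def lowrank_sparse_dp_update(grad, rank=2, keep_frac=0.25,
                             clip_norm=1.0, noise_mult=1.0,
                             rng=np.random.default_rng(0)):
    """Illustrative only: compress a gradient matrix before privatizing it.

    1) Low-rank: truncated SVD approximation of the gradient.
    2) Sparse: keep only the largest-magnitude entries of the factors.
    3) DP: clip each factor and add Gaussian noise only to the small
       number of coordinates that survive compression.
    """
    # 1) low-rank approximation via truncated SVD
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    L = U[:, :rank] * s[:rank]          # left factor, shape (m, rank)
    R = Vt[:rank, :]                    # right factor, shape (rank, n)

    # 2) sparsify: zero out small-magnitude entries of each factor
    def topk_mask(x, frac):
        k = max(1, int(frac * x.size))
        thresh = np.sort(np.abs(x), axis=None)[-k]
        return np.where(np.abs(x) >= thresh, x, 0.0)

    L, R = topk_mask(L, keep_frac), topk_mask(R, keep_frac)

    # 3) clip and privatize the surviving low-dimensional coordinates
    for M in (L, R):
        norm = np.linalg.norm(M)
        M *= min(1.0, clip_norm / (norm + 1e-12))
        mask = M != 0
        M[mask] += rng.normal(0.0, noise_mult * clip_norm, mask.sum())

    return L @ R  # reconstructed low-rank, sparse, noisy update
```

The point of the compression is visible in step 3: noise is injected into far fewer coordinates than the full gradient has, which is how such schemes mitigate the dimension-dependent noise of plain DPSGD.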
Related papers
- Occam Gradient Descent [0.0]
Occam Gradient Descent is an algorithm that simultaneously reduces model size and performs gradient descent on model weights to minimize fitting error.
Our algorithm outperforms traditional gradient descent, with or without post-training pruning, in loss, compute, and model size.
We find that neural networks trained with Occam Gradient Descent outperform neural networks trained with gradient descent, as well as Random Forests, in both loss and model size.
arXiv Detail & Related papers (2024-05-30T15:58:22Z)
- Take A Shortcut Back: Mitigating the Gradient Vanishing for Training Spiking Neural Networks [8.667899218289328]
Spiking Neural Network (SNN) is a biologically inspired neural network infrastructure that has recently garnered significant attention.
Training an SNN directly poses a challenge due to the undefined gradient of the firing spike process.
We propose a shortcut back-propagation method in our paper, which advocates for transmitting the gradient directly from the loss to the shallow layers.
arXiv Detail & Related papers (2024-01-09T10:54:41Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
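The appeal of an implicit update can be seen on a toy stiff problem: an implicit step solves θ_{k+1} = θ_k − η∇L(θ_{k+1}), which for a quadratic loss is stable at step sizes where explicit gradient descent diverges. A hedged numpy sketch, where the matrix, learning rate, and step functions are invented for illustration and are not taken from the paper:

```python
import numpy as np

# Stiff quadratic loss L(th) = 0.5 * th @ A @ th with ill-conditioned A,
# a stand-in for the stiff losses that destabilize PINN training.
A = np.diag([1.0, 100.0])
lr = 0.05  # too large for explicit GD on the stiff direction (lr * 100 > 2)

def explicit_step(th):
    # standard gradient descent: th - lr * grad(th); diverges here
    return th - lr * (A @ th)

def implicit_step(th):
    # implicit step: solve th_new = th - lr * A @ th_new,
    # i.e. (I + lr*A) th_new = th; stable for any lr > 0
    return np.linalg.solve(np.eye(len(th)) + lr * A, th)
```

Running both from the same starting point shows the explicit iterate blowing up along the stiff direction while the implicit iterate contracts monotonically.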
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
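The variance reduction can be made concrete on a single linear layer with a local loss: perturbing the m activations instead of the m×n weights shrinks the random dimension of the forward-gradient estimator. A hedged numpy sketch, with the toy layer, loss, and sample counts invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy layer y = W @ x with a local quadratic loss L(y) = 0.5 * ||y - t||^2
m, n = 8, 16
W = rng.normal(size=(m, n))
x = rng.normal(size=n)
t = rng.normal(size=m)

y = W @ x
true_grad_W = np.outer(y - t, x)   # exact dL/dW for reference

def fg_weight(n_samples=1000):
    # forward gradient with *weight* perturbations: m*n random dims
    est = np.zeros_like(W)
    for _ in range(n_samples):
        V = rng.normal(size=W.shape)
        jvp = np.sum((y - t) * (V @ x))   # directional derivative of L along V
        est += jvp * V
    return est / n_samples

def fg_activation(n_samples=1000):
    # forward gradient with *activation* perturbations: only m random dims,
    # then the exact local chain rule dL/dW = (dL/dy) x^T
    est = np.zeros(m)
    for _ in range(n_samples):
        u = rng.normal(size=m)
        jvp = np.sum((y - t) * u)         # directional derivative of L along u
        est += jvp * u
    return np.outer(est / n_samples, x)
```

Both estimators are unbiased for the true gradient, but the activation version averages over far fewer random coordinates, so for the same sample budget its error is markedly smaller.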
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Improving Deep Learning Interpretability by Saliency Guided Training [36.782919916001624]
Saliency methods have been widely used to highlight important input features in model predictions.
Most existing methods use backpropagation on a modified gradient function to generate saliency maps.
We introduce a saliency guided training procedure for neural networks to reduce noisy gradients used in predictions.
arXiv Detail & Related papers (2021-11-29T06:05:23Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight reparameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but their weight distribution has markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
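The reparameterisation idea can be sketched as follows: training a latent parameter v whose effective weight is w = v·|v|^(α−1) rescales each weight's gradient by a power of its own magnitude, so small weights receive ever-smaller updates and accumulate at zero. A minimal numpy sketch, with the function name and the quadratic toy gradient assumed for illustration:

```python
import numpy as np

def powerprop_step(v, grad_w_fn, alpha=2.0, lr=0.1):
    """One gradient step on v, where the effective weight is w = v*|v|^(alpha-1).

    By the chain rule, dL/dv = dL/dw * alpha * |v|^(alpha-1): weights near
    zero receive vanishing updates, so density accumulates at zero and the
    trained model can be pruned safely.
    """
    w = v * np.abs(v) ** (alpha - 1)          # effective weights
    grad_v = grad_w_fn(w) * alpha * np.abs(v) ** (alpha - 1)
    return v - lr * grad_v
```

With alpha = 1 this reduces to ordinary gradient descent; larger alpha strengthens the rich-get-richer dynamic that induces sparsity.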
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- Inertial Proximal Deep Learning Alternating Minimization for Efficient Neural Network Training [16.165369437324266]
This work develops an improved DLAM via the well-known inertial technique, namely iPDLAM, which predicts a point by linearization of the current and last iterates.
Numerical results on real-world datasets are reported to demonstrate the efficiency of our proposed algorithm.
arXiv Detail & Related papers (2021-01-30T16:40:08Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.