Inference and Interference: The Role of Clipping, Pruning and Loss
Landscapes in Differentially Private Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2311.06839v1
- Date: Sun, 12 Nov 2023 13:31:35 GMT
- Title: Inference and Interference: The Role of Clipping, Pruning and Loss
Landscapes in Differentially Private Stochastic Gradient Descent
- Authors: Lauren Watson, Eric Gan, Mohan Dantam, Baharan Mirzasoleiman, Rik
Sarkar
- Abstract summary: Differentially private stochastic gradient descent (DP-SGD) is known to have poorer training and test performance on large neural networks.
We compare the behavior of the two processes separately in early and late epochs.
We find that while DP-SGD makes slower progress in early stages, it is the behavior in the later stages that determines the end result.
- Score: 13.27004430044574
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Differentially private stochastic gradient descent (DP-SGD) is known to have
poorer training and test performance on large neural networks, compared to
ordinary stochastic gradient descent (SGD). In this paper, we perform a
detailed study and comparison of the two processes and unveil several new
insights. By comparing the behavior of the two processes separately in early
and late epochs, we find that while DP-SGD makes slower progress in early
stages, it is the behavior in the later stages that determines the end result.
A separate analysis of the clipping and noise addition steps of DP-SGD shows
that while noise introduces errors to the process, gradient descent can recover
from these errors when it is not clipped, and clipping appears to have a larger
impact than noise. These effects are amplified in higher dimensions (large
neural networks), where the loss basin occupies a lower dimensional space. We
argue theoretically and using extensive experiments that magnitude pruning can
be a suitable dimension reduction technique in this regard, and find that heavy
pruning can improve the test accuracy of DP-SGD.
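To make the two mechanisms discussed in the abstract concrete, below is a minimal NumPy sketch (not the authors' implementation) of a single DP-SGD update built from per-example gradient clipping and Gaussian noise addition, together with a magnitude-pruning mask used as a simple dimension-reduction device. The toy quadratic loss, the hyperparameter names (clip_norm, noise_multiplier, prune_fraction), and the noise scale are illustrative assumptions only.

```python
# Minimal sketch (illustrative assumptions, not the paper's code):
# one DP-SGD update = per-example clipping + Gaussian noise addition,
# plus a magnitude-pruning mask that confines updates to a lower-dimensional subspace.
import numpy as np

def dp_sgd_step(w, per_example_grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    """Clip each example's gradient to clip_norm, average, then add Gaussian noise."""
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    mean_grad = np.mean(clipped, axis=0)
    # Standard DP-SGD noise scale for the averaged gradient: sigma * C / batch size.
    noise = np.random.normal(0.0, noise_multiplier * clip_norm / len(per_example_grads),
                             size=w.shape)
    return w - lr * (mean_grad + noise)

def magnitude_prune_mask(w, prune_fraction=0.9):
    """Keep only the largest-magnitude weights; zero out the rest."""
    k = max(1, int(round((1.0 - prune_fraction) * w.size)))
    threshold = np.sort(np.abs(w).ravel())[-k]
    return (np.abs(w) >= threshold).astype(w.dtype)

# Toy usage: per-example loss 0.5 * ||w - x||^2, so the gradient is (w - x).
rng = np.random.default_rng(0)
w_star = rng.normal(size=50)          # hypothetical "true" weights
w = np.zeros(50)
mask = magnitude_prune_mask(rng.normal(size=50), prune_fraction=0.9)
for _ in range(200):
    batch = w_star + rng.normal(scale=0.1, size=(32, 50))
    grads = [(w - x) * mask for x in batch]   # updates confined to the pruned subspace
    w = dp_sgd_step(w, grads)
```

In this reading, the mask plays the role of the dimension reduction argued for above: clipping and noise then act only on the retained coordinates of the lower-dimensional loss basin.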
Related papers
- Differentially Private SGD Without Clipping Bias: An Error-Feedback Approach [62.000948039914135]
Using Differentially Private Stochastic Gradient Descent with Gradient Clipping (DPSGD-GC) to ensure Differential Privacy (DP) comes at the cost of model performance degradation.
We propose a new error-feedback (EF) DP algorithm as an alternative to DPSGD-GC.
We establish an algorithm-specific DP analysis for our proposed algorithm, providing privacy guarantees based on Rényi DP.
arXiv Detail & Related papers (2023-11-24T17:56:44Z) - SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z) - Improving Differentially Private SGD via Randomly Sparsified Gradients [31.295035726077366]
Differentially private stochastic gradient descent (DP-SGD) has been widely adopted in deep learning to provide rigorously defined privacy guarantees.
We propose to utilize randomly sparsified (RS) gradients, which reduces communication cost through gradient compression and can strengthen the privacy bound.
arXiv Detail & Related papers (2021-12-01T21:43:34Z) - Differentially private training of neural networks with Langevin
dynamics for calibrated predictive uncertainty [58.730520380312676]
We show that differentially private stochastic gradient descent (DP-SGD) can yield poorly calibrated, overconfident deep learning models.
This represents a serious issue for safety-critical applications, e.g. in medical diagnosis.
arXiv Detail & Related papers (2021-07-09T08:14:45Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM
in Deep Learning [165.47118387176607]
It is not clear yet why ADAM-alike adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed.
Specifically, we observe the heavy tails of gradient noise in these algorithms.
arXiv Detail & Related papers (2020-10-12T12:00:26Z) - On the Generalization Benefit of Noise in Stochastic Gradient Descent [34.127525925676416]
It has long been argued that minibatch stochastic gradient descent can generalize better than large batch gradient descent in deep neural networks.
We show that small or moderately large batch sizes can substantially outperform very large batches on the test set.
arXiv Detail & Related papers (2020-06-26T16:18:54Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - The Impact of the Mini-batch Size on the Variance of Gradients in
Stochastic Gradient Descent [28.148743710421932]
The mini-batch stochastic gradient descent (SGD) algorithm is widely used in training machine learning models.
We study SGD dynamics under linear regression and two-layer linear networks, with an easy extension to deeper linear networks.
arXiv Detail & Related papers (2020-04-27T20:06:11Z) - The Break-Even Point on Optimization Trajectories of Deep Neural
Networks [64.7563588124004]
We argue for the existence of a "break-even" point on the optimization trajectory of a deep neural network.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.