Inference and Interference: The Role of Clipping, Pruning and Loss
Landscapes in Differentially Private Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2311.06839v1
- Date: Sun, 12 Nov 2023 13:31:35 GMT
- Title: Inference and Interference: The Role of Clipping, Pruning and Loss
Landscapes in Differentially Private Stochastic Gradient Descent
- Authors: Lauren Watson, Eric Gan, Mohan Dantam, Baharan Mirzasoleiman, Rik
Sarkar
- Abstract summary: Differentially private stochastic gradient descent (DP-SGD) is known to have poorer training and test performance on large neural networks.
We compare the behavior of the two processes separately in early and late epochs.
We find that while DP-SGD makes slower progress in early stages, it is the behavior in the later stages that determines the end result.
- Score: 13.27004430044574
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Differentially private stochastic gradient descent (DP-SGD) is known to have
poorer training and test performance on large neural networks, compared to
ordinary stochastic gradient descent (SGD). In this paper, we perform a
detailed study and comparison of the two processes and unveil several new
insights. By comparing the behavior of the two processes separately in early
and late epochs, we find that while DP-SGD makes slower progress in early
stages, it is the behavior in the later stages that determines the end result.
A separate analysis of the clipping and noise addition steps of DP-SGD shows
that while noise introduces errors to the process, gradient descent can recover
from these errors when it is not clipped, and clipping appears to have a larger
impact than noise. These effects are amplified in higher dimensions (large
neural networks), where the loss basin occupies a lower dimensional space. We
argue theoretically and using extensive experiments that magnitude pruning can
be a suitable dimension reduction technique in this regard, and find that heavy
pruning can improve the test accuracy of DP-SGD.
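To make the two mechanisms discussed in the abstract concrete, below is a minimal NumPy sketch (not the authors' implementation) of a single DP-SGD update built from per-example gradient clipping and Gaussian noise addition, together with a magnitude-pruning mask used as a simple dimension-reduction device. The toy quadratic loss, the hyperparameter names (clip_norm, noise_multiplier, prune_fraction), and the noise scale are illustrative assumptions only.

```python
# Minimal sketch (illustrative assumptions, not the paper's code):
# one DP-SGD update = per-example clipping + Gaussian noise addition,
# plus a magnitude-pruning mask that confines updates to a lower-dimensional subspace.
import numpy as np

def dp_sgd_step(w, per_example_grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    """Clip each example's gradient to clip_norm, average, then add Gaussian noise."""
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    mean_grad = np.mean(clipped, axis=0)
    # Standard DP-SGD noise scale for the averaged gradient: sigma * C / batch size.
    noise = np.random.normal(0.0, noise_multiplier * clip_norm / len(per_example_grads),
                             size=w.shape)
    return w - lr * (mean_grad + noise)

def magnitude_prune_mask(w, prune_fraction=0.9):
    """Keep only the largest-magnitude weights; zero out the rest."""
    k = max(1, int(round((1.0 - prune_fraction) * w.size)))
    threshold = np.sort(np.abs(w).ravel())[-k]
    return (np.abs(w) >= threshold).astype(w.dtype)

# Toy usage: per-example loss 0.5 * ||w - x||^2, so the gradient is (w - x).
rng = np.random.default_rng(0)
w_star = rng.normal(size=50)          # hypothetical "true" weights
w = np.zeros(50)
mask = magnitude_prune_mask(rng.normal(size=50), prune_fraction=0.9)
for _ in range(200):
    batch = w_star + rng.normal(scale=0.1, size=(32, 50))
    grads = [(w - x) * mask for x in batch]   # updates confined to the pruned subspace
    w = dp_sgd_step(w, grads)
```

In this reading, the mask plays the role of the dimension reduction argued for above: clipping and noise then act only on the retained coordinates of the lower-dimensional loss basin.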
Related papers
- Differentially Private SGD Without Clipping Bias: An Error-Feedback Approach [62.000948039914135]
Using Differentially Private Stochastic Gradient Descent with Gradient Clipping (DPSGD-GC) to ensure Differential Privacy (DP) comes at the cost of model performance degradation.
We propose a new error-feedback (EF) DP algorithm as an alternative to DPSGD-GC.
We establish an algorithm-specific DP analysis for our proposed algorithm, providing privacy guarantees based on Rényi DP.
arXiv Detail & Related papers (2023-11-24T17:56:44Z) - SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z) - Improving Differentially Private SGD via Randomly Sparsified Gradients [31.295035726077366]
Differentially private stochastic gradient descent (DP-SGD) has been widely adopted in deep learning to provide rigorously defined privacy guarantees.
We propose to utilize randomly sparsified (RS) gradients, which reduces communication cost through gradient compression and can strengthen the privacy bound.
arXiv Detail & Related papers (2021-12-01T21:43:34Z) - Differentially private training of neural networks with Langevin
dynamics for calibrated predictive uncertainty [58.730520380312676]
We show that differentially private stochastic gradient descent (DP-SGD) can yield poorly calibrated, overconfident deep learning models.
This represents a serious issue for safety-critical applications, e.g. in medical diagnosis.
arXiv Detail & Related papers (2021-07-09T08:14:45Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM
in Deep Learning [165.47118387176607]
It is not clear yet why ADAM-alike adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed.
Specifically, we observe the heavy tails of gradient noise in these algorithms.
arXiv Detail & Related papers (2020-10-12T12:00:26Z) - On the Generalization Benefit of Noise in Stochastic Gradient Descent [34.127525925676416]
It has long been argued that minibatch stochastic gradient descent can generalize better than large batch gradient descent in deep neural networks.
We show that small or moderately large batch sizes can substantially outperform very large batches on the test set.
arXiv Detail & Related papers (2020-06-26T16:18:54Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - The Impact of the Mini-batch Size on the Variance of Gradients in
Stochastic Gradient Descent [28.148743710421932]
The mini-batch stochastic gradient descent (SGD) algorithm is widely used in training machine learning models.
We study SGD dynamics under linear regression and two-layer linear networks, with an easy extension to deeper linear networks.
arXiv Detail & Related papers (2020-04-27T20:06:11Z) - The Break-Even Point on Optimization Trajectories of Deep Neural
Networks [64.7563588124004]
We argue for the existence of a "break-even" point on the optimization trajectory of a deep neural network.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.