Scaling up Differentially Private Deep Learning with Fast Per-Example
Gradient Clipping
- URL: http://arxiv.org/abs/2009.03106v1
- Date: Mon, 7 Sep 2020 13:51:26 GMT
- Title: Scaling up Differentially Private Deep Learning with Fast Per-Example
Gradient Clipping
- Authors: Jaewoo Lee and Daniel Kifer
- Abstract summary: Recent work on Renyi Differential Privacy has shown the feasibility of applying differential privacy to deep learning tasks.
Despite their promise, differentially private deep networks often lag far behind their non-private counterparts in accuracy.
One of the barriers to this expanded research is the training time -- often orders of magnitude larger than training non-private networks.
- Score: 15.410557873153833
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work on Renyi Differential Privacy has shown the feasibility of
applying differential privacy to deep learning tasks. Despite their promise,
however, differentially private deep networks often lag far behind their
non-private counterparts in accuracy, showing the need for more research in
model architectures, optimizers, etc. One of the barriers to this expanded
research is the training time -- often orders of magnitude larger than training
non-private networks. The reason for this slowdown is a crucial privacy-related
step called "per-example gradient clipping" whose naive implementation undoes
the benefits of batch training with GPUs. By analyzing the back-propagation
equations we derive new methods for per-example gradient clipping that are
compatible with auto-differentiation (e.g., in PyTorch and TensorFlow) and
provide better GPU utilization. Our implementation in PyTorch showed
significant training speed-ups (by factors of 54x - 94x for training various
models with batch sizes of 128). These techniques work for a variety of
architectural choices including convolutional layers, recurrent networks,
attention, residual blocks, etc.
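To make the clipping bottleneck concrete, the sketch below shows the naive per-example gradient clipping loop used in standard DP-SGD, written in PyTorch: one backward pass per example so that each example's gradient can be clipped before aggregation, which is exactly the serialization that erases the benefit of batched GPU training. This is a minimal illustration of the slow baseline, not the paper's derived method; the toy model and the hyperparameters clip_norm, noise_multiplier, and lr are assumptions chosen for the example.

```python
# Minimal sketch of naive per-example gradient clipping (standard DP-SGD style).
# Illustrative only: the model, data, and hyperparameters are placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()
clip_norm = 1.0          # per-example L2 clipping threshold C (assumed)
noise_multiplier = 1.1   # Gaussian noise scale sigma relative to C (assumed)
lr = 0.1                 # learning rate (assumed)

x = torch.randn(128, 20)          # one synthetic batch of 128 examples
y = torch.randint(0, 2, (128,))

params = [p for p in model.parameters() if p.requires_grad]
clipped_sum = [torch.zeros_like(p) for p in params]

# Naive per-example clipping: one backward pass per example, so the batch
# is processed serially instead of in a single batched GPU pass.
for i in range(x.shape[0]):
    model.zero_grad()
    loss = loss_fn(model(x[i:i + 1]), y[i:i + 1])
    loss.backward()
    grads = [p.grad.detach().clone() for p in params]
    total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    scale = min(1.0, clip_norm / (total_norm.item() + 1e-6))
    for acc, g in zip(clipped_sum, grads):
        acc.add_(g, alpha=scale)

# Perturb the clipped sum with Gaussian noise and take an averaged step.
with torch.no_grad():
    for p, acc in zip(params, clipped_sum):
        noise = torch.randn_like(acc) * noise_multiplier * clip_norm
        p.add_(-(lr / x.shape[0]) * (acc + noise))
```

The paper's methods, by contrast, derive the clipped aggregate from the back-propagation equations so that the work stays batched on the GPU, avoiding this Python-level loop.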
Related papers
- Stepping Forward on the Last Mile [8.756033984943178]
We propose a series of algorithm enhancements that further reduce the memory footprint and the accuracy gap compared to backpropagation.
Our results demonstrate that on the last mile of model customization on edge devices, training with fixed-point forward gradients is a feasible and practical approach.
arXiv Detail & Related papers (2024-11-06T16:33:21Z)
- Equivariant Differentially Private Deep Learning: Why DP-SGD Needs Sparser Models [7.49320945341034]
We show that small and efficient architecture design can outperform current state-of-the-art models with substantially lower computational requirements.
Our results are a step towards efficient model architectures that make optimal use of their parameters.
arXiv Detail & Related papers (2023-01-30T17:43:47Z)
- Exploring the Limits of Differentially Private Deep Learning with Group-wise Clipping [91.60608388479645]
We show that per-layer clipping allows clipping to be performed in conjunction with backpropagation in differentially private optimization.
This results in private learning that is as memory-efficient and almost as fast per training update as non-private learning for many workflows of interest.
arXiv Detail & Related papers (2022-12-03T05:20:15Z)
- Fine-Tuning with Differential Privacy Necessitates an Additional Hyperparameter Search [38.83524780461911]
We show how carefully selecting the layers being fine-tuned in the pretrained neural network allows us to establish new state-of-the-art tradeoffs between privacy and accuracy.
We achieve 77.9% accuracy for $(\varepsilon, \delta) = (2, 10^{-5})$ on CIFAR-100 for a model pretrained on ImageNet.
arXiv Detail & Related papers (2022-10-05T11:32:49Z)
- Training Your Sparse Neural Network Better with Any Mask [106.134361318518]
Pruning large neural networks to create high-quality, independently trainable sparse masks is desirable.
In this paper we demonstrate an alternative opportunity: one can customize the sparse training techniques to deviate from the default dense network training protocols.
Our new sparse training recipe is generally applicable to improving training from scratch with various sparse masks.
arXiv Detail & Related papers (2022-06-26T00:37:33Z)
- Large Scale Transfer Learning for Differentially Private Image Classification [51.10365553035979]
Differential Privacy (DP) provides a formal framework for training machine learning models with individual example level privacy.
Private training using DP-SGD protects against leakage by injecting noise into individual example gradients (the standard clip-and-noise update is sketched after this list).
While this result is quite appealing, the computational cost of training large-scale models with DP-SGD is substantially higher than non-private training.
arXiv Detail & Related papers (2022-05-06T01:22:20Z)
- APP: Anytime Progressive Pruning [104.36308667437397]
We propose a novel way of training a neural network with a target sparsity in a particular case of online learning: the anytime learning at macroscale paradigm (ALMA).
The proposed approach significantly outperforms the baseline dense and Anytime OSP models across multiple architectures and datasets under short, moderate, and long-sequence training.
arXiv Detail & Related papers (2022-04-04T16:38:55Z)
- Don't Generate Me: Training Differentially Private Generative Models with Sinkhorn Divergence [73.14373832423156]
We propose DP-Sinkhorn, a novel optimal transport-based generative method for learning data distributions from private data with differential privacy.
Unlike existing approaches for training differentially private generative models, we do not rely on adversarial objectives.
arXiv Detail & Related papers (2021-11-01T18:10:21Z)
- Differentially Private Deep Learning with Direct Feedback Alignment [15.410557873153833]
We propose the first differentially private method for training deep neural networks with direct feedback alignment (DFA).
DFA achieves significant gains in accuracy (often by 10-20%) compared to backprop-based differentially private training on a variety of architectures.
arXiv Detail & Related papers (2020-10-08T00:25:22Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
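For reference, several entries above rely on the standard DP-SGD step: clip each per-example gradient to an L2 norm bound C, sum the clipped gradients, and perturb the sum with Gaussian noise before averaging. The formulation below is the generic one from the DP-SGD literature, not text taken from any abstract on this page.

```latex
% Standard DP-SGD update: per-example clipping followed by Gaussian noise.
\[
  g_i = \nabla_\theta \, \ell(x_i; \theta), \qquad
  \bar{g}_i = g_i \cdot \min\!\left(1, \frac{C}{\lVert g_i \rVert_2}\right),
\]
\[
  \theta \leftarrow \theta - \frac{\eta}{|B|}
  \left( \sum_{i \in B} \bar{g}_i + \mathcal{N}\!\left(0, \sigma^2 C^2 I\right) \right)
\]
```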
This list is automatically generated from the titles and abstracts of the papers on this site.