Zero redundancy distributed learning with differential privacy
- URL: http://arxiv.org/abs/2311.11822v1
- Date: Mon, 20 Nov 2023 14:58:56 GMT
- Title: Zero redundancy distributed learning with differential privacy
- Authors: Zhiqi Bu, Justin Chiu, Ruixuan Liu, Sheng Zha, George Karypis
- Abstract summary: We develop a new systematic solution, DP-ZeRO, to scale up the trainable DP model size.
Our DP-ZeRO has the potential to train models with arbitrary size and is evaluated on the world's largest DP models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning with large models has achieved great success in a wide
range of domains. However, training models with billions of parameters is highly
challenging in terms of the training speed, memory cost, and communication
efficiency, especially under the privacy-preserving regime with differential
privacy (DP). On the one hand, DP optimization is comparably efficient to
standard non-private optimization on a single GPU, but on multiple GPUs existing
DP distributed learning (such as pipeline parallelism) suffers from
significantly worse efficiency. On the other hand, the Zero Redundancy
Optimizer (ZeRO) is a state-of-the-art solution to the standard distributed
learning, exhibiting excellent training efficiency on large models, but making
it compatible with DP is technically complicated. In this work, we develop a new
systematic solution, DP-ZeRO, (I) to scale up the trainable DP model size, e.g.
to GPT-100B, (II) to obtain the same computation and communication efficiency
as the standard ZeRO, and (III) to enable mixed-precision DP training. Our
DP-ZeRO, like the standard ZeRO, has the potential to train models of
arbitrary size and is evaluated on the world's largest DP models in terms of
the number of trainable parameters.
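The DP optimization discussed above typically refers to DP-SGD, whose core step clips each per-example gradient to a fixed L2 norm and adds Gaussian noise before averaging. A minimal NumPy sketch of that step (illustrative only, not the paper's implementation; the function and parameter names `dp_sgd_step`, `clip_norm`, and `noise_multiplier` are assumptions):

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One DP-SGD gradient step: clip each example's gradient to L2 norm
    `clip_norm`, sum the clipped gradients, add Gaussian noise scaled by
    `noise_multiplier * clip_norm`, and average over the batch."""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down only if the per-example norm exceeds the clip threshold.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)
```

The per-example clipping is what makes DP training memory- and communication-heavy at scale: every example's gradient must be bounded individually before aggregation, which is the step that systems such as the one described here must reconcile with ZeRO's sharded optimizer states.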
Related papers
- Towards Efficient Pareto Set Approximation via Mixture of Experts Based Model Fusion [53.33473557562837]
Solving multi-objective optimization problems for large deep neural networks is a challenging task due to the complexity of the loss landscape and the expensive computational cost.
We propose a practical and scalable approach to solve this problem via mixture of experts (MoE) based model fusion.
By ensembling the weights of specialized single-task models, the MoE module can effectively capture the trade-offs between multiple objectives.
arXiv Detail & Related papers (2024-06-14T07:16:18Z)
- Sparsity-Preserving Differentially Private Training of Large Embedding Models [67.29926605156788]
DP-SGD is a training algorithm that combines differential privacy with gradient descent.
Applying DP-SGD naively to embedding models can destroy gradient sparsity, leading to reduced training efficiency.
We present two new algorithms, DP-FEST and DP-AdaFEST, that preserve gradient sparsity during private training of large embedding models.
arXiv Detail & Related papers (2023-11-14T17:59:51Z)
- Equivariant Differentially Private Deep Learning: Why DP-SGD Needs Sparser Models [7.49320945341034]
We show that small and efficient architecture design can outperform current state-of-the-art models with substantially lower computational requirements.
Our results are a step towards efficient model architectures that make optimal use of their parameters.
arXiv Detail & Related papers (2023-01-30T17:43:47Z)
- DPIS: An Enhanced Mechanism for Differentially Private SGD with Importance Sampling [19.59757201902467]
Differential privacy (DP) has become a well-accepted standard for privacy protection, and deep neural networks (DNNs) have been immensely successful in machine learning.
A classic mechanism for this purpose is DP-SGD, a differentially private version of the stochastic gradient descent (SGD) algorithm commonly used for training.
We propose DPIS, a novel mechanism for differentially private SGD training that can be used as a drop-in replacement of the core of DP-SGD.
arXiv Detail & Related papers (2022-10-18T07:03:14Z)
- Differentially Private Optimization on Large Model at Small Cost [39.93710312222771]
Differentially private (DP) optimization is the standard paradigm to learn large neural networks that are accurate and privacy-preserving.
Existing DP implementations are 2-1000X more costly in time and space complexity than the standard (non-private) training.
We develop a novel Book-Keeping (BK) technique that implements existing DP optimizers (thus achieving the same accuracy) with a substantial improvement in computational cost.
arXiv Detail & Related papers (2022-09-30T18:38:53Z)
- Differentially Private Bias-Term Fine-tuning of Foundation Models [36.55810474925956]
We study the problem of differentially private (DP) fine-tuning of large pre-trained models.
We propose DP-BiTFiT, which matches the state-of-the-art accuracy for DP algorithms and the efficiency of the standard BiTFiT.
On a wide range of tasks, DP-BiTFiT is 230X faster and uses 28X less memory than DP full fine-tuning.
arXiv Detail & Related papers (2022-09-30T18:30:48Z)
- Large Scale Transfer Learning for Differentially Private Image Classification [51.10365553035979]
Differential Privacy (DP) provides a formal framework for training machine learning models with individual example level privacy.
Private training using DP-SGD protects against leakage by injecting noise into individual example gradients.
While this result is quite appealing, the computational cost of training large-scale models with DP-SGD is substantially higher than non-private training.
arXiv Detail & Related papers (2022-05-06T01:22:20Z)
- Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation so can be integrated with neural networks seamlessly.
arXiv Detail & Related papers (2021-12-07T11:26:41Z)
- Don't Generate Me: Training Differentially Private Generative Models with Sinkhorn Divergence [73.14373832423156]
We propose DP-Sinkhorn, a novel optimal transport-based generative method for learning data distributions from private data with differential privacy.
Unlike existing approaches for training differentially private generative models, we do not rely on adversarial objectives.
arXiv Detail & Related papers (2021-11-01T18:10:21Z)
- Large Language Models Can Be Strong Differentially Private Learners [70.0317718115406]
Differentially Private (DP) learning has seen limited success for building large deep learning models of text.
We show that this performance drop can be mitigated with the use of large pretrained models.
We propose a memory saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients.
arXiv Detail & Related papers (2021-10-12T01:45:27Z)
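The memory-saving clipping mentioned in the last entry is commonly realized as "ghost clipping": for a linear layer, example i's weight gradient is the outer product of its output gradient and its input activation, so its Frobenius norm factors as the product of the two vector norms and can be computed without ever materializing a per-example gradient. A hedged NumPy sketch of that identity (the function name `ghost_grad_norms` is illustrative, not from the paper):

```python
import numpy as np

def ghost_grad_norms(activations, output_grads):
    """Per-example weight-gradient norms for a linear layer, computed
    without materializing per-example gradients. For example i the
    weight gradient is outer(output_grads[i], activations[i]), whose
    Frobenius norm equals ||output_grads[i]|| * ||activations[i]||."""
    a_norms = np.linalg.norm(activations, axis=1)   # one norm per example
    g_norms = np.linalg.norm(output_grads, axis=1)  # one norm per example
    return a_norms * g_norms
```

Given these norms, each example's contribution can be rescaled during a single aggregated backward pass, which is why this style of clipping avoids the O(batch x model) memory blow-up of naive per-example gradients.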
This list is automatically generated from the titles and abstracts of the papers in this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.