Related papers: On the accuracy and efficiency of group-wise clipping in differentially private optimization

On the accuracy and efficiency of group-wise clipping in differentially private optimization

URL: http://arxiv.org/abs/2310.19215v1
Date: Mon, 30 Oct 2023 01:01:15 GMT
Title: On the accuracy and efficiency of group-wise clipping in differentially private optimization
Authors: Zhiqi Bu, Ruixuan Liu, Yu-Xiang Wang, Sheng Zha, George Karypis
Abstract summary: We show that different clipping styles have the same time complexity but instantiate an accuracy-memory trade-off. We demonstrate that the accuracy gap between group-wise clipping and all-layer clipping becomes smaller for larger models.
Score: 38.80873569002277
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances have substantially improved the accuracy, memory cost, and training speed of differentially private (DP) deep learning, especially on large vision and language models with millions to billions of parameters. In this work, we thoroughly study the per-sample gradient clipping style, a key component in DP optimization. We show that different clipping styles have the same time complexity but instantiate an accuracy-memory trade-off: while the all-layer clipping (of coarse granularity) is the most prevalent and usually gives the best accuracy, it incurs heavier memory cost compared to other group-wise clipping, such as the layer-wise clipping (of finer granularity). We formalize this trade-off through our convergence theory and complexity analysis. Importantly, we demonstrate that the accuracy gap between group-wise clipping and all-layer clipping becomes smaller for larger models, while the memory advantage of the group-wise clipping remains. Consequently, the group-wise clipping allows DP optimization of large models to achieve high accuracy and low peak memory simultaneously.

Related papers

HOT: Hadamard-based Optimized Training [7.193483612237862]
It has become increasingly important to optimize backpropagation to reduce memory usage and computational overhead. In this paper, we focus on matrix multiplication, which accounts for the largest portion of training costs, and analyze its backpropagation in detail. Based on this analysis, we introduce a novel method, Hadamard-based Optimized Training. In this approach, we apply Hadamard-based optimizations, such as Hadamard quantization and Hadamard low-rank approximation. Our extensive analysis shows that HOT achieves up to 75% memory savings and a 2.6 times acceleration on real
arXiv Detail & Related papers (2025-03-27T08:37:24Z)
Nearly Lossless Adaptive Bit Switching [8.485009775430411]
Experimental results on the ImageNet-1K classification demonstrate that our methods have enough advantages to state-of-the-art one-shot joint QAT in both multi-precision and mixed-precision.
arXiv Detail & Related papers (2025-02-03T09:46:26Z)
XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference [9.65524177141491]
Large Language Model (LLM) inference generates output tokens one-by-one, leading to many redundant computations. KV-Cache framework makes a compromise between time and space complexities. Existing studies reduce memory consumption by evicting some of cached data that have less important impact on inference accuracy. We show that customizing the cache size for each layer in a personalized manner can yield a significant memory reduction.
arXiv Detail & Related papers (2024-12-08T11:32:08Z)
MPruner: Optimizing Neural Network Size with CKA-Based Mutual Information Pruning [7.262751938473306]
Pruning is a well-established technique that reduces the size of neural networks while mathematically guaranteeing accuracy preservation. We develop a new pruning algorithm, MPruner, that leverages mutual information through vector similarity. MPruner achieved up to a 50% reduction in parameters and memory usage for CNN and transformer-based models, with minimal to no loss in accuracy.
arXiv Detail & Related papers (2024-08-24T05:54:47Z)
AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models. AdaLomo results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance. Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
Exploring the Limits of Differentially Private Deep Learning with Group-wise Clipping [91.60608388479645]
We show that emphper-layer clipping allows clipping to be performed in conjunction with backpropagation in differentially private optimization. This results in private learning that is as memory-efficient and almost as fast per training update as non-private learning for many of interest.
arXiv Detail & Related papers (2022-12-03T05:20:15Z)
Per-Clip Video Object Segmentation [110.08925274049409]
Recently, memory-based approaches show promising results on semisupervised video object segmentation. We treat video object segmentation as clip-wise mask-wise propagation. We propose a new method tailored for the per-clip inference.
arXiv Detail & Related papers (2022-08-03T09:02:29Z)
Automatic Clipping: Differentially Private Deep Learning Made Easier and Stronger [39.93710312222771]
Per-example clipping is a key algorithmic step that enables practical differential private (DP) training for deep learning models. We propose an easy-to-use replacement, called automatic clipping, that eliminates the need to tune R for any DPs.
arXiv Detail & Related papers (2022-06-14T19:49:44Z)
On the Convergence and Calibration of Deep Learning with Differential Privacy [12.297499996547925]
Differentially private (DP) training preserves the data privacy usually at the cost of slower convergence. We show that noise addition only affects the privacy risk but not the convergence or calibration. In sharp contrast, DP models trained with large clipping norm enjoy the same privacy guarantee and similar accuracy, but are significantly more textitd
arXiv Detail & Related papers (2021-06-15T01:32:29Z)
Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration. This enables the learning of long-range dependencies beyond a single clip. Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.