Related papers: DP-AdamBC: Your DP-Adam Is Actually DP-SGD (Unless You Apply Bias Correction)

URL: http://arxiv.org/abs/2312.14334v1
Date: Thu, 21 Dec 2023 23:42:00 GMT
Title: DP-AdamBC: Your DP-Adam Is Actually DP-SGD (Unless You Apply Bias Correction)
Authors: Qiaoyue Tang, Frederick Shpilevskiy, Mathias Lécuyer
Abstract summary: We propose DP-AdamBC, an optimization algorithm which removes the bias in the second moment estimation and retrieves the expected behaviour of Adam. DP-AdamBC significantly improves the optimization performance of DP-Adam by up to 3.5% in final accuracy in image, text, and graph node classification tasks.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Adam optimizer is a popular choice in contemporary deep learning, due to its strong empirical performance. However, we observe that in privacy-sensitive scenarios, the traditional use of Differential Privacy (DP) with the Adam optimizer leads to sub-optimal performance on several tasks. We find that this performance degradation is due to a DP bias in Adam's second moment estimator, introduced by the addition of independent noise in the gradient computation to enforce DP guarantees. This DP bias leads to a different scaling for low-variance parameter updates that is inconsistent with the behavior of non-private Adam. We propose DP-AdamBC, an optimization algorithm which removes the bias in the second moment estimation and retrieves the expected behaviour of Adam. Empirically, DP-AdamBC significantly improves the optimization performance of DP-Adam by up to 3.5% in final accuracy in image, text, and graph node classification tasks.
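
The bias described in the abstract can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' implementation: it assumes the DP noise is zero-mean Gaussian with a known standard deviation `noise_std`, holds the true gradient constant across steps, and uses a hypothetical function name. Under those assumptions the expected squared noisy gradient is the true squared gradient plus the noise variance, so subtracting the known variance removes the bias from the second moment estimate.

```python
import random


def second_moment_estimates(grad, noise_std, steps, beta2=0.999, seed=0):
    """Run Adam's second-moment EMA on a constant gradient plus Gaussian noise.

    Hypothetical helper for illustration only. Returns a pair:
    (debiased EMA of squared noisy gradients, same value minus noise variance).
    """
    rng = random.Random(seed)
    v = 0.0
    for _ in range(steps):
        # Assumption: independent zero-mean Gaussian noise added to the gradient.
        noisy = grad + rng.gauss(0.0, noise_std)
        v = beta2 * v + (1.0 - beta2) * noisy * noisy
    # Standard Adam initialization debiasing of the moving average.
    v_hat = v / (1.0 - beta2 ** steps)
    # The added noise inflates E[noisy^2] by its variance; subtract it back out.
    v_corrected = v_hat - noise_std ** 2
    return v_hat, v_corrected


if __name__ == "__main__":
    v_hat, v_corrected = second_moment_estimates(grad=0.1, noise_std=0.1, steps=20000)
    print("uncorrected estimate:", v_hat)        # close to grad^2 + noise_std^2
    print("corrected estimate:  ", v_corrected)  # close to grad^2
```

When the true gradient is small relative to the noise, the corrected estimate can fluctuate around zero and even turn negative, so a practical update would need a positive lower bound before taking the square root; the value of that bound is a tuning choice not specified here.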

Related papers

Memory-Efficient Differentially Private Training with Gradient Random Projection [23.309769734156383]
Differential privacy (DP) protects sensitive data during neural network training. Standard methods like DP-Adam suffer from high memory overhead due to per-sample gradient clipping. We introduce DP-GRAPE (Gradient RAndom ProjEction), a DP training method that significantly reduces memory usage.
arXiv Detail & Related papers (2025-06-18T16:05:09Z)
Technical Report: Full Version of Analyzing and Optimizing Perturbation of DP-SGD Geometrically [7.905629859216635]
We first generalize DP-SGD and theoretically derive the impact of DP noise on the training process. Our analysis reveals that, in terms of a perturbed gradient, only the noise on direction has eminent impact on the model efficiency. We design a geometric strategy GeoDP within the DP framework, which perturbs the direction and the magnitude of a gradient.
arXiv Detail & Related papers (2025-04-08T02:26:10Z)
CAdam: Confidence-Based Optimization for Online Learning [35.84013976735154]
We introduce CAdam, a confidence-based optimization strategy that assesses the consistence between the momentum and the gradient for each parameter dimension before deciding on updates. Our experiments with both synthetic and real-world datasets demonstrate that CAdam surpasses other well-known systems. In large-scale A/B testing within a live recommendation system, CAdam significantly enhances model performance compared to Adam.
arXiv Detail & Related papers (2024-11-29T12:00:27Z)
Deconstructing What Makes a Good Optimizer for Language Models [7.9224468703944115]
We compare several optimization algorithms, including SGD, Adafactor, Adam, and Lion, in the context of autoregressive language modeling. Our findings indicate that, except for SGD, these algorithms all perform comparably in their optimal performance.
arXiv Detail & Related papers (2024-07-10T18:11:40Z)
Pre-training Differentially Private Models with Limited Public Data [54.943023722114134]
Differential privacy (DP) is a prominent method to gauge the degree of security provided to the models. DP is not yet capable of protecting a substantial portion of the data used during the initial pre-training stage. We develop a novel DP continual pre-training strategy using only 10% of public data. Our strategy can achieve DP accuracy of 41.5% on ImageNet-21k, as well as non-DP accuracy of 55.7% and 60.0% on downstream tasks Places365 and iNaturalist-2021.
arXiv Detail & Related papers (2024-02-28T23:26:27Z)
Differentially Private SGD Without Clipping Bias: An Error-Feedback Approach [62.000948039914135]
Using Differentially Private Gradient Descent with Gradient Clipping (DPSGD-GC) to ensure Differential Privacy (DP) comes at the cost of model performance degradation. We propose a new error-feedback (EF) DP algorithm as an alternative to DPSGD-GC. We establish an algorithm-specific DP analysis for our proposed algorithm, providing privacy guarantees based on Rényi DP.
arXiv Detail & Related papers (2023-11-24T17:56:44Z)
DP-Adam: Correcting DP Bias in Adam's Second Moment Estimation [0.0]
We observe that the traditional use of DP with the Adam optimizer introduces a bias in the second moment estimation, due to the addition of independent noise in the gradient computation. This bias leads to a different scaling for low-variance parameter updates that is inconsistent with the behavior of non-private Adam and with Adam's sign descent interpretation.
arXiv Detail & Related papers (2023-04-21T18:43:37Z)
Make Landscape Flatter in Differentially Private Federated Learning [69.78485792860333]
We propose a novel DPFL algorithm named DP-FedSAM, which leverages gradient perturbation to mitigate the negative impact of DP. Specifically, DP-FedSAM integrates local flatness models with better stability and weight robustness, which results in the small norm of local updates and robustness to DP noise. Our algorithm achieves state-of-the-art (SOTA) performance compared with existing SOTA baselines in DPFL.
arXiv Detail & Related papers (2023-03-20T16:27:36Z)
DP-FP: Differentially Private Forward Propagation for Large Models [2.062295244789704]
We show how to mitigate the performance drop by replacing the Differential Private Gradient Descent with a novel DP Forward-Propagation (DP-FP). Our DP-FP achieves an average accuracy of 91.34% with privacy budgets less than 3, representing a 3.81% performance improvement over the state-of-the-art DP-SGD.
arXiv Detail & Related papers (2021-12-29T07:32:29Z)
Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications. We propose a new method named Adam$^+$ (pronounced as Adam-plus). Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z)
Private Stochastic Non-Convex Optimization: Adaptive Algorithms and Tighter Generalization Bounds [72.63031036770425]
We propose differentially private (DP) algorithms for bound non-dimensional optimization. We demonstrate two popular deep learning methods on the empirical advantages over standard gradient methods.
arXiv Detail & Related papers (2020-06-24T06:01:24Z)
AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning. It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights. In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.