Related papers: On Design Principles for Private Adaptive Optimizers

On Design Principles for Private Adaptive Optimizers

URL: http://arxiv.org/abs/2507.01129v1
Date: Tue, 01 Jul 2025 18:44:35 GMT
Title: On Design Principles for Private Adaptive Optimizers
Authors: Arun Ganesh, Brendan McMahan, Abhradeep Thakurta,
Abstract summary: Spherical noise added to gradients in differentially private (DP) training undermines the performance of adaptives like Ada and Adam.<n>Recent works have proposed algorithms to address this challenge.
Score: 11.368374833874313
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The spherical noise added to gradients in differentially private (DP) training undermines the performance of adaptive optimizers like AdaGrad and Adam, and hence many recent works have proposed algorithms to address this challenge. However, the empirical results in these works focus on simple tasks and models and the conclusions may not generalize to model training in practice. In this paper we survey several of these variants, and develop better theoretical intuition for them as well as perform empirical studies comparing them. We find that a common intuition of aiming for unbiased estimates of second moments of gradients in adaptive optimizers is misguided, and instead that a simple technique called scale-then-privatize (which does not achieve unbiased second moments) has more desirable theoretical behaviors and outperforms all other variants we study on a small-scale language model training task. We additionally argue that scale-then-privatize causes the noise addition to better match the application of correlated noise mechanisms which are more desirable to use in practice.

Related papers

Denoising Pre-Training and Customized Prompt Learning for Efficient Multi-Behavior Sequential Recommendation [69.60321475454843]
We propose DPCPL, the first pre-training and prompt-tuning paradigm tailored for Multi-Behavior Sequential Recommendation. In the pre-training stage, we propose a novel Efficient Behavior Miner (EBM) to filter out the noise at multiple time scales. Subsequently, we propose to tune the pre-trained model in a highly efficient manner with the proposed Customized Prompt Learning (CPL) module.
arXiv Detail & Related papers (2024-08-21T06:48:38Z)
Optimizing Adaptive Experiments: A Unified Approach to Regret Minimization and Best-Arm Identification [9.030753181146176]
We propose a unified model that simultaneously accounts for within-experiment performance and post-experiment outcomes. We show that substantial reductions in experiment duration can often be achieved with minimal impact on both within-experiment and post-experiment regret.
arXiv Detail & Related papers (2024-02-16T11:27:48Z)
Boosting Fair Classifier Generalization through Adaptive Priority Reweighing [59.801444556074394]
A performance-promising fair algorithm with better generalizability is needed. This paper proposes a novel adaptive reweighing method to eliminate the impact of the distribution shifts between training and test data on model generalizability.
arXiv Detail & Related papers (2023-09-15T13:04:55Z)
An Empirical Analysis of Parameter-Efficient Methods for Debiasing Pre-Trained Language Models [55.14405248920852]
We conduct experiments with prefix tuning, prompt tuning, and adapter tuning on different language models and bias types to evaluate their debiasing performance. We find that the parameter-efficient methods are effective in mitigating gender bias, where adapter tuning is consistently the most effective. We also find that prompt tuning is more suitable for GPT-2 than BERT, and racial and religious bias is less effective when it comes to racial and religious bias.
arXiv Detail & Related papers (2023-06-06T23:56:18Z)
Towards Compute-Optimal Transfer Learning [82.88829463290041]
We argue that zero-shot structured pruning of pretrained models allows them to increase compute efficiency with minimal reduction in performance. Our results show that pruning convolutional filters of pretrained models can lead to more than 20% performance improvement in low computational regimes.
arXiv Detail & Related papers (2023-04-25T21:49:09Z)
Empirical Study on Optimizer Selection for Out-of-Distribution Generalization [16.386766049451317]
Modern deep learning systems do not generalize well when the test data distribution is slightly different to the training data distribution. In this study, we examine the performance of popular first-order generalizations for different classes of distributional shift.
arXiv Detail & Related papers (2022-11-15T23:56:30Z)
Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR) Specifically, we propose to inject the standard Gaussian noise and regularize hidden representations of the fine-tuned model. We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
arXiv Detail & Related papers (2022-06-12T04:42:49Z)
AdaTerm: Adaptive T-Distribution Estimated Robust Moments for Noise-Robust Stochastic Gradient Optimization [14.531550983885772]
We propose AdaTerm, a novel approach that incorporates the Student's t-distribution to derive not only the first-order moment but also all associated statistics. This provides a unified treatment of the optimization process, offering a comprehensive framework under the statistical model of the t-distribution for the first time.
arXiv Detail & Related papers (2022-01-18T03:13:19Z)
Adaptive Optimization with Examplewise Gradients [23.504973357538418]
We propose a new, more general approach to the design of gradient-based optimization methods for machine learning. In this new framework, iterations assume access to a batch of estimates per parameter, rather than a single estimate. This better reflects the information that is actually available in typical machine learning setups.
arXiv Detail & Related papers (2021-11-30T23:37:01Z)
Adapting by Pruning: A Case Study on BERT [9.963251767416967]
We propose a novel model adaptation paradigm, adapting by pruning, which prunes neural connections in the pre-trained model to optimise the performance on the target task. We formulate adapting-by-pruning as an optimisation problem with a differentiable loss and propose an efficient algorithm to prune the model. Results suggest that our method can prune up to 50% weights in BERT while yielding similar performance compared to the fine-tuned full model.
arXiv Detail & Related papers (2021-05-07T15:51:08Z)
Adaptive Sampling for Minimax Fair Classification [40.936345085421955]
We propose an adaptive sampling algorithm based on the principle of optimism, and derive theoretical bounds on its performance. By deriving algorithm independent lower-bounds for a specific class of problems, we show that the performance achieved by our adaptive scheme cannot be improved in general.
arXiv Detail & Related papers (2021-03-01T04:58:27Z)
Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.