Characterizing the Training Dynamics of Private Fine-tuning with Langevin diffusion
- URL: http://arxiv.org/abs/2402.18905v2
- Date: Fri, 07 Nov 2025 18:52:10 GMT
- Title: Characterizing the Training Dynamics of Private Fine-tuning with Langevin diffusion
- Authors: Shuqi Ke, Charlie Hou, Sewoong Oh, Giulia Fanti
- Abstract summary: We show, through both theoretical and empirical results, that differentially private full fine-tuning (DP-FFT) can distort pre-trained backbone features. We prove that a sequential fine-tuning strategy can mitigate the feature distortion.
- Score: 37.98959061338993
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We show, through both theoretical and empirical results, that differentially private full fine-tuning (DP-FFT) can distort pre-trained backbone features. We identify the cause of the distortion as the misalignment between the pre-trained backbone and the randomly initialized linear head. We prove that a sequential fine-tuning strategy can mitigate the feature distortion: first-linear-probing-then-fine-tuning (DP-LP-FFT). A new approximation scheme allows us to derive approximate upper and lower bounds on the training loss of DP-LP and DP-FFT, in a simple but canonical setting of 2-layer neural networks with ReLU activation. Experiments on real-world datasets and architectures are consistent with our theoretical insights. We also derive new upper bounds for 2-layer linear networks without the approximation. Moreover, our theory suggests a trade-off in privacy-budget allocation for multi-phase fine-tuning methods like DP-LP-FFT.
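The two-phase strategy is straightforward to prototype. Below is a minimal sketch of DP-LP-FFT in the paper's canonical setting (a 2-layer ReLU network with a linear head), assuming a hand-rolled DP-SGD step with per-example clipping and Gaussian noise; the `dp_sgd_step` helper, dimensions, and hyperparameters are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

CLIP, LR = 1.0, 0.1  # hypothetical clipping norm and learning rate

def dp_sgd_step(model, params, xb, yb, sigma, loss_fn=nn.MSELoss()):
    """One DP-SGD step on `params`: per-example clipping + Gaussian noise."""
    grads = [torch.zeros_like(p) for p in params]
    for x, y in zip(xb, yb):                         # per-example gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in params))
        c = min(1.0, CLIP / (norm.item() + 1e-12))   # clip to norm <= CLIP
        for g, p in zip(grads, params):
            g += c * p.grad
    with torch.no_grad():
        for g, p in zip(grads, params):
            g += sigma * CLIP * torch.randn_like(g)  # Gaussian mechanism
            p -= LR * g / len(xb)

# The paper's canonical setting: a 2-layer ReLU network with a linear head.
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
head = nn.Linear(32, 1)
model = nn.Sequential(backbone, head)
x, y = torch.randn(64, 16), torch.randn(64, 1)

# Phase 1 (DP-LP): update only the randomly initialized head.
for _ in range(50):
    dp_sgd_step(model, list(head.parameters()), x, y, sigma=1.2)

# Phase 2 (DP-FFT): update everything; the head is now roughly aligned
# with the backbone, mitigating feature distortion.
for _ in range(50):
    dp_sgd_step(model, list(model.parameters()), x, y, sigma=1.2)
```

Splitting the iteration counts (or noise multipliers) between the two phases corresponds to the privacy-budget allocation trade-off the theory describes.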
Related papers
- Neural Collapse under Gradient Flow on Shallow ReLU Networks for Orthogonally Separable Data [52.737775129027575]
We show that gradient flow on a two-layer ReLU network for classifying orthogonally separable data provably exhibits Neural Collapse (NC). We reveal the role of the implicit bias of the training dynamics in facilitating the emergence of NC.
arXiv Detail & Related papers (2025-10-24T01:36:19Z) - On the Performance of Differentially Private Optimization with Heavy-Tail Class Imbalance [1.1218431616419589]
We show that, in a stylized model, optimizing with Gradient Descent with differential privacy (DP-GD) suffers when learning low-frequency classes. In particular, DP-AdamBC, which removes the DP bias from the estimate of the loss curvature, is a crucial component for avoiding the ill-conditioning caused by heavy-tail class imbalance.
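As a rough illustration of the bias-correction idea (a sketch under our own assumptions, not the authors' code): the Gaussian noise DP adds to each gradient inflates Adam's second-moment estimate by a known per-coordinate variance, which can be subtracted before preconditioning. The function name and the exact form of the correction below are illustrative.

```python
import numpy as np

def dp_adam_bc_step(p, noisy_grad, m, v, t, sigma, clip, batch,
                    lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam step on a DP-noised gradient, with the known noise variance
    subtracted from the second-moment estimate (the 'BC' in DP-AdamBC)."""
    m = b1 * m + (1 - b1) * noisy_grad
    v = b2 * v + (1 - b2) * noisy_grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    noise_var = (sigma * clip / batch) ** 2        # E[noise^2] per coordinate
    v_corr = np.maximum(v_hat - noise_var, eps)    # debiased curvature proxy
    return p - lr * m_hat / np.sqrt(v_corr), m, v
```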
arXiv Detail & Related papers (2025-07-14T17:57:08Z) - A Principled Bayesian Framework for Training Binary and Spiking Neural Networks [1.6658912537684454]
Spiking Bayesian Neural Networks (SBNNs) provide a variational inference framework that uses posterior noise to train Binary and Spiking Neural Networks with IW-ST. By linking low-bias conditions, vanishing gradients, and the KL term, we enable training of deep residual networks without normalisation.
arXiv Detail & Related papers (2025-05-23T14:33:20Z) - Fixed-Mean Gaussian Processes for Post-hoc Bayesian Deep Learning [11.22428369342346]
We introduce Fixed-Mean Gaussian Processes (FMGPs), a novel family of sparse variational Gaussian processes (GPs) in which the posterior mean is fixed to any continuous function when a universal kernel is used.
Specifically, we fix the mean of this GP to the output of the pre-trained DNN, allowing our approach to effectively fit the GP's predictive variances to estimate the prediction uncertainty.
Experimental results demonstrate that FMGP improves both uncertainty estimation and computational efficiency when compared to state-of-the-art methods.
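To make the decoupling concrete, here is a minimal sketch of ours (using an exact GP rather than the sparse variational one the paper actually fits): the predictive mean is pinned to the frozen network's output, while the predictive variance comes from a standard GP posterior. `f_dnn` stands in for any pre-trained network.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """RBF kernel matrix between row-stacked inputs a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def fixed_mean_gp_predict(f_dnn, x_train, x_test, noise=0.1):
    """Predictive mean pinned to the DNN; variance from the GP posterior."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_test, x_train)
    mean = f_dnn(x_test)             # fixed to the pre-trained network output
    var = 1.0 - np.einsum('ij,jk,ik->i', Ks, np.linalg.inv(K), Ks)
    return mean, var                 # k(x, x) = 1 for this RBF kernel

mean, var = fixed_mean_gp_predict(lambda x: x.sum(1),
                                  np.random.randn(50, 3), np.random.randn(5, 3))
```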
arXiv Detail & Related papers (2024-12-05T14:17:16Z) - Convergence of Implicit Gradient Descent for Training Two-Layer Physics-Informed Neural Networks [4.554284689395686]
Implicit gradient descent (IGD) outperforms the common gradient descent (GD) algorithm in handling certain multi-scale problems. We show that IGD converges to a globally optimal solution at a linear convergence rate.
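A tiny sketch of why the implicit update helps with stiffness (our own toy example, not from the paper): for a quadratic loss the backward-Euler step can be solved in closed form, and it remains stable at step sizes where explicit GD diverges.

```python
import numpy as np

H = np.diag([25.0, 0.04])   # stiff, multi-scale curvature
lr = 0.1                    # explicit GD diverges: |1 - 0.1 * 25| = 1.5 > 1

theta = np.array([1.0, 1.0])
for _ in range(100):
    # Implicit (backward-Euler) step: theta' = theta - lr * H @ theta'
    # <=> (I + lr * H) theta' = theta, solvable exactly for a quadratic.
    theta = np.linalg.solve(np.eye(2) + lr * H, theta)
print(theta)                # decays toward the optimum at 0, no blow-up
```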
arXiv Detail & Related papers (2024-07-03T06:10:41Z) - Noise-Aware Differentially Private Regression via Meta-Learning [25.14514068630219]
Differential Privacy (DP) is the gold standard for protecting user privacy, but standard DP mechanisms significantly impair performance.
One approach to mitigating this issue is pre-training models on simulated data before DP learning on the private data.
In this work we go a step further, using simulated data to train a meta-learning model that combines the Convolutional Conditional Neural Process (ConvCNP) with an improved functional DP mechanism.
arXiv Detail & Related papers (2024-06-12T18:11:24Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of lookahead, i.e. linear interpolation of iterates, as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
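A minimal sketch of lookahead-style linear interpolation (our generic version, not the paper's exact scheme): run a few fast inner steps, then pull the slow weights part-way toward the fast ones.

```python
import numpy as np

def lookahead(grad_fn, theta, lr=0.1, alpha=0.5, k=5, outer=100):
    """k fast GD steps, then interpolate the slow weights toward the
    fast ones; the averaging damps oscillatory/expansive dynamics."""
    slow = np.asarray(theta, dtype=float)
    for _ in range(outer):
        fast = slow.copy()
        for _ in range(k):
            fast -= lr * grad_fn(fast)       # inner optimizer
        slow = slow + alpha * (fast - slow)  # linear interpolation
    return slow

# Rotational field from the bilinear game min_x max_y xy: plain GD spirals
# outward here, while the interpolated iterates contract toward (0, 0).
rot = lambda v: np.array([v[1], -v[0]])
print(lookahead(rot, [1.0, 1.0]))
```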
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion [83.90492831583997]
We show that a batch-normalized network can keep the optimal signal propagation properties, but avoid exploding gradients in depth.
We use a Multi-Layer Perceptron (MLP) with linear activations and batch normalization that provably has bounded gradients at any depth.
We also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
arXiv Detail & Related papers (2023-10-03T12:35:02Z) - Domain-Aware Fine-Tuning: Enhancing Neural Network Adaptability [4.671615537573023]
Domain-Aware Fine-Tuning (DAFT) is a novel approach that combines batch-normalization conversion with an integration of linear probing and fine-tuning.
Our method significantly mitigates feature distortion and achieves improved model performance on both in-distribution and out-of-distribution datasets.
arXiv Detail & Related papers (2023-08-15T12:08:43Z) - Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
However, PINNs can be trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ an implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Arbitrary Decisions are a Hidden Cost of Differentially Private Training [7.560688419767116]
Mechanisms used in machine learning often aim to guarantee differential privacy (DP) during model training.
Practical DP-ensuring training methods use randomization when fitting model parameters to privacy-sensitive data.
For a given input example, the output predicted by equally-private models depends on the randomness used in training.
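This is easy to probe empirically. A minimal sketch of ours follows, where `train_fn` and `predict` are hypothetical stand-ins for any DP training pipeline and model interface: retrain under identical privacy parameters but different seeds, then measure per-input agreement.

```python
import numpy as np

def agreement_per_input(train_fn, x_test, n_models=20):
    """Train n equally-private models differing only in seed; report the
    fraction that agree with the majority label for each test input."""
    preds = np.stack([train_fn(seed).predict(x_test)
                      for seed in range(n_models)])   # (n_models, n_inputs)
    majority = (preds.mean(axis=0) >= 0.5).astype(preds.dtype)
    return (preds == majority).mean(axis=0)           # 1.0 = no arbitrariness
```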
arXiv Detail & Related papers (2023-02-28T12:13:43Z) - An Adaptive and Stability-Promoting Layerwise Training Approach for Sparse Deep Neural Network Architecture [0.0]
This work presents a two-stage adaptive framework for developing deep neural network (DNN) architectures that generalize well for a given training data set.
In the first stage, a layerwise training approach is adopted where a new layer is added each time and trained independently by freezing parameters in the previous layers.
We introduce an epsilon-delta stability-promoting concept as a desirable property for a learning algorithm and show that employing manifold regularization yields an epsilon-delta stability-promoting algorithm.
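A minimal sketch of the layerwise stage (names, widths, and the training loop are illustrative, not the authors' code): each call appends one hidden block, freezes everything trained so far, and fits only the new block plus a fresh linear head.

```python
import torch
import torch.nn as nn

def grow_and_train(blocks, in_dim, width, x, y, steps=200, lr=1e-2):
    """Append one hidden block, freeze previous layers, train the rest."""
    for b in blocks:
        for p in b.parameters():
            p.requires_grad_(False)          # freeze previously trained layers
    blocks.append(nn.Sequential(nn.Linear(in_dim, width), nn.ReLU()))
    head = nn.Linear(width, y.shape[1])
    model = nn.Sequential(*blocks, head)
    opt = torch.optim.Adam(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()
    return blocks

x, y = torch.randn(128, 8), torch.randn(128, 1)
blocks = grow_and_train([], 8, 32, x, y)       # stage 1: first hidden layer
blocks = grow_and_train(blocks, 32, 32, x, y)  # stage 2: stage 1 frozen
```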
arXiv Detail & Related papers (2022-11-13T09:51:16Z) - Normalized/Clipped SGD with Perturbation for Differentially Private Non-Convex Optimization [94.06564567766475]
DP-SGD and DP-NSGD mitigate the risk of large models memorizing sensitive training data.
We show that these two algorithms achieve similar best accuracy while DP-NSGD is comparatively easier to tune than DP-SGD.
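The two per-example transforms being compared are easy to state side by side. A sketch under our own assumptions (the regularizer `r` and scaling follow the standard regularized-normalization recipe, not necessarily the paper's exact constants):

```python
import numpy as np

def privatize(per_example_grads, C, sigma, mode="clip", r=1e-2):
    """Per-example transform, then sum + Gaussian noise (sensitivity C)."""
    out = []
    for g in per_example_grads:
        n = np.linalg.norm(g)
        if mode == "clip":                    # DP-SGD: norm capped at C
            out.append(g * min(1.0, C / (n + 1e-12)))
        else:                                 # DP-NSGD: regularized normalize
            out.append(C * g / (n + r))
    noise = sigma * C * np.random.randn(*per_example_grads[0].shape)
    return (np.sum(out, axis=0) + noise) / len(per_example_grads)
```

Normalization makes every example contribute a near-fixed-norm update regardless of its raw gradient scale, which is one intuition for why DP-NSGD is less sensitive to the choice of C than clipping.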
arXiv Detail & Related papers (2022-06-27T03:45:02Z) - Efficient Private SCO for Heavy-Tailed Data via Averaged Clipping [40.69950711262191]
We consider stochastic convex optimization for heavy-tailed data with the guarantee of being differentially private (DP).
We establish new convergence results and improved complexity bounds for the proposed algorithm called AClipped-dpSGD for constrained and unconstrained convex problems.
arXiv Detail & Related papers (2022-06-27T01:39:15Z) - Sample-Efficient Optimisation with Probabilistic Transformer Surrogates [66.98962321504085]
This paper investigates the feasibility of employing state-of-the-art probabilistic transformers in Bayesian optimisation.
We observe two drawbacks stemming from their training procedure and loss definition, hindering their direct deployment as proxies in black-box optimisation.
We introduce two components: 1) a BO-tailored training prior supporting non-uniformly distributed points, and 2) a novel approximate posterior regulariser trading-off accuracy and input sensitivity to filter favourable stationary points for improved predictive performance.
arXiv Detail & Related papers (2022-05-27T11:13:17Z) - What You See is What You Get: Distributional Generalization for Algorithm Design in Deep Learning [12.215964287323876]
We investigate and leverage a connection between Differential Privacy (DP) and the notion of Distributional Generalization (DG).
We introduce new conceptual tools for designing deep-learning methods that bypass "pathologies" of standard stochastic gradient descent (SGD).
arXiv Detail & Related papers (2022-04-07T05:41:40Z) - Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution [100.01469697743322]
Fine-tuning can achieve worse accuracy than linear probing when the pretrained features are good and the distribution shift is large.
We show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting.
Our analysis suggests that the easy two-step strategy of linear probing then full fine-tuning combines the benefits of both fine-tuning and linear probing.
arXiv Detail & Related papers (2022-02-21T09:03:34Z) - Optimization-Based Separations for Neural Networks [57.875347246373956]
We show that gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations.
This is the first optimization-based separation result where the approximation benefits of the stronger architecture provably manifest in practice.
arXiv Detail & Related papers (2021-12-04T18:07:47Z) - On the Practicality of Differential Privacy in Federated Learning by Tuning Iteration Times [51.61278695776151]
Federated Learning (FL) is well known for its privacy protection when training machine learning models among distributed clients collaboratively.
Recent studies have pointed out that naive FL is susceptible to gradient leakage attacks.
Differential Privacy (DP) emerges as a promising countermeasure to defend against gradient leakage attacks.
arXiv Detail & Related papers (2021-01-11T19:43:12Z) - Align, then memorise: the dynamics of learning with feedback alignment [12.587037358391418]
Direct Feedback Alignment (DFA) is an efficient alternative to the ubiquitous backpropagation algorithm for training deep neural networks.
DFA successfully trains state-of-the-art models such as Transformers, but it notoriously fails to train convolutional networks.
Here, we propose a theory for the success of DFA.
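A minimal DFA sketch for a 2-hidden-layer MLP (our own illustration; shapes and scales are arbitrary): the output error is sent straight to every hidden layer through fixed random matrices B1 and B2, replacing backpropagation's transposed weight matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, k, lr = 8, 32, 2, 0.05
W1 = rng.normal(0, 0.3, (h, d))
W2 = rng.normal(0, 0.3, (h, h))
W3 = rng.normal(0, 0.3, (k, h))
B1 = rng.normal(0, 0.3, (h, k))   # fixed random feedback, never trained
B2 = rng.normal(0, 0.3, (h, k))

def dfa_step(x, y):
    global W1, W2, W3
    a1 = np.tanh(W1 @ x)
    a2 = np.tanh(W2 @ a1)
    e = W3 @ a2 - y                   # output error
    d2 = (B2 @ e) * (1 - a2 ** 2)     # random projection, not W3.T @ e
    d1 = (B1 @ e) * (1 - a1 ** 2)     # skips layer 2 entirely
    W3 -= lr * np.outer(e, a2)
    W2 -= lr * np.outer(d2, a1)
    W1 -= lr * np.outer(d1, x)
```

In the "align, then memorise" picture, training first rotates the forward weights so that the fixed random projections approximate true gradient directions, after which learning proceeds much as in backpropagation.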
arXiv Detail & Related papers (2020-11-24T22:21:27Z) - Private Stochastic Non-Convex Optimization: Adaptive Algorithms and
Tighter Generalization Bounds [72.63031036770425]
We propose differentially private (DP) algorithms for stochastic non-convex optimization.
We demonstrate, on two popular deep learning tasks, the empirical advantages over standard gradient methods.
arXiv Detail & Related papers (2020-06-24T06:01:24Z)