DPFormer: Learning Differentially Private Transformer on Long-Tailed
Data
- URL: http://arxiv.org/abs/2305.17633v1
- Date: Sun, 28 May 2023 05:00:07 GMT
- Title: DPFormer: Learning Differentially Private Transformer on Long-Tailed
Data
- Authors: Youlong Ding, Xueyang Wu, Hao Wang and Weike Pan
- Abstract summary: The Transformer has emerged as a versatile and effective architecture with broad applications.
However, it remains an open problem how to efficiently train a Transformer model of high utility with differential privacy guarantees.
In this paper, we identify two key challenges in learning differentially private Transformers, i.e., heavy computation overhead due to per-sample gradient clipping and unintentional attention distraction within the attention mechanism.
We propose DPFormer, equipped with Phantom Clipping and Re-Attention Mechanism, to address these challenges.
- Score: 6.848321493051996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer has emerged as a versatile and effective architecture with
broad applications. However, it remains an open problem how to
efficiently train a Transformer model of high utility with differential privacy
guarantees. In this paper, we identify two key challenges in learning
differentially private Transformers, i.e., heavy computation overhead due to
per-sample gradient clipping and unintentional attention distraction within the
attention mechanism. In response, we propose DPFormer, equipped with Phantom
Clipping and Re-Attention Mechanism, to address these challenges. Our
theoretical analysis shows that DPFormer can reduce computational costs during
gradient clipping and effectively mitigate attention distraction (which could
obstruct the training process and lead to a significant performance drop,
especially in the presence of long-tailed data). Such analysis is further
corroborated by empirical results on two real-world datasets, demonstrating the
efficiency and effectiveness of the proposed DPFormer.
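As context for the clipping overhead mentioned in the abstract, below is a minimal sketch of vanilla per-sample gradient clipping in DP-SGD, the baseline whose per-example backward passes Phantom Clipping is designed to avoid. This is not DPFormer's method; the model, loss function, and hyperparameters are placeholders.

```python
# Vanilla per-sample clipping in DP-SGD (a baseline, not Phantom Clipping):
# each example gets its own backward pass, its gradient is clipped to norm C,
# the clipped gradients are summed, and Gaussian noise is added before the
# update. The per-example backward passes are the overhead DPFormer targets.
import torch

def dp_sgd_step(model, loss_fn, xs, ys, clip_norm=1.0, noise_mult=1.0, lr=0.1):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in zip(xs, ys):                          # one example at a time
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (total + 1e-12)).clamp(max=1.0)   # clip to C
        for s, g in zip(summed, grads):
            s.add_(g * scale)

    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(s) * noise_mult * clip_norm
            p.add_(-(lr / len(xs)) * (s + noise))     # noisy averaged update
```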
Related papers
- Dual-Priv Pruning : Efficient Differential Private Fine-Tuning in Multimodal Large Language Models [21.598534853947676]
We propose a framework that employs two complementary pruning mechanisms for Differential Privacy (DP) fine-tuning in MLLMs. Our approach consistently utilizes less memory than standard DP-SGD. To the best of our knowledge, we are the first to explore DP fine-tuning in MLLMs.
arXiv Detail & Related papers (2025-06-08T10:33:01Z) - Transformer Meets Twicing: Harnessing Unattended Residual Information [2.1605931466490795]
Transformer-based deep learning models have achieved state-of-the-art performance across numerous language and vision tasks.
While the self-attention mechanism has proven capable of handling complex data patterns, it has been observed that the representational capacity of the attention matrix degrades significantly across transformer layers.
We propose Twicing Attention, a novel attention mechanism that uses the kernel twicing procedure from nonparametric regression to alleviate the low-pass behavior of the associated NLM smoothing.
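One reading of kernel twicing: smooth once, then smooth the residual and add the correction back, which for a row-stochastic attention matrix A amounts to using 2A - A^2 in place of A. The sketch below illustrates that reading only; the paper's exact formulation may differ.

```python
# Sketch of twicing applied to an attention matrix: smooth once with A,
# then smooth the residual (V - A V) and add the correction back, giving
# (2A - A @ A) V. Illustrative reading of kernel twicing, not necessarily
# the paper's exact construction.
import torch

def twicing_attention(q, k, v):
    # q, k, v: (batch, seq, dim)
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    A = scores.softmax(dim=-1)             # row-stochastic smoothing matrix
    first = A @ v                          # ordinary attention output
    residual = v - first                   # what the smoother missed
    return first + A @ residual            # = (2A - A @ A) @ v
```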
arXiv Detail & Related papers (2025-03-02T01:56:35Z) - FLARES: Fast and Accurate LiDAR Multi-Range Semantic Segmentation [52.89847760590189]
3D scene understanding is a critical yet challenging task in autonomous driving.
Recent methods leverage the range-view representation to improve processing efficiency.
We re-design the workflow for range-view-based LiDAR semantic segmentation.
arXiv Detail & Related papers (2025-02-13T12:39:26Z) - BEExformer: A Fast Inferencing Binarized Transformer with Early Exits [2.7651063843287718]
We introduce Binarized Early Exit Transformer (BEExformer), the first-ever selective learning-based transformer integrating Binarization-Aware Training (BAT) with Early Exit (EE). BAT employs a differentiable second-order approximation to the sign function, enabling gradients that capture both the sign and magnitude of the weights. The EE mechanism hinges on a fractional reduction in entropy among intermediate transformer blocks with soft-routing loss estimation. This accelerates inference by reducing FLOPs by 52.08% and even improves accuracy by 2.89% by resolving the "overthinking" problem inherent in deep networks.
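A hedged sketch of what a differentiable surrogate for the sign function can look like: forward with sign(), backward through a piecewise-quadratic (second-order) approximation whose gradient depends on the weight magnitude, in the spirit of Bi-Real Net's ApproxSign. BEExformer's exact approximation may differ.

```python
# Binarization with a trainable surrogate: the forward pass uses sign(),
# the backward pass uses the gradient of a piecewise-quadratic approximation
# (2 - 2|w| inside [-1, 1], zero outside). A stand-in illustration, not
# BEExformer's specific second-order approximation.
import torch

class ApproxSign(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        surrogate_grad = torch.clamp(2.0 - 2.0 * w.abs(), min=0.0)
        return grad_out * surrogate_grad

def binarize(w: torch.Tensor) -> torch.Tensor:
    return ApproxSign.apply(w)
```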
arXiv Detail & Related papers (2024-12-06T17:58:14Z) - CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new detextbfCoupled dutextbfAl-interactive lineatextbfR atttextbfEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting a decoupled learning way and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z) - DiSK: Differentially Private Optimizer with Simplified Kalman Filter for Noise Reduction [57.83978915843095]
This paper introduces DiSK, a novel framework designed to significantly enhance the performance of differentially private optimizers.
To ensure practicality for large-scale training, we simplify the Kalman filtering process, minimizing its memory and computational demands.
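As a rough illustration only (not DiSK's formulation), a fixed-gain, Kalman-style recursion can denoise a stream of privatized gradients by correcting a running estimate toward each new noisy gradient:

```python
# A fixed-gain (steady-state) Kalman-style filter over privatized gradients:
# the filtered estimate is corrected toward each new noisy gradient by a
# constant gain. Generic sketch of denoising DP gradients with a simplified
# Kalman recursion, not DiSK's algorithm.
import torch

class GradientFilter:
    def __init__(self, params, gain: float = 0.3):
        self.gain = gain
        self.state = [torch.zeros_like(p) for p in params]  # filtered grads

    def update(self, noisy_grads):
        filtered = []
        for s, g in zip(self.state, noisy_grads):
            s += self.gain * (g - s)       # correction toward new observation
            filtered.append(s.clone())
        return filtered
```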
arXiv Detail & Related papers (2024-10-04T19:30:39Z) - Delving into Differentially Private Transformer [7.474126823543351]
This paper delves into the problem of training Transformer models with differential privacy.
Our treatment is modular: the logic is to 'reduce' the problem of training a DP Transformer to the more basic problem of training DP vanilla neural nets.
arXiv Detail & Related papers (2024-05-28T14:04:09Z) - DiffsFormer: A Diffusion Transformer on Stock Factor Augmentation [36.75453713794983]
We introduce a diffusion model with a Transformer architecture (DiffsFormer) to generate stock factors.
When presented with a specific downstream task, we employ DiffsFormer to augment the training procedure by editing existing samples.
The proposed method achieves relative improvements of 7.2% and 27.8% in annualized return ratio for the respective datasets.
arXiv Detail & Related papers (2024-02-05T03:54:36Z) - AGaLiTe: Approximate Gated Linear Transformers for Online Reinforcement Learning [7.886461196772644]
We introduce alternatives to the transformer self-attention mechanism that offer context-independent inference cost.
Compared with a state-of-the-art architecture, GTrXL, inference in our approach is at least 40% cheaper while reducing memory use by more than 50%.
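To illustrate why such alternatives can have context-independent inference cost, here is a generic linear-attention recurrence that keeps a fixed-size running summary of past keys and values; AGaLiTe's gated, approximate variant differs in its details.

```python
# Generic linear-attention recurrence: maintain a running summary
# S = sum_t phi(k_t) v_t^T and z = sum_t phi(k_t), so each new token costs
# O(d^2) regardless of context length. Illustration only, not AGaLiTe.
import torch

def phi(x):                              # simple positive feature map
    return torch.nn.functional.elu(x) + 1.0

class LinearAttentionState:
    def __init__(self, dim):
        self.S = torch.zeros(dim, dim)   # running sum of phi(k) v^T
        self.z = torch.zeros(dim)        # running sum of phi(k)

    def step(self, q, k, v):             # q, k, v: (dim,)
        fk = phi(k)
        self.S += torch.outer(fk, v)
        self.z += fk
        fq = phi(q)
        return (fq @ self.S) / (fq @ self.z + 1e-6)
```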
arXiv Detail & Related papers (2023-10-24T10:51:50Z) - Selectivity Drives Productivity: Efficient Dataset Pruning for Enhanced
Transfer Learning [66.20311762506702]
Dataset pruning (DP) has emerged as an effective way to improve data efficiency.
We propose two new DP methods, label mapping and feature mapping, for supervised and self-supervised pretraining settings.
We show that source data classes can be pruned by 40% to 80% without sacrificing downstream performance.
arXiv Detail & Related papers (2023-10-13T00:07:49Z) - Leveraging the Power of Data Augmentation for Transformer-based Tracking [64.46371987827312]
We propose two data augmentation methods customized for tracking.
First, we optimize existing random cropping via a dynamic search radius mechanism and simulation for boundary samples.
Second, we propose a token-level feature mixing augmentation strategy, which strengthens the model against challenges like background interference.
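A hedged sketch of what token-level feature mixing could look like: blend a random subset of search-region tokens with tokens from a distractor frame to simulate background interference. The function and ratios below are illustrative, not the paper's exact scheme.

```python
# Illustrative token-level feature mixing: blend a random subset of
# search-region tokens with tokens from a distractor frame during training.
# Generic sketch, not the paper's exact augmentation.
import torch

def mix_tokens(search_tokens, distractor_tokens, ratio=0.2, alpha=0.5):
    # search_tokens, distractor_tokens: (batch, num_tokens, dim)
    b, n, _ = search_tokens.shape
    num_mix = max(1, int(ratio * n))
    mixed = search_tokens.clone()
    for i in range(b):
        idx = torch.randperm(n)[:num_mix]          # tokens to perturb
        mixed[i, idx] = ((1 - alpha) * search_tokens[i, idx]
                         + alpha * distractor_tokens[i, idx])
    return mixed
```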
arXiv Detail & Related papers (2023-09-15T09:18:54Z) - Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable
Transformers [107.3726071306935]
We propose a new plug-and-play training framework, SMoE-Dropout, to enable scaling transformers to better accuracy in their full capacity without collapse.
SMoE-Dropout consists of a randomly initialized and fixed router network that activates experts, gradually increasing the number of activated experts as training progresses.
Our experiments demonstrate the superior performance and substantial computation savings of SMoE-Dropout, compared to dense training baselines with equivalent parameter counts.
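A minimal sketch of the routing scheme described above, with a frozen, randomly initialized router and an activated-expert count that grows with training progress; the expert modules and the linear schedule are placeholders.

```python
# SMoE-Dropout-style routing sketch: a frozen random router scores experts,
# and the number of activated experts k grows with training progress.
# Experts are assumed to be dim -> dim modules; the schedule is illustrative.
import torch

class ScheduledSparseMoE(torch.nn.Module):
    def __init__(self, dim, experts, max_k):
        super().__init__()
        self.router = torch.nn.Linear(dim, len(experts), bias=False)
        self.router.weight.requires_grad_(False)   # random and fixed
        self.experts = torch.nn.ModuleList(experts)
        self.max_k = max_k

    def forward(self, x, progress):                 # x: (tokens, dim), progress in [0, 1]
        k = max(1, int(round(progress * self.max_k)))
        scores = self.router(x)                     # (tokens, num_experts)
        topk = scores.topk(k, dim=-1)
        weights = topk.values.softmax(dim=-1)       # mix the chosen experts
        out = torch.zeros_like(x)
        for j in range(k):
            idx = topk.indices[..., j]              # chosen expert per token
            for e in range(len(self.experts)):
                mask = idx == e
                if mask.any():
                    out[mask] += (weights[..., j][mask].unsqueeze(-1)
                                  * self.experts[e](x[mask]))
        return out
```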
arXiv Detail & Related papers (2023-03-02T22:12:51Z) - Analyzing the Performance of Deep Encoder-Decoder Networks as Surrogates
for a Diffusion Equation [0.0]
We study the use of encoder-decoder convolutional neural networks (CNNs) as surrogates for steady-state diffusion solvers.
Our results indicate that increasing the size of the training set has a substantial effect on reducing performance fluctuations and overall error.
arXiv Detail & Related papers (2023-02-07T22:53:19Z) - Large Language Models Can Be Strong Differentially Private Learners [70.0317718115406]
Differentially Private (DP) learning has seen limited success for building large deep learning models of text.
We show that this performance drop can be mitigated with the use of large pretrained models.
We propose a memory saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients.
arXiv Detail & Related papers (2021-10-12T01:45:27Z)
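The key observation behind such memory-saving clipping, illustrated here for a single linear layer: the per-example weight gradient is an outer product, so its norm factorizes into quantities already available during the backward pass. This single-layer sketch is not the paper's full algorithm.

```python
# For a linear layer y = x @ W.T, the per-example weight gradient is the
# outer product g_i a_i^T, so its Frobenius norm is ||g_i|| * ||a_i||.
# Per-example norms (and hence clipping factors) can thus be computed from
# the layer input and output gradient alone, without materializing
# per-example gradients.
import torch

def per_example_grad_norms(layer_input, grad_output):
    # layer_input: (batch, d_in), grad_output: (batch, d_out)
    return layer_input.norm(dim=1) * grad_output.norm(dim=1)

def clip_factors(norms, clip_norm=1.0):
    return (clip_norm / (norms + 1e-12)).clamp(max=1.0)  # per-example scale
```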