DPFormer: Learning Differentially Private Transformer on Long-Tailed
Data
- URL: http://arxiv.org/abs/2305.17633v1
- Date: Sun, 28 May 2023 05:00:07 GMT
- Title: DPFormer: Learning Differentially Private Transformer on Long-Tailed
Data
- Authors: Youlong Ding, Xueyang Wu, Hao Wang and Weike Pan
- Abstract summary: The Transformer has emerged as a versatile and effective architecture with broad applications.
However, how to efficiently train a Transformer model of high utility with differential privacy guarantees remains an open problem.
In this paper, we identify two key challenges in learning differentially private Transformers: the heavy computational overhead of per-sample gradient clipping and unintentional attention distraction within the attention mechanism.
We propose DPFormer, equipped with Phantom Clipping and Re-Attention Mechanism, to address these challenges.
- Score: 6.848321493051996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer has emerged as a versatile and effective architecture with
broad applications. However, how to efficiently train a Transformer model of
high utility with differential privacy guarantees remains an open problem. In
this paper, we identify two key challenges in learning
differentially private Transformers, i.e., heavy computation overhead due to
per-sample gradient clipping and unintentional attention distraction within the
attention mechanism. In response, we propose DPFormer, equipped with Phantom
Clipping and Re-Attention Mechanism, to address these challenges. Our
theoretical analysis shows that DPFormer can reduce computational costs during
gradient clipping and effectively mitigate attention distraction (which could
obstruct the training process and lead to a significant performance drop,
especially in the presence of long-tailed data). Such analysis is further
corroborated by empirical results on two real-world datasets, demonstrating the
efficiency and effectiveness of the proposed DPFormer.
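To make the clipping overhead concrete, here is a minimal, framework-free sketch of the per-sample clip-and-noise step that standard DP-SGD performs each iteration (the general recipe whose cost DPFormer targets; the function `clip_and_noise` and its parameters are illustrative, not the paper's API):

```python
import math
import random

def clip_and_noise(per_sample_grads, clip_norm, noise_mult, rng):
    """One DP-SGD aggregation step: clip each sample's gradient to
    L2 norm <= clip_norm, sum, then add Gaussian noise calibrated to
    the clipping bound. Needing every per-sample gradient separately
    is the memory/compute bottleneck that motivates Phantom Clipping."""
    dim = len(per_sample_grads[0])
    summed = [0.0] * dim
    for g in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / (norm + 1e-12))  # clip, never amplify
        for i in range(dim):
            summed[i] += g[i] * scale
    # per-coordinate Gaussian noise with std = noise_mult * clip_norm
    noised = [s + rng.gauss(0.0, noise_mult * clip_norm) for s in summed]
    n = len(per_sample_grads)
    return [x / n for x in noised]  # average over the batch

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
avg = clip_and_noise(grads, clip_norm=1.0, noise_mult=0.0, rng=rng)
# with noise disabled: the first gradient is rescaled to norm 1,
# the second is already within the bound, so the average is [0.45, 0.6]
print(avg)
```

Note that the clipping decision depends on each sample's own gradient norm, which is why the per-sample gradients cannot simply be summed on the fly as in non-private training.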
Related papers
- Delving into Differentially Private Transformer [7.474126823543351]
This paper delves into the problem of training Transformer models with differential privacy.
Our treatment is modular: the logic is to 'reduce' the problem of training a DP Transformer to the more basic problem of training DP vanilla neural nets.
arXiv Detail & Related papers (2024-05-28T14:04:09Z) - DiffsFormer: A Diffusion Transformer on Stock Factor Augmentation [36.75453713794983]
We introduce the Diffusion Model to generate stock factors with Transformer architecture (DiffsFormer)
When presented with a specific downstream task, we employ DiffsFormer to augment the training procedure by editing existing samples.
The proposed method achieves relative improvements of 7.2% and 27.8% in annualized return ratio for the respective datasets.
arXiv Detail & Related papers (2024-02-05T03:54:36Z) - PREM: A Simple Yet Effective Approach for Node-Level Graph Anomaly
Detection [65.24854366973794]
Node-level graph anomaly detection (GAD) plays a critical role in identifying anomalous nodes from graph-structured data in domains such as medicine, social networks, and e-commerce.
We introduce a simple method termed PREprocessing and Matching (PREM for short) to improve the efficiency of GAD.
Our approach streamlines GAD, reducing time and memory consumption while maintaining powerful anomaly detection capabilities.
arXiv Detail & Related papers (2023-10-18T02:59:57Z) - Selectivity Drives Productivity: Efficient Dataset Pruning for Enhanced
Transfer Learning [66.20311762506702]
Dataset pruning (DP) has emerged as an effective way to improve data efficiency.
We propose two new DP methods, label mapping and feature mapping, for supervised and self-supervised pretraining settings.
We show that source data classes can be pruned by 40%-80% without sacrificing downstream performance.
arXiv Detail & Related papers (2023-10-13T00:07:49Z) - Leveraging the Power of Data Augmentation for Transformer-based Tracking [64.46371987827312]
We propose two data augmentation methods customized for tracking.
First, we optimize existing random cropping via a dynamic search radius mechanism and simulation for boundary samples.
Second, we propose a token-level feature mixing augmentation strategy, which strengthens the model against challenges like background interference.
arXiv Detail & Related papers (2023-09-15T09:18:54Z) - Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable
Transformers [107.3726071306935]
We propose a new plug-and-play training framework, SMoE-Dropout, to enable scaling transformers to better accuracy in their full capacity without collapse.
SMoE-Dropout consists of a randomly initialized and fixed router network that activates experts, gradually increasing the number of activated experts as training progresses.
Our experiments demonstrate the superior performance and substantial computation savings of SMoE-Dropout, compared to dense training baselines with equivalent parameter counts.
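The fixed-router-plus-growing-k idea described above can be sketched as follows (a hypothetical simplification: the real SMoE-Dropout router operates on transformer hidden states, and all names here are illustrative):

```python
def smoe_dropout_forward(x, experts, router_weights, k):
    """Score experts with a frozen (never-trained) random router,
    activate only the top-k, and average their outputs."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    topk = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    return [sum(experts[i](x)[j] for i in topk) / k for j in range(len(x))]

def activated_experts(step, total_steps, num_experts, k_min=2):
    """Linear schedule: start with k_min experts, end with all of them."""
    frac = step / max(1, total_steps)
    return min(num_experts, k_min + int(frac * (num_experts - k_min)))

# toy usage: 4 experts, where expert i simply scales its input by (i + 1)
experts = [lambda v, s=s: [s * u for u in v] for s in (1, 2, 3, 4)]
router = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
out = smoe_dropout_forward([1.0, 2.0], experts, router, k=2)
```

Because the router is fixed, the same inputs always activate the same experts; only the schedule for k changes over training.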
arXiv Detail & Related papers (2023-03-02T22:12:51Z) - Analyzing the Performance of Deep Encoder-Decoder Networks as Surrogates
for a Diffusion Equation [0.0]
We study the use of encoder-decoder convolutional neural networks (CNNs) as surrogates for steady-state diffusion solvers.
Our results indicate that increasing the size of the training set has a substantial effect on reducing performance fluctuations and overall error.
arXiv Detail & Related papers (2023-02-07T22:53:19Z) - Multi-dataset Training of Transformers for Robust Action Recognition [75.5695991766902]
We study the task of robust feature representations, aiming to generalize well on multiple datasets for action recognition.
Here, we propose a novel multi-dataset training paradigm, MultiTrain, with the design of two new loss terms, namely informative loss and projection loss.
We verify the effectiveness of our method on five challenging datasets, Kinetics-400, Kinetics-700, Moments-in-Time, Activitynet and Something-something-v2.
arXiv Detail & Related papers (2022-09-26T01:30:43Z) - Large Language Models Can Be Strong Differentially Private Learners [70.0317718115406]
Differentially Private (DP) learning has seen limited success for building large deep learning models of text.
We show that this performance drop can be mitigated with the use of large pretrained models.
We propose a memory saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients.
arXiv Detail & Related papers (2021-10-12T01:45:27Z)
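The memory-saving trick for clipping without per-example gradients can be illustrated for a single linear layer (a simplification in the spirit of that paper's ghost-clipping idea; the function name is illustrative). Sample i's gradient for a linear layer is the outer product of its output gradient and its activation, so its Frobenius norm factors into a product of two vector norms and the gradient matrix never needs to be materialized:

```python
import math

def per_example_grad_norms(activations, output_grads):
    """For a linear layer, g_i = output_grads[i] (outer) activations[i],
    hence ||g_i||_F = ||output_grads[i]|| * ||activations[i]||.
    DP-SGD can thus obtain every clipping norm from these cheap vector
    norms, without storing any per-example gradient matrix."""
    norms = []
    for a, b in zip(activations, output_grads):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        norms.append(na * nb)
    return norms

# activation [3, 4] and output grad [1, 0] give a gradient matrix
# [[3, 4], [0, 0]], whose Frobenius norm is indeed 5 = ||[3,4]|| * ||[1,0]||
print(per_example_grad_norms([[3.0, 4.0]], [[1.0, 0.0]]))
```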
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.