SecFormer: Towards Fast and Accurate Privacy-Preserving Inference for Large Language Models
- URL: http://arxiv.org/abs/2401.00793v3
- Date: Thu, 6 Jun 2024 05:22:44 GMT
- Title: SecFormer: Towards Fast and Accurate Privacy-Preserving Inference for Large Language Models
- Authors: Jinglong Luo, Yehong Zhang, Zhuo Zhang, Jiaqi Zhang, Xin Mu, Hui Wang, Yue Yu, Zenglin Xu
- Abstract summary: We introduce an advanced optimization framework called SecFormer to achieve fast and accurate PPI for Transformer models.
In terms of efficiency, SecFormer is 3.56 and 3.58 times faster than Puma for BERT$_{\text{BASE}}$ and BERT$_{\text{LARGE}}$, respectively.
- Score: 34.63351580241698
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the growing use of large language models hosted on cloud platforms to offer inference services, privacy concerns are escalating, especially concerning sensitive data like investment plans and bank account details. Secure Multi-Party Computing (SMPC) emerges as a promising solution to protect the privacy of inference data and model parameters. However, the application of SMPC in Privacy-Preserving Inference (PPI) for large language models, particularly those based on the Transformer architecture, often leads to considerable slowdowns or declines in performance. This is largely due to the multitude of nonlinear operations in the Transformer architecture, which are not well-suited to SMPC and difficult to circumvent or optimize effectively. To address this concern, we introduce an advanced optimization framework called SecFormer, to achieve fast and accurate PPI for Transformer models. By implementing model design optimization, we successfully eliminate the high-cost exponential and maximum operations in PPI without sacrificing model performance. Additionally, we have developed a suite of efficient SMPC protocols that utilize segmented polynomials, Fourier series and Goldschmidt's method to handle other complex nonlinear functions within PPI, such as GeLU, LayerNorm, and Softmax. Our extensive experiments reveal that SecFormer outperforms MPCFormer in performance, showing improvements of $5.6\%$ and $24.2\%$ for BERT$_{\text{BASE}}$ and BERT$_{\text{LARGE}}$, respectively. In terms of efficiency, SecFormer is 3.56 and 3.58 times faster than Puma for BERT$_{\text{BASE}}$ and BERT$_{\text{LARGE}}$, demonstrating its effectiveness and speed.
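The abstract names Goldschmidt's method as one of the building blocks for nonlinear functions. The following is a minimal plaintext sketch of Goldschmidt division, not SecFormer's actual protocol: it only illustrates why the method suits SMPC, since each iteration uses nothing but additions and multiplications. The function name `goldschmidt_divide`, the iteration count, and the requirement that the divisor is already scaled into (0, 2) are assumptions made for this illustration; in a real SMPC setting the same arithmetic would run on secret shares rather than on floats.

```python
def goldschmidt_divide(a: float, b: float, iterations: int = 6) -> float:
    """Approximate a / b using only additions and multiplications.

    Assumption for this sketch: the divisor b is already scaled into (0, 2),
    which is the range where the iteration converges.
    """
    if not 0.0 < b < 2.0:
        raise ValueError("this sketch assumes the divisor is pre-scaled into (0, 2)")

    numerator, denominator = a, b
    for _ in range(iterations):
        # The correction factor pushes the denominator towards 1.
        factor = 2.0 - denominator
        numerator *= factor
        denominator *= factor
    # Once the denominator is close to 1, the numerator is close to a / b.
    return numerator


if __name__ == "__main__":
    print(goldschmidt_divide(3.0, 0.75))  # close to 4.0
```

The error in the denominator is squared at every step, so a handful of iterations is enough when the divisor is not too close to 0 or 2; divisors near those endpoints need more iterations or a better initial scaling.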
Related papers
- Accelerating Private Large Transformers Inference through Fine-grained Collaborative Computation [8.859237832459876]
We present FASTLMPI, a new approach to accelerate private TBM inference through fine-grained optimization.
Specifically, through the fine-grained co-design of homomorphic encryption and secret sharing, FASTLMPI achieves efficient protocols for matrix multiplication, SoftMax, LayerNorm, and GeLU.
FASTLMPI shows a remarkable 54% to 64% decrease in runtime and an impressive 72.2% reduction in communication costs.
arXiv Detail & Related papers (2024-12-21T08:33:12Z) - MOFHEI: Model Optimizing Framework for Fast and Efficient Homomorphically Encrypted Neural Network Inference [0.8388591755871735]
Homomorphic Encryption (HE) enables us to perform machine learning tasks over encrypted data.
We propose MOFHEI, a framework that optimizes the model to make HE-based neural network inference fast and efficient.
Our framework achieves up to 98% pruning ratio on LeNet, eliminating up to 93% of the required HE operations for performing PI.
arXiv Detail & Related papers (2024-12-10T22:44:54Z) - AERO: Softmax-Only LLMs for Efficient Private Inference [3.7802450241986945]
We present a comprehensive analysis to understand the role of nonlinearities in transformer-based decoder-only language models.
We introduce AERO, a four-step architectural optimization framework that refines the existing LLM architecture for efficient PI.
For the first time, we propose a Softmax-only architecture with significantly fewer FLOPs tailored for efficient PI.
arXiv Detail & Related papers (2024-10-16T21:40:49Z) - Enabling Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines [17.539008562641303]
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers.
The next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data.
Fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands.
arXiv Detail & Related papers (2024-09-23T20:14:09Z) - AdaPI: Facilitating DNN Model Adaptivity for Efficient Private Inference in Edge Computing [20.11448308239082]
AdaPI is a novel approach that achieves adaptive PI by allowing a model to perform well across edge devices with diverse energy budgets.
AdaPI attains optimal accuracy for each energy budget, which outperforms the state-of-the-art PI methods by 7.3% in terms of test accuracy on CIFAR-100.
arXiv Detail & Related papers (2024-07-08T05:58:49Z) - Ditto: Quantization-aware Secure Inference of Transformers upon MPC [5.161569981377991]
We propose the framework named Ditto to enable more efficient quantization-aware secure Transformer inference.
We conduct extensive experiments on Bert and GPT2 models to evaluate the performance of Ditto.
The results demonstrate that Ditto is about $3.14\sim 4.40\times$ faster than MPCFormer and $1.44\sim 2.35\times$ faster than the state-of-the-art work PUMA.
arXiv Detail & Related papers (2024-05-09T03:28:16Z) - Improved Communication-Privacy Trade-offs in $L_2$ Mean Estimation under Streaming Differential Privacy [47.997934291881414]
Existing mean estimation schemes are usually optimized for $L_\infty$ geometry and rely on random rotation or Kashin's representation to adapt to $L_2$ geometry.
We introduce a novel privacy accounting method for the sparsified Gaussian mechanism that incorporates the randomness inherent in sparsification into the DP.
Unlike previous approaches, our accounting algorithm directly operates in $L_2$ geometry, yielding MSEs that fast converge to those of the Gaussian mechanism.
arXiv Detail & Related papers (2024-05-02T03:48:47Z) - HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that, on a one-billion-parameter model, HiRE applied to both the softmax and the feedforward layers achieves almost matching pretraining and downstream accuracy and improves inference latency by $1.47\times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - Private Fine-tuning of Large Language Models with Zeroth-order Optimization [51.19403058739522]
Differentially private stochastic gradient descent (DP-SGD) allows models to be trained in a privacy-preserving manner.
We introduce DP-ZO, a private fine-tuning framework for large language models that privatizes zeroth-order optimization methods.
arXiv Detail & Related papers (2024-01-09T03:53:59Z) - East: Efficient and Accurate Secure Transformer Framework for Inference [7.887332345182056]
We propose a framework \emph{East} to enable efficient and accurate secure Transformer inference.
Compared to Iron, we achieve about 1.8$\times$ lower communication within 1.2$\times$ lower runtime.
arXiv Detail & Related papers (2023-08-19T06:26:14Z) - Efficient Long Sequence Modeling via State Space Augmented Transformer [92.74707853711374]
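We propose SPADE, short for $\underline{\textbf{S}}$tate s$\underline{\textbf{P}}$ace $\underline{\textbf{A}}$ugmente$\underline{\textbf{D}}$ Transform$\underline{\textbf{E}}$r.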
We augment an SSM into the bottom layer of SPADE, and we employ efficient local attention methods for the other layers.
Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-15T20:51:27Z) - Efficiently Scaling Transformer Inference [8.196193683641582]
We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings.
We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices.
We achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens.
arXiv Detail & Related papers (2022-11-09T18:50:38Z) - MPCFormer: fast, performant and private Transformer inference with MPC [64.23599808800738]
We design the framework MPCFORMER using secure multi-party computation (MPC) and Knowledge Distillation (KD).
MPCFORMER significantly speeds up Transformer model inference in MPC settings while achieving similar ML performance to the input model.
We show that MPCFORMER remains effective with different trained Transformer weights such as RoBERTa$_{\text{BASE}}$ and larger models including BERT$_{\text{LARGE}}$.
arXiv Detail & Related papers (2022-11-02T19:43:22Z) - THE-X: Privacy-Preserving Transformer Inference with Homomorphic Encryption [112.02441503951297]
Privacy-preserving inference of transformer models is in demand among cloud service users.
We introduce $\textit{THE-X}$, an approximation approach for transformers, which enables privacy-preserving inference of pre-trained models.
arXiv Detail & Related papers (2022-06-01T03:49:18Z) - CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z) - A Privacy-Preserving-Oriented DNN Pruning and Mobile Acceleration Framework [56.57225686288006]
Weight pruning of deep neural networks (DNNs) has been proposed to satisfy the limited storage and computing capability of mobile edge devices.
Previous pruning methods mainly focus on reducing the model size and/or improving performance without considering the privacy of user data.
We propose a privacy-preserving-oriented pruning and mobile acceleration framework that does not require the private training dataset.
arXiv Detail & Related papers (2020-03-13T23:52:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.