SecFormer: Towards Fast and Accurate Privacy-Preserving Inference for Large Language Models
- URL: http://arxiv.org/abs/2401.00793v3
- Date: Thu, 6 Jun 2024 05:22:44 GMT
- Title: SecFormer: Towards Fast and Accurate Privacy-Preserving Inference for Large Language Models
- Authors: Jinglong Luo, Yehong Zhang, Zhuo Zhang, Jiaqi Zhang, Xin Mu, Hui Wang, Yue Yu, Zenglin Xu
- Abstract summary: We introduce an advanced optimization framework called SecFormer to achieve fast and accurate PPI for Transformer models.
In terms of efficiency, SecFormer is 3.56 and 3.58 times faster than Puma for BERT$_{\text{BASE}}$ and BERT$_{\text{LARGE}}$, respectively.
- Score: 34.63351580241698
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the growing use of large language models hosted on cloud platforms to offer inference services, privacy concerns are escalating, especially concerning sensitive data like investment plans and bank account details. Secure Multi-Party Computing (SMPC) emerges as a promising solution to protect the privacy of inference data and model parameters. However, the application of SMPC in Privacy-Preserving Inference (PPI) for large language models, particularly those based on the Transformer architecture, often leads to considerable slowdowns or declines in performance. This is largely due to the multitude of nonlinear operations in the Transformer architecture, which are not well-suited to SMPC and difficult to circumvent or optimize effectively. To address this concern, we introduce an advanced optimization framework called SecFormer to achieve fast and accurate PPI for Transformer models. By implementing model design optimization, we successfully eliminate the high-cost exponential and maximum operations in PPI without sacrificing model performance. Additionally, we have developed a suite of efficient SMPC protocols that utilize segmented polynomials, Fourier series and Goldschmidt's method to handle other complex nonlinear functions within PPI, such as GeLU, LayerNorm, and Softmax. Our extensive experiments reveal that SecFormer outperforms MPCFormer in performance, showing improvements of $5.6\%$ and $24.2\%$ for BERT$_{\text{BASE}}$ and BERT$_{\text{LARGE}}$, respectively. In terms of efficiency, SecFormer is 3.56 and 3.58 times faster than Puma for BERT$_{\text{BASE}}$ and BERT$_{\text{LARGE}}$, demonstrating its effectiveness and speed.
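To make one of the protocol ideas concrete: Goldschmidt's method reduces division, which is expensive under SMPC, to a short sequence of multiplications and subtractions, exactly the operations that secret-sharing arithmetic handles cheaply. Below is a minimal plaintext NumPy sketch of the reciprocal iteration as it might serve a Softmax denominator; the function name and the power-of-two normalization are our illustration, not SecFormer's actual protocol.

```python
import numpy as np

def goldschmidt_reciprocal(d, iters=6):
    """Approximate 1/d (d > 0) with Goldschmidt iteration.

    Each round uses only subtraction and multiplication, which map
    directly onto secret-sharing arithmetic; no comparisons,
    exponentials, or table lookups are needed. Plaintext sketch.
    """
    # Normalize d into [0.5, 1] with a power of two. In an SMPC
    # protocol this scale would come from a public bound on d's range.
    e = np.ceil(np.log2(d))
    scale = 2.0 ** -e
    x = d * scale          # x in [0.5, 1]
    w = scale              # accumulates scale * prod(f_i) -> 1/d
    for _ in range(iters):
        f = 2.0 - x        # correction factor
        x = x * f          # drives x toward 1 (quadratic convergence)
        w = w * f
    return w

d = np.array([0.3, 2.0, 17.5])
print(goldschmidt_reciprocal(d), 1.0 / d)  # near-identical after 6 rounds
```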
Related papers
- Accelerating Private Large Transformers Inference through Fine-grained Collaborative Computation [8.859237832459876]
We present FASTLMPI, a new approach to accelerate private Transformer-based model (TBM) inference through fine-grained optimization.
Specifically, through the fine-grained co-design of homomorphic encryption and secret sharing, FASTLMPI achieves efficient protocols for matrix multiplication, SoftMax, LayerNorm, and GeLU.
FASTLMPI shows a remarkable 54% to 64% decrease in runtime and an impressive 72.2% reduction in communication costs.
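The secret-sharing half of such a co-design typically reduces matrix multiplication to Beaver-triple multiplication over an integer ring. The toy two-party sketch below is our own illustration (FASTLMPI's actual protocol interleaves this with homomorphic encryption); it shows why matmul is cheap once a correlated triple $(A, B, C = AB)$ is available.

```python
import numpy as np

MOD = 2**32                      # additive shares live in the ring Z_{2^32}
rng = np.random.default_rng(0)

def share(x):
    """Split an integer matrix into two additive shares mod 2^32."""
    r = rng.integers(0, MOD, size=x.shape, dtype=np.uint64)
    return r, (x - r) % MOD

def beaver_matmul(x0, x1, y0, y1, a, b):
    """Shares of X @ Y from shares of X, Y and a triple (A, B, C = A @ B)."""
    a0, a1 = share(a)
    b0, b1 = share(b)
    c0, c1 = share((a @ b) % MOD)
    # The parties open masked differences; E and F reveal nothing about
    # X and Y because A and B are uniformly random.
    e = (x0 - a0 + x1 - a1) % MOD    # E = X - A, reconstructed in the clear
    f = (y0 - b0 + y1 - b1) % MOD    # F = Y - B
    # X @ Y = E @ F + E @ B + A @ F + C, split additively between parties.
    z0 = (e @ f + e @ b0 + a0 @ f + c0) % MOD
    z1 = (e @ b1 + a1 @ f + c1) % MOD
    return z0, z1

X = rng.integers(0, 100, size=(2, 3), dtype=np.uint64)
Y = rng.integers(0, 100, size=(3, 2), dtype=np.uint64)
A = rng.integers(0, MOD, size=(2, 3), dtype=np.uint64)
B = rng.integers(0, MOD, size=(3, 2), dtype=np.uint64)
z0, z1 = beaver_matmul(*share(X), *share(Y), A, B)
assert np.array_equal((z0 + z1) % MOD, (X @ Y) % MOD)
```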
arXiv Detail & Related papers (2024-12-21T08:33:12Z)
- MOFHEI: Model Optimizing Framework for Fast and Efficient Homomorphically Encrypted Neural Network Inference [0.8388591755871735]
Homomorphic Encryption (HE) enables us to perform machine learning tasks over encrypted data.
We propose MOFHEI, a framework that optimizes the model to make HE-based neural network inference fast and efficient.
Our framework achieves up to 98% pruning ratio on LeNet, eliminating up to 93% of the required HE operations for performing PI.
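Pruning pays off doubly under HE, because every surviving weight costs ciphertext operations. A generic magnitude-pruning sketch is below; MOFHEI's actual method is packing-aware (it prunes in patterns aligned with ciphertext slots), which this simplified version does not capture.

```python
import numpy as np

def magnitude_prune(weights, ratio):
    """Zero out the smallest-magnitude `ratio` fraction of weights.

    Generic unstructured pruning for illustration; under HE, each
    zeroed weight removes a ciphertext multiply-accumulate outright.
    """
    flat = np.abs(weights).ravel()
    k = int(ratio * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]
    # Ties at the threshold may prune slightly more than `ratio`.
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

W = np.random.default_rng(1).normal(size=(16, 16))
Wp = magnitude_prune(W, ratio=0.9)
print(f"HE ops avoided: {np.sum(Wp == 0)} / {W.size}")
```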
arXiv Detail & Related papers (2024-12-10T22:44:54Z)
- AdaPI: Facilitating DNN Model Adaptivity for Efficient Private Inference in Edge Computing [20.11448308239082]
AdaPI is a novel approach that achieves adaptive PI by allowing a model to perform well across edge devices with diverse energy budgets.
AdaPI attains optimal accuracy for each energy budget, which outperforms the state-of-the-art PI methods by 7.3% in terms of test accuracy on CIFAR-100.
arXiv Detail & Related papers (2024-07-08T05:58:49Z)
- Ditto: Quantization-aware Secure Inference of Transformers upon MPC [5.161569981377991]
We propose Ditto, a framework that enables more efficient quantization-aware secure Transformer inference.
We conduct extensive experiments on BERT and GPT-2 models to evaluate the performance of Ditto.
The results demonstrate that Ditto is about $3.14\sim 4.40\times$ faster than MPCFormer and $1.44\sim 2.35\times$ faster than the state-of-the-art work PUMA.
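The core idea, quantization lowering MPC cost, can be illustrated with plain symmetric fixed-point quantization: narrower integers mean smaller shares and cheaper truncation. The sketch below is generic; Ditto's contribution lies in making such quantization and the associated type casts secure and accuracy-preserving, which a plaintext snippet cannot show.

```python
import numpy as np

def quantize_sym(x, bits=8):
    """Symmetric uniform quantization to signed `bits`-bit integers.

    MPC arithmetic runs over integer rings, so fewer bits per value
    shrink share size and communication (generic sketch, not Ditto's
    type-cast-aware scheme).
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.max(np.abs(x)) / qmax, 1e-12)  # guard all-zero input
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale

x = np.random.default_rng(2).normal(size=8)
q, s = quantize_sym(x)
print(np.max(np.abs(dequantize(q, s) - x)))  # error bounded by ~scale/2
```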
arXiv Detail & Related papers (2024-05-09T03:28:16Z)
- Improved Communication-Privacy Trade-offs in $L_2$ Mean Estimation under Streaming Differential Privacy [47.997934291881414]
Existing mean estimation schemes are usually optimized for $L_\infty$ geometry and rely on random rotation or Kashin's representation to adapt to $L_2$ geometry.
We introduce a novel privacy accounting method for the sparsified Gaussian mechanism that incorporates the randomness inherent in sparsification into the DP.
Unlike previous approaches, our accounting algorithm directly operates in $L_2$ geometry, yielding MSEs that converge quickly to those of the Gaussian mechanism.
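For intuition, a sparsified Gaussian report looks like the sketch below: each client keeps a random subset of coordinates (saving communication), rescales for unbiasedness, and perturbs what it sends. This is our generic rendering; the paper's contribution is the accounting that treats the sparsification randomness as part of the DP mechanism rather than analyzing the noise alone.

```python
import numpy as np

def sparsified_gaussian_report(x, keep_prob, sigma, rng):
    """One client's private report of its vector x (generic sketch).

    Only the kept coordinates (and their indices) need to be sent,
    which is where the communication savings come from.
    """
    mask = rng.random(x.shape) < keep_prob
    values = x / keep_prob + rng.normal(0.0, sigma, x.shape)  # unbiased + noised
    return np.where(mask, values, 0.0)

rng = np.random.default_rng(3)
true_mean = np.linspace(-1.0, 1.0, 16)
clients = [true_mean + rng.normal(0.0, 0.1, 16) for _ in range(2000)]
reports = [sparsified_gaussian_report(c, keep_prob=0.25, sigma=0.5, rng=rng)
           for c in clients]
est = np.mean(reports, axis=0)      # MSE shrinks as the client count grows
print(np.mean((est - true_mean) ** 2))
```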
arXiv Detail & Related papers (2024-05-02T03:48:47Z)
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference by $1.47\times$ on a single TPUv5e device.
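The two-stage idea behind component (i) can be sketched in a few lines: score all outputs with a cheap compressed weight, keep an enlarged candidate set so recall stays high, and run the exact computation only on those candidates. The low-rank sketch and the 2x overshoot below are our illustrative choices, not HiRE's exact compression scheme.

```python
import numpy as np

def two_stage_topk(x, W, W_cheap, k, overshoot=2):
    """Approximate top-k logits: cheap scores pick candidates with high
    recall, exact computation is restricted to that subset (sketch of
    HiRE's component (i); the DA-TOP-k multi-device logic is omitted)."""
    approx = x @ W_cheap                       # cheap full pass
    m = overshoot * k                          # enlarged candidate set
    cand = np.argpartition(approx, -m)[-m:]
    exact = x @ W[:, cand]                     # expensive pass, subset only
    return cand[np.argpartition(exact, -k)[-k:]]

rng = np.random.default_rng(4)
# Weights with approximate low-rank structure, so a cheap sketch works.
W = (rng.normal(size=(64, 8)) @ rng.normal(size=(8, 4096))
     + 0.05 * rng.normal(size=(64, 4096)))
# Illustrative compression: a rank-8 approximation of W, precomputed offline.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W_cheap = (U[:, :8] * S[:8]) @ Vt[:8]
x = rng.normal(size=64)
approx_ids = set(two_stage_topk(x, W, W_cheap, k=16).tolist())
exact_ids = set(np.argpartition(x @ W, -16)[-16:].tolist())
print(f"recall: {len(approx_ids & exact_ids) / 16:.2f}")
```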
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
- East: Efficient and Accurate Secure Transformer Framework for Inference [7.887332345182056]
We propose a framework \emph{East} to enable efficient and accurate secure Transformer inference.
Compared to Iron, we achieve about 1.8$\times$ lower communication within 1.2$\times$ lower runtime.
arXiv Detail & Related papers (2023-08-19T06:26:14Z)
- MPCFormer: fast, performant and private Transformer inference with MPC [64.23599808800738]
We design the framework MPCFORMER using secure multi-party computation (MPC) and Knowledge Distillation (KD).
MPCFORMER significantly speeds up Transformer model inference in MPC settings while achieving similar ML performance to the input model.
We show that MPCFORMER remains effective with different trained Transformer weights such as RoBERTa$_{\text{BASE}}$ and larger models including BERT$_{\text{LARGE}}$.
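The recipe is: swap the MPC-expensive nonlinearities for cheap polynomial surrogates, then distill from the original model to recover accuracy. A minimal sketch of both halves follows; the "2Quad"-style constant and the simple MSE objectives are illustrative stand-ins for MPCFormer's exact training setup.

```python
import numpy as np

def quad_softmax(scores, c=5.0):
    """MPC-friendly softmax surrogate in the spirit of MPCFormer's
    "2Quad": exp(x) is replaced by (x + c)^2, leaving only additions,
    multiplications, and one division (the constant c is illustrative)."""
    q = (scores + c) ** 2
    return q / np.sum(q, axis=-1, keepdims=True)

def distill_loss(student_hiddens, teacher_hiddens,
                 student_logits, teacher_logits):
    """Simplified layer-wise + logit distillation objective: the
    approximated student is trained to mimic the exact teacher."""
    layer_mse = sum(np.mean((s - t) ** 2)
                    for s, t in zip(student_hiddens, teacher_hiddens))
    return layer_mse + np.mean((student_logits - teacher_logits) ** 2)

scores = np.array([[1.0, 2.0, 3.0]])
print(quad_softmax(scores))  # peaked like softmax, but cheap under MPC
```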
arXiv Detail & Related papers (2022-11-02T19:43:22Z)
- THE-X: Privacy-Preserving Transformer Inference with Homomorphic Encryption [112.02441503951297]
Privacy-preserving inference of transformer models is in demand among cloud service users.
We introduce $\textit{THE-X}$, an approximation approach for transformers, which enables privacy-preserving inference of pre-trained models.
arXiv Detail & Related papers (2022-06-01T03:49:18Z)
- A Privacy-Preserving-Oriented DNN Pruning and Mobile Acceleration Framework [56.57225686288006]
Weight pruning of deep neural networks (DNNs) has been proposed to satisfy the limited storage and computing capability of mobile edge devices.
Previous pruning methods mainly focus on reducing the model size and/or improving performance without considering the privacy of user data.
We propose a privacy-preserving-oriented pruning and mobile acceleration framework that does not require the private training dataset.
arXiv Detail & Related papers (2020-03-13T23:52:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.