Accelerating Private Large Transformers Inference through Fine-grained Collaborative Computation
- URL: http://arxiv.org/abs/2412.16537v2
- Date: Sat, 25 Jan 2025 14:10:34 GMT
- Title: Accelerating Private Large Transformers Inference through Fine-grained Collaborative Computation
- Authors: Yuntian Chen, Zhanyong Tang, Tianpei Lu, Bingsheng Zhang, Zhiying Shi, Zheng Wang,
- Abstract summary: We present FASTLMPI, a new approach to accelerate private TBM inference through fine-grained optimization.
Specifically, through the fine-grained co-design of homomorphic encryption and secret sharing, FASTLMPI achieves efficient protocols for matrix multiplication, SoftMax, LayerNorm, and GeLULU.
FASTLMPI shows a remarkable 54% to 64% decrease in runtime and an impressive 72.2% reduction in communication costs.
- Score: 8.859237832459876
- License:
- Abstract: Homomorphic encryption (HE) and secret sharing (SS) enable computations on encrypted data, providing significant privacy benefits for large transformer-based models (TBM) in sensitive sectors like medicine and finance. However, private TBM inference incurs significant costs due to the coarse-grained application of HE and SS. We present FASTLMPI, a new approach to accelerate private TBM inference through fine-grained computation optimization. Specifically, through the fine-grained co-design of homomorphic encryption and secret sharing, FASTLMPI achieves efficient protocols for matrix multiplication, SoftMax, LayerNorm, and GeLU. In addition, FASTLMPI introduces a precise segmented approximation technique for differentiable non-linear, improving its fitting accuracy while maintaining a low polynomial degree. Compared to solution BOLT (S&P'24), FASTLMPI shows a remarkable 54% to 64% decrease in runtime and an impressive 72.2% reduction in communication costs.
Related papers
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs [74.74225314708225]
Multi-head Latent Attention (MLA) is an innovative architecture designed to ensure efficient and economical inference.
This paper proposes the first data-efficient fine-tuning method for transitioning from Multi-Head Attention to MLA.
arXiv Detail & Related papers (2025-02-20T18:50:42Z) - Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding.
PMPD achieves 1.4$-$12.2$times$ speedup in matrix-vector multiplications over fp16 models.
Our approach delivers a throughput gain of 3.8$-$8.0$times$ over fp16 models and up to 1.54$times$ over uniform quantization approaches.
arXiv Detail & Related papers (2024-10-17T11:46:33Z) - Encryption-Friendly LLM Architecture [11.386436468650016]
Homomorphic encryption (HE) is a cryptographic protocol supporting arithmetic computations in encrypted states.
We propose a modified HE-friendly transformer architecture with an emphasis on inference following personalized (private) fine-tuning.
arXiv Detail & Related papers (2024-10-03T13:48:35Z) - Improved Communication-Privacy Trade-offs in $L_2$ Mean Estimation under Streaming Differential Privacy [47.997934291881414]
Existing mean estimation schemes are usually optimized for $L_infty$ geometry and rely on random rotation or Kashin's representation to adapt to $L$ geometry.
We introduce a novel privacy accounting method for the sparsified Gaussian mechanism that incorporates the randomness inherent in sparsification into the DP.
Unlike previous approaches, our accounting algorithm directly operates in $L$ geometry, yielding MSEs that fast converge to those of the Gaussian mechanism.
arXiv Detail & Related papers (2024-05-02T03:48:47Z) - Batch-oriented Element-wise Approximate Activation for Privacy-Preserving Neural Networks [5.039738753594332]
Homomorphic Encryption (FHE) is facing a great challenge that homomorphic operations cannot be easily adapted for non-linear activation calculations.
Batch-oriented element-wise data packing and approximate activation are proposed, which train linear low-degrees to approximate the non-linear activation function - ReLU.
Experiment results show that when ciphertext inference is performed on 4096 input images, compared with the current most efficient channel-wise method, the inference accuracy is improved by 1.65%, and the amortized inference time is reduced by 99.5%.
arXiv Detail & Related papers (2024-03-16T13:26:33Z) - SecFormer: Fast and Accurate Privacy-Preserving Inference for Transformer Models via SMPC [34.63351580241698]
We introduce a comprehensive PPI framework called SecFormer to achieve fast and accurate PPI for Transformer models.
In terms of efficiency, SecFormer is 3.57 and 3.58 times faster than PUMA for BERT$_textBASE$ and BERT$_textLARGE$, demonstrating its effectiveness and speed.
arXiv Detail & Related papers (2024-01-01T15:40:35Z) - Practical Privacy-Preserving Gaussian Process Regression via Secret
Sharing [23.80837224347696]
This paper proposes a privacy-preserving GPR method based on secret sharing (SS)
We derive a new SS-based exponentiation operation through the idea of 'confusion-correction' and construct an SS-based matrix inversion algorithm based on Cholesky decomposition.
Empirical results show that our proposed method can achieve reasonable accuracy and efficiency under the premise of preserving data privacy.
arXiv Detail & Related papers (2023-06-26T08:17:51Z) - Breaking the Communication-Privacy-Accuracy Tradeoff with
$f$-Differential Privacy [51.11280118806893]
We consider a federated data analytics problem in which a server coordinates the collaborative data analysis of multiple users with privacy concerns and limited communication capability.
We study the local differential privacy guarantees of discrete-valued mechanisms with finite output space through the lens of $f$-differential privacy (DP)
More specifically, we advance the existing literature by deriving tight $f$-DP guarantees for a variety of discrete-valued mechanisms.
arXiv Detail & Related papers (2023-02-19T16:58:53Z) - Distributed Reinforcement Learning for Privacy-Preserving Dynamic Edge
Caching [91.50631418179331]
A privacy-preserving distributed deep policy gradient (P2D3PG) is proposed to maximize the cache hit rates of devices in the MEC networks.
We convert the distributed optimizations into model-free Markov decision process problems and then introduce a privacy-preserving federated learning method for popularity prediction.
arXiv Detail & Related papers (2021-10-20T02:48:27Z) - Differentially Private Federated Learning with Laplacian Smoothing [72.85272874099644]
Federated learning aims to protect data privacy by collaboratively learning a model without sharing private data among users.
An adversary may still be able to infer the private training data by attacking the released model.
Differential privacy provides a statistical protection against such attacks at the price of significantly degrading the accuracy or utility of the trained models.
arXiv Detail & Related papers (2020-05-01T04:28:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.