CipherPrune: Efficient and Scalable Private Transformer Inference
- URL: http://arxiv.org/abs/2502.16782v2
- Date: Wed, 05 Mar 2025 20:18:29 GMT
- Title: CipherPrune: Efficient and Scalable Private Transformer Inference
- Authors: Yancheng Zhang, Jiaqi Xue, Mengxin Zheng, Mimi Xie, Mingzhe Zhang, Lei Jiang, Qian Lou
- Abstract summary: Private Transformer inference using cryptographic protocols offers promising solutions for privacy-preserving machine learning. However, it still faces significant runtime overhead (efficiency issues) and challenges in handling long-token inputs. We propose CipherPrune, an efficient and scalable private inference framework.
- Score: 12.853162687405465
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Private Transformer inference using cryptographic protocols offers promising solutions for privacy-preserving machine learning; however, it still faces significant runtime overhead (efficiency issues) and challenges in handling long-token inputs (scalability issues). We observe that the Transformer's operational complexity scales quadratically with the number of input tokens, making it essential to reduce the input token length. Notably, each token varies in importance, and many inputs contain redundant tokens. Additionally, prior private inference methods that rely on high-degree polynomial approximations for non-linear activations are computationally expensive. Therefore, reducing the polynomial degree for less important tokens can significantly accelerate private inference. Building on these observations, we propose \textit{CipherPrune}, an efficient and scalable private inference framework that includes a secure encrypted token pruning protocol, a polynomial reduction protocol, and corresponding Transformer network optimizations. At the protocol level, encrypted token pruning adaptively removes unimportant tokens from encrypted inputs in a progressive, layer-wise manner. Additionally, encrypted polynomial reduction assigns lower-degree polynomials to less important tokens after pruning, enhancing efficiency without decryption. At the network level, we introduce protocol-aware network optimization via a gradient-based search to maximize pruning thresholds and polynomial reduction conditions while maintaining the desired accuracy. Our experiments demonstrate that CipherPrune reduces the execution overhead of private Transformer inference by approximately $6.1\times$ for 128-token inputs and $10.6\times$ for 512-token inputs, compared to previous methods, with only a marginal drop in accuracy. The code is publicly available at https://github.com/UCF-Lou-Lab-PET/cipher-prune-inference.
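The core pipeline can be pictured in plaintext before any cryptography is layered on top. The sketch below is a minimal, illustrative rendering of layer-wise token pruning plus polynomial reduction, assuming attention-based importance scores and polynomial surrogates fit to GELU; the function names, thresholds, and degrees are hypothetical, and the paper's secure comparison and HE/MPC protocols are not shown.

```python
import numpy as np

def gelu(x):
    # Reference GELU (tanh form), used here only to fit polynomial surrogates.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# Fit surrogates offline; an encrypted backend would evaluate only the polynomials.
_xs = np.linspace(-4.0, 4.0, 512)
POLY_HIGH = np.polyfit(_xs, gelu(_xs), 8)  # accurate but costly under encryption
POLY_LOW = np.polyfit(_xs, gelu(_xs), 2)   # cheap, reserved for less important tokens

def prune_and_reduce(hidden, attn, prune_thr, reduce_thr):
    """One layer of token pruning + polynomial reduction (plaintext sketch).

    hidden: [n_tokens, d_model] activations entering the layer's MLP
    attn:   [n_heads, n_tokens, n_tokens] attention weights of this layer
    """
    # Importance proxy: average attention each token receives (a common heuristic;
    # the paper's exact scoring rule and its secure evaluation are not shown).
    importance = attn.mean(axis=(0, 1))

    keep = importance >= prune_thr            # progressive, layer-wise pruning
    hidden, importance = hidden[keep], importance[keep]

    low = importance < reduce_thr             # polynomial reduction for survivors
    out = np.empty_like(hidden)
    out[~low] = np.polyval(POLY_HIGH, hidden[~low])
    out[low] = np.polyval(POLY_LOW, hidden[low])
    return out, keep
```

In the actual framework these thresholds are not hand-picked; the protocol-aware, gradient-based search described in the abstract tunes the pruning thresholds and polynomial-reduction conditions per layer to meet the target accuracy.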
Related papers
- Accelerating Private Large Transformers Inference through Fine-grained Collaborative Computation [8.859237832459876]
We present FASTLMPI, a new approach to accelerate private TBM inference through fine-grained optimization. Specifically, through the fine-grained co-design of homomorphic encryption and secret sharing, FASTLMPI achieves efficient protocols for matrix multiplication, SoftMax, LayerNorm, and GeLU. FASTLMPI shows a remarkable 54% to 64% decrease in runtime and an impressive 72.2% reduction in communication costs.
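For context, the sketch below shows the textbook secret-shared matrix multiplication with Beaver triples, the standard building block such protocols refine. It is not FASTLMPI's actual fine-grained HE/secret-sharing co-design; the toy ring size, the dealer inlined into one function, and the names are illustrative only.

```python
import numpy as np

MOD = 2**16  # toy ring size; real protocols use much larger rings or fields

def share(x):
    """Additively secret-share an integer matrix between two parties."""
    r = np.random.randint(0, MOD, size=x.shape).astype(np.int64)
    return r, (x - r) % MOD

def beaver_matmul(x0, x1, y0, y1):
    """Textbook Beaver-triple matrix multiplication on additive shares.
    A trusted dealer (or an offline HE phase) supplies shares of random A, B
    and of C = A @ B; here the 'dealer' is inlined for brevity."""
    a = np.random.randint(0, MOD, size=x0.shape).astype(np.int64)
    b = np.random.randint(0, MOD, size=y0.shape).astype(np.int64)
    a0, a1 = share(a)
    b0, b1 = share(b)
    c0, c1 = share(a @ b % MOD)

    # Each party masks its shares; the masked differences are opened publicly.
    e = (x0 - a0 + x1 - a1) % MOD  # equals X - A once opened
    f = (y0 - b0 + y1 - b1) % MOD  # equals Y - B once opened

    # Local computation of shares of X @ Y = E@F + E@B + A@F + A@B.
    z0 = (e @ f + e @ b0 + a0 @ f + c0) % MOD
    z1 = (e @ b1 + a1 @ f + c1) % MOD
    return z0, z1

# Sanity check: the reconstructed product matches the plaintext product modulo MOD.
X = np.random.randint(0, 100, size=(3, 4)).astype(np.int64)
Y = np.random.randint(0, 100, size=(4, 2)).astype(np.int64)
x0, x1 = share(X)
y0, y1 = share(Y)
z0, z1 = beaver_matmul(x0, x1, y0, y1)
assert np.array_equal((z0 + z1) % MOD, (X @ Y) % MOD)
```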
arXiv Detail & Related papers (2024-12-21T08:33:12Z) - FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - Three-Input Ciphertext Multiplication for Homomorphic Encryption [6.390468088226496]
Homomorphic encryption (HE) allows computations directly on ciphertexts.
HE is essential to privacy-preserving computing, such as neural network inference, medical diagnosis, and financial data analysis.
This paper proposes 3-input ciphertext multiplication to reduce complexity of computations.
arXiv Detail & Related papers (2024-10-17T13:40:49Z) - CryptoTrain: Fast Secure Training on Encrypted Dataset [17.23344104239024]
We develop a hybrid cryptographic protocol that merges Homomorphic Encryption with Oblivious Transfer (OT) for handling linear and non-linear operations.
By integrating CCMul-Precompute and correlated convolution into CryptoTrain-B, we facilitate a rapid and efficient secure training framework.
arXiv Detail & Related papers (2024-09-25T07:06:14Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - Implementation of Entropically Secure Encryption: Securing Personal Health Data [0.704590071265998]
Entropically Secure Encryption (ESE) offers unconditional security with keys shorter than those of the One-Time Pad.
We present the first implementation of ESE for bulk encryption.
arXiv Detail & Related papers (2024-04-04T12:07:33Z) - Transformer based Pluralistic Image Completion with Reduced Information Loss [72.92754600354199]
Transformer based methods have achieved great success in image inpainting recently.
They regard each pixel as a token, thus suffering from an information loss issue.
We propose a new transformer based framework called "PUT".
arXiv Detail & Related papers (2024-03-31T01:20:16Z) - Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens [65.4435926060951]
We propose to significantly improve the efficiency of Transformers for ultra long sequences, by compressing the sequence into a much smaller representation at each layer.
Our algorithm is not only efficient (achieving more than $3\times$ efficiency gain compared to baselines on 4K and 16K lengths) but also offers competitive/better performance on a large number of tasks.
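As a rough illustration of prioritizing important tokens, the sketch below keeps the top-scoring tokens verbatim and mean-pools the remainder; Vcc's actual compression and decompression operators are learned and considerably more involved, and the function and parameter names here are made up.

```python
import numpy as np

def compress_sequence(hidden, importance, k_keep, pool=4):
    """Keep the k most important tokens and mean-pool the rest (illustrative only)."""
    order = np.argsort(-importance)
    vip = np.sort(order[:k_keep])      # preserve the original order of kept tokens
    rest = np.sort(order[k_keep:])
    pooled = [hidden[rest[i:i + pool]].mean(axis=0)
              for i in range(0, len(rest), pool)]
    tail = np.stack(pooled) if pooled else np.empty((0, hidden.shape[1]))
    return np.concatenate([hidden[vip], tail], axis=0)
```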
arXiv Detail & Related papers (2023-05-07T10:32:18Z) - THE-X: Privacy-Preserving Transformer Inference with Homomorphic Encryption [112.02441503951297]
Privacy-preserving inference of transformer models is on the demand of cloud service users.
We introduce THE-X, an approximation approach for transformers, which enables privacy-preserving inference of pre-trained models.
arXiv Detail & Related papers (2022-06-01T03:49:18Z) - Learned Token Pruning for Transformers [39.181816379061374]
The Learned Token Pruning (LTP) method reduces redundant tokens as the data passes through the different layers of a transformer.
We extensively test the performance of our approach on multiple GLUE tasks.
Preliminary results show up to 1.4x and 1.9x throughput improvement on a Tesla T4 GPU and an Intel Haswell CPU, respectively.
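A minimal sketch of the core trick, assuming attention-derived importance scores: during training the pruning decision is relaxed to a differentiable sigmoid mask around a learnable per-layer threshold, and at inference the mask is hardened into an actual drop. The temperature value and names below are illustrative, not the paper's exact parameterization.

```python
import numpy as np

def soft_prune_mask(importance, threshold, temperature=0.01):
    # Differentiable surrogate used while learning the per-layer threshold.
    return 1.0 / (1.0 + np.exp(-(importance - threshold) / temperature))

def hard_prune(hidden, importance, threshold):
    # At inference, tokens below the learned threshold are simply removed.
    return hidden[importance >= threshold]
```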
arXiv Detail & Related papers (2021-07-02T09:00:13Z) - FFConv: Fast Factorized Neural Network Inference on Encrypted Data [9.868787266501036]
We propose a low-rank factorization method called FFConv to unify convolution and ciphertext packing.
Compared to prior art LoLa and Falcon, our method reduces the inference latency by up to 87% and 12%, respectively.
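The generic low-rank idea can be shown on a plain weight matrix: factor it into two thinner matrices so that fewer homomorphic multiplications (and better ciphertext packing) are needed. This is only the common truncated-SVD version, not FFConv's specific factorization of convolutions, and the names are illustrative.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Split W [out, in] into A [out, rank] @ B [rank, in] via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # absorb singular values into the left factor
    B = Vt[:rank]
    return A, B                  # applying B then A costs O(rank * (in + out))

# Example: a rank-8 approximation of a 256x512 layer.
W = np.random.randn(256, 512)
A, B = low_rank_factorize(W, rank=8)
approx_error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
```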
arXiv Detail & Related papers (2021-02-06T03:10:13Z) - Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
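The compression step can be pictured as strided pooling of the hidden-state sequence between blocks; this minimal sketch shows mean pooling with stride 2 and omits Funnel-Transformer's query-only pooling in attention and the decoder used for token-level tasks.

```python
import numpy as np

def funnel_pool(hidden, stride=2):
    """Shrink the sequence by mean-pooling adjacent hidden states."""
    n, d = hidden.shape
    n_trim = n - (n % stride)                       # drop a ragged tail, if any
    return hidden[:n_trim].reshape(-1, stride, d).mean(axis=1)  # [n_trim // stride, d]
```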
arXiv Detail & Related papers (2020-06-05T05:16:23Z)