Breaking the Layer Barrier: Remodeling Private Transformer Inference with Hybrid CKKS and MPC
- URL: http://arxiv.org/abs/2508.19525v2
- Date: Mon, 01 Sep 2025 13:24:24 GMT
- Title: Breaking the Layer Barrier: Remodeling Private Transformer Inference with Hybrid CKKS and MPC
- Authors: Tianshi Xu, Wen-jie Lu, Jiangrui Yu, Chen Yi, Chenqi Lin, Runsheng Wang, Meng Li
- Abstract summary: This paper presents an efficient framework for private Transformer inference that combines Homomorphic Encryption (HE) and Secure Multi-party Computation (MPC) to protect data privacy. The proposed framework, dubbed BLB, overcomes the high cost of HE/MPC conversions by breaking layers down into fine-grained operators and fusing adjacent linear operators, reducing the need for such conversions. BLB achieves a $21\times$ reduction in communication overhead compared to BOLT (S&P'24) and a $2\times$ reduction compared to Bumblebee (NDSS'25), along with latency reductions of $13\times$ and $1.8\times$, respectively.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents an efficient framework for private Transformer inference that combines Homomorphic Encryption (HE) and Secure Multi-party Computation (MPC) to protect data privacy. Existing methods often leverage HE for linear layers (e.g., matrix multiplications) and MPC for non-linear layers (e.g., Softmax activation functions), but the conversion between HE and MPC introduces significant communication costs. The proposed framework, dubbed BLB, overcomes this by breaking down layers into fine-grained operators and further fusing adjacent linear operators, reducing the need for HE/MPC conversions. To manage the increased ciphertext bit width from the fused linear operators, BLB proposes the first secure conversion protocol between CKKS and MPC and enables CKKS-based computation of the fused operators. Additionally, BLB proposes an efficient matrix multiplication protocol for fused computation in Transformers. Extensive evaluations on BERT-base, BERT-large, and GPT2-base show that BLB achieves a $21\times$ reduction in communication overhead compared to BOLT (S\&P'24) and a $2\times$ reduction compared to Bumblebee (NDSS'25), along with latency reductions of $13\times$ and $1.8\times$, respectively, when leveraging GPU acceleration.
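The fusion idea at the heart of BLB can be illustrated with plain, unencrypted linear algebra. The following is a toy NumPy sketch, not BLB's protocol: it only shows why fusing adjacent linear operators preserves the result while halving the number of linear stages (and hence HE/MPC boundary crossings), at the cost of a larger-magnitude fused matrix, which is the increased ciphertext bit width the CKKS/MPC conversion protocol is designed to manage.

```python
# Toy illustration in the clear (no encryption): two adjacent linear
# operators W2·(W1·x) are equivalent to one fused operator (W2·W1)·x,
# so a hybrid HE/MPC pipeline needs one linear stage and one HE->MPC
# conversion instead of two. The fused matrix has larger entries,
# which corresponds to the "increased ciphertext bit width" that
# BLB's CKKS<->MPC conversion protocol handles.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
W1 = rng.standard_normal((64, 64))
W2 = rng.standard_normal((64, 64))

unfused = W2 @ (W1 @ x)   # two linear ops -> two HE/MPC boundaries
fused = (W2 @ W1) @ x     # one fused op   -> one HE/MPC boundary

assert np.allclose(unfused, fused)
assert np.abs(W2 @ W1).max() > np.abs(W2).max()  # fused entries grow
```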
Related papers
- LRD-MPC: Efficient MPC Inference through Low-rank Decomposition [11.1852308328843]
Secure Multi-party Computation enables untrusted parties to jointly compute a function without revealing their inputs.
Deep neural networks rely heavily on convolutional and fully connected layers, which require costly matrix multiplications in MPC.
We propose leveraging low-rank decomposition (LRD) for linear layers, replacing one large matrix multiplication with two smaller ones.
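The cost saving behind low-rank decomposition is easy to sketch. This is a generic NumPy illustration with made-up dimensions, not the LRD-MPC protocol itself: MPC matrix-multiplication cost scales with the number of scalar multiplications, and a rank-r factorization replaces d*d of them with 2*d*r per input vector.

```python
# Sketch (not the LRD-MPC protocol): approximate a d x d weight matrix
# by a rank-r factorization W ≈ U_r @ V_r, so one large matmul becomes
# two skinny ones. This wins whenever r < d/2.
import numpy as np

d, r = 512, 32
rng = np.random.default_rng(1)
W = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))  # exactly rank r

U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_r = U[:, :r] * s[:r]   # absorb singular values into the left factor
V_r = Vt[:r, :]

x = rng.standard_normal(d)
assert np.allclose(W @ x, U_r @ (V_r @ x))  # same result, cheaper path

mults_full = d * d      # 262144 multiplications per input vector
mults_lr = 2 * d * r    # 32768 multiplications per input vector
print(mults_full / mults_lr)  # -> 8.0
```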
arXiv Detail & Related papers (2026-02-16T02:11:38Z)
- EcoSpa: Efficient Transformer Training with Coupled Sparsity [79.5008521101473]
Transformers have become the backbone of modern AI, yet their high computational demands pose critical system challenges.
We introduce EcoSpa, an efficient structured sparse training method that jointly evaluates and sparsifies coupled weight matrix pairs.
arXiv Detail & Related papers (2025-11-09T11:23:43Z)
- Efficient and Privacy-Preserving Binary Dot Product via Multi-Party Computation [4.336006969179338]
This paper proposes a novel binary multi-party computation (BiMPC) framework for bitwise operations.
The core of BiMPC is a novel approach called Dot Product via Modular Addition (DoMA), which uses regular and modular additions for efficient binary dot product calculation.
The privacy guarantees of the BiMPC framework are rigorously analyzed, demonstrating its efficiency and scalability in distributed settings.
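For reference, the arithmetic identity a binary dot product protocol targets can be checked in the clear. This is a plaintext sketch with bit-packing, not the secure DoMA protocol: for 0/1 vectors, the dot product equals the popcount of the bitwise AND.

```python
# Plaintext reference (not the secure DoMA protocol): with 0/1 vectors
# packed into machine integers, <a, b> = popcount(a & b).
a_bits = [1, 0, 1, 1, 0, 1, 0, 0]
b_bits = [1, 1, 0, 1, 0, 1, 1, 0]

# Pack bits into integers (bit i = element i).
a = sum(bit << i for i, bit in enumerate(a_bits))
b = sum(bit << i for i, bit in enumerate(b_bits))

dot_packed = bin(a & b).count("1")                      # popcount
dot_naive = sum(x * y for x, y in zip(a_bits, b_bits))  # plain sum
assert dot_packed == dot_naive == 3
```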
arXiv Detail & Related papers (2025-10-18T03:35:42Z)
- Privacy-Preserving Inference for Quantized BERT Models [13.36359444231145]
Quantization offers a promising solution by converting floating-point operations into lower-precision integer computations.
We propose a fine-grained, layer-wise quantization scheme and support 1-bit weight fully connected layers in a secure setting.
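The general mechanism can be sketched with symmetric per-tensor quantization. The bit width and scales here are hypothetical, not the paper's layer-wise scheme: the point is that the matrix multiplication becomes pure integer arithmetic, which secure integer protocols handle natively, with a single float rescale at the end.

```python
# Generic symmetric quantization sketch (not the paper's scheme):
# a float matmul W @ x becomes an integer matmul plus one rescale.
import numpy as np

def quantize(x, bits=8):
    # Map floats to signed integers in [-(2^(b-1)-1), 2^(b-1)-1].
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale).astype(np.int32), scale

rng = np.random.default_rng(2)
W = rng.standard_normal((16, 16)).astype(np.float32)
x = rng.standard_normal(16).astype(np.float32)

Wq, sw = quantize(W)
xq, sx = quantize(x)
y_int = Wq @ xq               # pure integer arithmetic (MPC-friendly)
y_approx = y_int * (sw * sx)  # rescale back to float

assert np.allclose(y_approx, W @ x, atol=0.5)  # small quantization error
```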
arXiv Detail & Related papers (2025-08-03T07:52:08Z)
- Guard-GBDT: Efficient Privacy-Preserving Approximated GBDT Training on Vertical Dataset [15.175697228634979]
Guard-GBDT is an innovative framework tailored for efficient and privacy-preserving GBDT training on vertical datasets.
We implement a prototype of Guard-GBDT and extensively evaluate its performance and accuracy on various real-world datasets.
arXiv Detail & Related papers (2025-07-28T10:16:37Z)
- Joint Transmit and Pinching Beamforming for Pinching Antenna Systems (PASS): Optimization-Based or Learning-Based? [89.05848771674773]
A novel pinching antenna system (PASS)-enabled downlink multi-user multiple-input single-output (MISO) framework is proposed.
It consists of multiple waveguides equipped with numerous low-cost pinching antennas (PAs).
The positions of the PAs can be reconfigured to span both large-scale paths and space.
arXiv Detail & Related papers (2025-02-12T18:54:10Z)
- Near-Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Approximation and Communication Efficiency [52.60557300927007]
We present an $\textbf{MA-OSMA}$ algorithm to transform the discrete submodular problem into a continuous optimization.
We also introduce a projection-free $\textbf{MA-OSEA}$ algorithm, which effectively utilizes the KL divergence by mixing a uniform distribution.
Our algorithms significantly improve the $(\frac{1}{1+c})$-approximation provided by the state-of-the-art OSG algorithm.
arXiv Detail & Related papers (2025-02-07T15:57:56Z)
- BiMaCoSR: Binary One-Step Diffusion Model Leveraging Flexible Matrix Compression for Real Super-Resolution [63.777210548110425]
We propose BiMaCoSR, which combines binarization and one-step distillation to obtain extreme compression and acceleration.
BiMaCoSR achieves a 23.8x compression ratio and a 27.4x speedup ratio compared to its FP counterpart.
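Weight binarization itself can be sketched generically. This is XNOR-Net-style per-row scaling, not BiMaCoSR's method: weights become sign bits plus one scale per row, and the mean absolute value is the L2-optimal choice for that scale.

```python
# Generic 1-bit weight binarization with a per-row scale (a sketch,
# not BiMaCoSR): W ≈ alpha * sign(W), where alpha = mean(|row|)
# minimizes the per-row L2 reconstruction error over scalar scales.
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((4, 8))

B = np.sign(W)                                 # 1-bit weights
alpha = np.abs(W).mean(axis=1, keepdims=True)  # optimal per-row scale
W_bin = alpha * B

# Scaling strictly improves on raw sign weights.
assert np.linalg.norm(W - W_bin) < np.linalg.norm(W - B)
```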
arXiv Detail & Related papers (2025-02-01T06:34:55Z)
- Accelerating Private Large Transformers Inference through Fine-grained Collaborative Computation [8.859237832459876]
We present FASTLMPI, a new approach to accelerate private TBM inference through fine-grained optimization.
Specifically, through the fine-grained co-design of homomorphic encryption and secret sharing, FASTLMPI achieves efficient protocols for matrix multiplication, Softmax, LayerNorm, and GeLU.
FASTLMPI shows a remarkable 54% to 64% decrease in runtime and an impressive 72.2% reduction in communication costs.
arXiv Detail & Related papers (2024-12-21T08:33:12Z)
- Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z)
- Ditto: Quantization-aware Secure Inference of Transformers upon MPC [5.161569981377991]
We propose the framework named Ditto to enable more efficient quantization-aware secure Transformer inference.
We conduct extensive experiments on BERT and GPT2 models to evaluate the performance of Ditto.
The results demonstrate that Ditto is about $3.14\sim 4.40\times$ faster than MPCFormer and $1.44\sim 2.35\times$ faster than the state-of-the-art work PUMA.
arXiv Detail & Related papers (2024-05-09T03:28:16Z)
- Improved Communication-Privacy Trade-offs in $L_2$ Mean Estimation under Streaming Differential Privacy [47.997934291881414]
Existing mean estimation schemes are usually optimized for $L_\infty$ geometry and rely on random rotation or Kashin's representation to adapt to $L_2$ geometry.
We introduce a novel privacy accounting method for the sparsified Gaussian mechanism that incorporates the randomness inherent in sparsification into the DP.
Unlike previous approaches, our accounting algorithm directly operates in $L_2$ geometry, yielding MSEs that converge quickly to those of the Gaussian mechanism.
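The mechanism being accounted for can be sketched generically. This is a toy unbiased sparsified Gaussian estimator with made-up parameters, not the paper's accountant: sparsification and Gaussian noise are the two sources of randomness whose combined privacy the accounting method analyzes.

```python
# Toy sparsified Gaussian mechanism (generic sketch, not the paper's
# accountant): keep k of d coordinates at random, rescale by d/k for
# unbiasedness, then add Gaussian noise.
import numpy as np

def sparsified_gaussian(x, k, sigma, rng):
    d = x.shape[0]
    mask = np.zeros(d)
    mask[rng.choice(d, size=k, replace=False)] = 1.0
    # d/k rescaling makes the sparsified report an unbiased estimate of x.
    return (d / k) * mask * x + rng.normal(0.0, sigma, size=d)

rng = np.random.default_rng(4)
x = rng.standard_normal(100)
reports = [sparsified_gaussian(x, k=10, sigma=0.1, rng=rng) for _ in range(5000)]
est = np.mean(reports, axis=0)
assert np.max(np.abs(est - x)) < 0.5  # unbiased: the average tracks x
```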
arXiv Detail & Related papers (2024-05-02T03:48:47Z)
- Communication-Efficient Distributed Learning with Local Immediate Error Compensation [95.6828475028581]
We propose the Local Immediate Error Compensated SGD (LIEC-SGD) optimization algorithm.
LIEC-SGD improves on previous works in either convergence rate or communication cost.
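The error-compensation principle can be sketched with classic error-feedback compression, not LIEC-SGD's specific local/immediate variant: the part of each update dropped by compression is remembered and added back before compressing the next one, so nothing is permanently lost.

```python
# Generic error-feedback sketch (EF-SGD style, not LIEC-SGD itself):
# the compression residual of each step is carried over and added
# back before compressing the next update.
import numpy as np

def topk_compress(v, k):
    # Keep only the k largest-magnitude entries (a common compressor).
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(5)
grad_dim, k = 20, 4
residual = np.zeros(grad_dim)
sent_total = np.zeros(grad_dim)
grad_total = np.zeros(grad_dim)

for _ in range(200):
    g = rng.standard_normal(grad_dim)
    corrected = g + residual           # add back what was dropped before
    sent = topk_compress(corrected, k)
    residual = corrected - sent        # remember what was dropped now
    sent_total += sent
    grad_total += g

# The cumulative transmitted update tracks the cumulative true gradient
# up to exactly one residual's worth of error.
assert np.allclose(sent_total + residual, grad_total)
```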
arXiv Detail & Related papers (2024-02-19T05:59:09Z)
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one billion parameter model, HiRE applied to both the softmax as well as feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
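The two-stage idea can be sketched in NumPy. The shapes, the synthetic near-low-rank weight matrix, and the 5x over-selection factor are assumptions for illustration, not HiRE's actual compression scheme: score the full vocabulary cheaply with a rank-r sketch, then compute exact logits only for a small candidate set.

```python
# Two-stage approximate top-k sketch (hypothetical setup, not HiRE's
# exact scheme): cheap low-rank scoring over the full output, then
# exact computation restricted to an over-selected candidate set.
import numpy as np

rng = np.random.default_rng(6)
d, vocab, r, k = 128, 1000, 16, 10
# Synthetic weight matrix with near-low-rank structure, so a rank-r
# sketch is a good cheap predictor (an assumption of this demo).
W = rng.standard_normal((vocab, r)) @ rng.standard_normal((r, d)) \
    + 0.1 * rng.standard_normal((vocab, d))
x = rng.standard_normal(d)

# Offline: rank-r sketch of W for cheap approximate scoring.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_sketch = (U[:, :r] * s[:r]) @ Vt[:r]

approx = W_sketch @ x                      # cheap pass over full vocab
candidates = np.argsort(approx)[-5 * k:]   # over-select for high recall
exact = W[candidates] @ x                  # exact logits, candidates only

topk_pred = set(candidates[np.argsort(exact)[-k:]])
topk_true = set(np.argsort(W @ x)[-k:])
recall = len(topk_pred & topk_true) / k
assert recall >= 0.8  # high recall on near-low-rank weights
```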
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.