Breaking the Layer Barrier: Remodeling Private Transformer Inference with Hybrid CKKS and MPC
- URL: http://arxiv.org/abs/2508.19525v2
- Date: Mon, 01 Sep 2025 13:24:24 GMT
- Title: Breaking the Layer Barrier: Remodeling Private Transformer Inference with Hybrid CKKS and MPC
- Authors: Tianshi Xu, Wen-jie Lu, Jiangrui Yu, Chen Yi, Chenqi Lin, Runsheng Wang, Meng Li
- Abstract summary: This paper presents an efficient framework for private Transformer inference that combines Homomorphic Encryption (HE) and Secure Multi-party Computation (MPC) to protect data privacy. The proposed framework, dubbed BLB, overcomes the high cost of HE/MPC conversions by breaking layers down into fine-grained operators and fusing adjacent linear operators, reducing the need for such conversions. BLB achieves a $21\times$ reduction in communication overhead compared to BOLT (S&P'24) and a $2\times$ reduction compared to Bumblebee (NDSS'25), along with latency reductions of $13\times$ and $1.8\times$, respectively.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents an efficient framework for private Transformer inference that combines Homomorphic Encryption (HE) and Secure Multi-party Computation (MPC) to protect data privacy. Existing methods often leverage HE for linear layers (e.g., matrix multiplications) and MPC for non-linear layers (e.g., Softmax activation functions), but the conversion between HE and MPC introduces significant communication costs. The proposed framework, dubbed BLB, overcomes this by breaking down layers into fine-grained operators and further fusing adjacent linear operators, reducing the need for HE/MPC conversions. To manage the increased ciphertext bit width from the fused linear operators, BLB proposes the first secure conversion protocol between CKKS and MPC and enables CKKS-based computation of the fused operators. Additionally, BLB proposes an efficient matrix multiplication protocol for fused computation in Transformers. Extensive evaluations on BERT-base, BERT-large, and GPT2-base show that BLB achieves a $21\times$ reduction in communication overhead compared to BOLT (S\&P'24) and a $2\times$ reduction compared to Bumblebee (NDSS'25), along with latency reductions of $13\times$ and $1.8\times$, respectively, when leveraging GPU acceleration.
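The fusion idea at the heart of BLB can be illustrated with plain, unencrypted linear algebra. The following is a toy NumPy sketch, not BLB's protocol: it only shows why fusing adjacent linear operators preserves the result while halving the number of linear stages (and hence HE/MPC boundary crossings), at the cost of a larger-magnitude fused matrix, which is the increased ciphertext bit width the CKKS/MPC conversion protocol is designed to manage.

```python
# Toy illustration in the clear (no encryption): two adjacent linear
# operators W2·(W1·x) are equivalent to one fused operator (W2·W1)·x,
# so a hybrid HE/MPC pipeline needs one linear stage and one HE->MPC
# conversion instead of two. The fused matrix has larger entries,
# which corresponds to the "increased ciphertext bit width" that
# BLB's CKKS<->MPC conversion protocol handles.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
W1 = rng.standard_normal((64, 64))
W2 = rng.standard_normal((64, 64))

unfused = W2 @ (W1 @ x)   # two linear ops -> two HE/MPC boundaries
fused = (W2 @ W1) @ x     # one fused op   -> one HE/MPC boundary

assert np.allclose(unfused, fused)
assert np.abs(W2 @ W1).max() > np.abs(W2).max()  # fused entries grow
```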
Related papers
- LRD-MPC: Efficient MPC Inference through Low-rank Decomposition [11.1852308328843]
Secure Multi-party Computation enables untrusted parties to jointly compute a function without revealing their inputs.
Deep neural networks rely heavily on convolutional and fully connected layers, which require costly matrix multiplications in MPC.
We propose leveraging low-rank decomposition (LRD) for linear layers, replacing one large matrix multiplication with two smaller ones.
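The cost saving behind low-rank decomposition is easy to sketch. This is a generic NumPy illustration with made-up dimensions, not the LRD-MPC protocol itself: MPC matrix-multiplication cost scales with the number of scalar multiplications, and a rank-r factorization replaces d*d of them with 2*d*r per input vector.

```python
# Sketch (not the LRD-MPC protocol): approximate a d x d weight matrix
# by a rank-r factorization W ≈ U_r @ V_r, so one large matmul becomes
# two skinny ones. This wins whenever r < d/2.
import numpy as np

d, r = 512, 32
rng = np.random.default_rng(1)
W = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))  # exactly rank r

U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_r = U[:, :r] * s[:r]   # absorb singular values into the left factor
V_r = Vt[:r, :]

x = rng.standard_normal(d)
assert np.allclose(W @ x, U_r @ (V_r @ x))  # same result, cheaper path

mults_full = d * d      # 262144 multiplications per input vector
mults_lr = 2 * d * r    # 32768 multiplications per input vector
print(mults_full / mults_lr)  # -> 8.0
```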
arXiv Detail & Related papers (2026-02-16T02:11:38Z)
- EcoSpa: Efficient Transformer Training with Coupled Sparsity [79.5008521101473]
Transformers have become the backbone of modern AI, yet their high computational demands pose critical system challenges.
We introduce EcoSpa, an efficient structured sparse training method that jointly evaluates and sparsifies coupled weight matrix pairs.
arXiv Detail & Related papers (2025-11-09T11:23:43Z)
- Efficient and Privacy-Preserving Binary Dot Product via Multi-Party Computation [4.336006969179338]
This paper proposes a novel binary multi-party computation (BiMPC) framework for bitwise operations.
The core of BiMPC is a novel approach called Dot Product via Modular Addition (DoMA), which uses regular and modular additions for efficient binary dot product calculation.
The privacy guarantees of the BiMPC framework are rigorously analyzed, demonstrating its efficiency and scalability in distributed settings.
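For reference, the arithmetic identity a binary dot product protocol targets can be checked in the clear. This is a plaintext sketch with bit-packing, not the secure DoMA protocol: for 0/1 vectors, the dot product equals the popcount of the bitwise AND.

```python
# Plaintext reference (not the secure DoMA protocol): with 0/1 vectors
# packed into machine integers, <a, b> = popcount(a & b).
a_bits = [1, 0, 1, 1, 0, 1, 0, 0]
b_bits = [1, 1, 0, 1, 0, 1, 1, 0]

# Pack bits into integers (bit i = element i).
a = sum(bit << i for i, bit in enumerate(a_bits))
b = sum(bit << i for i, bit in enumerate(b_bits))

dot_packed = bin(a & b).count("1")                      # popcount
dot_naive = sum(x * y for x, y in zip(a_bits, b_bits))  # plain sum
assert dot_packed == dot_naive == 3
```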
arXiv Detail & Related papers (2025-10-18T03:35:42Z)
- Privacy-Preserving Inference for Quantized BERT Models [13.36359444231145]
Quantization offers a promising solution by converting floating-point operations into lower-precision integer computations.
We propose a fine-grained, layer-wise quantization scheme and support 1-bit weight fully connected layers in a secure setting.
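The general mechanism can be sketched with symmetric per-tensor quantization. The bit width and scales here are hypothetical, not the paper's layer-wise scheme: the point is that the matrix multiplication becomes pure integer arithmetic, which secure integer protocols handle natively, with a single float rescale at the end.

```python
# Generic symmetric quantization sketch (not the paper's scheme):
# a float matmul W @ x becomes an integer matmul plus one rescale.
import numpy as np

def quantize(x, bits=8):
    # Map floats to signed integers in [-(2^(b-1)-1), 2^(b-1)-1].
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale).astype(np.int32), scale

rng = np.random.default_rng(2)
W = rng.standard_normal((16, 16)).astype(np.float32)
x = rng.standard_normal(16).astype(np.float32)

Wq, sw = quantize(W)
xq, sx = quantize(x)
y_int = Wq @ xq               # pure integer arithmetic (MPC-friendly)
y_approx = y_int * (sw * sx)  # rescale back to float

assert np.allclose(y_approx, W @ x, atol=0.5)  # small quantization error
```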
arXiv Detail & Related papers (2025-08-03T07:52:08Z)
- Guard-GBDT: Efficient Privacy-Preserving Approximated GBDT Training on Vertical Dataset [15.175697228634979]
Guard-GBDT is an innovative framework tailored for efficient and privacy-preserving GBDT training on vertical datasets.
We implement a prototype of Guard-GBDT and extensively evaluate its performance and accuracy on various real-world datasets.
arXiv Detail & Related papers (2025-07-28T10:16:37Z)
- Joint Transmit and Pinching Beamforming for Pinching Antenna Systems (PASS): Optimization-Based or Learning-Based? [89.05848771674773]
A novel pinching antenna system (PASS)-enabled downlink multi-user multiple-input single-output (MISO) framework is proposed.
It consists of multiple waveguides equipped with numerous low-cost pinching antennas (PAs).
The positions of the PAs can be reconfigured to span both large-scale paths and space.
arXiv Detail & Related papers (2025-02-12T18:54:10Z)
- Near-Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Approximation and Communication Efficiency [52.60557300927007]
We present an $\textbf{MA-OSMA}$ algorithm to transform the discrete submodular problem into a continuous optimization.
We also introduce a projection-free $\textbf{MA-OSEA}$ algorithm, which effectively utilizes the KL divergence by mixing a uniform distribution.
Our algorithms significantly improve the $(\frac{1}{1+c})$-approximation provided by the state-of-the-art OSG algorithm.
arXiv Detail & Related papers (2025-02-07T15:57:56Z)
- BiMaCoSR: Binary One-Step Diffusion Model Leveraging Flexible Matrix Compression for Real Super-Resolution [63.777210548110425]
We propose BiMaCoSR, which combines binarization and one-step distillation to obtain extreme compression and acceleration.
BiMaCoSR achieves a 23.8x compression ratio and a 27.4x speedup ratio compared to its FP counterpart.
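Weight binarization itself can be sketched generically. This is XNOR-Net-style per-row scaling, not BiMaCoSR's method: weights become sign bits plus one scale per row, and the mean absolute value is the L2-optimal choice for that scale.

```python
# Generic 1-bit weight binarization with a per-row scale (a sketch,
# not BiMaCoSR): W ≈ alpha * sign(W), where alpha = mean(|row|)
# minimizes the per-row L2 reconstruction error over scalar scales.
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((4, 8))

B = np.sign(W)                                 # 1-bit weights
alpha = np.abs(W).mean(axis=1, keepdims=True)  # optimal per-row scale
W_bin = alpha * B

# Scaling strictly improves on raw sign weights.
assert np.linalg.norm(W - W_bin) < np.linalg.norm(W - B)
```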
arXiv Detail & Related papers (2025-02-01T06:34:55Z)
- Accelerating Private Large Transformers Inference through Fine-grained Collaborative Computation [8.859237832459876]
We present FASTLMPI, a new approach to accelerate private TBM inference through fine-grained optimization.
Specifically, through the fine-grained co-design of homomorphic encryption and secret sharing, FASTLMPI achieves efficient protocols for matrix multiplication, Softmax, LayerNorm, and GeLU.
FASTLMPI shows a remarkable 54% to 64% decrease in runtime and an impressive 72.2% reduction in communication costs.
arXiv Detail & Related papers (2024-12-21T08:33:12Z)
- Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z)
- Ditto: Quantization-aware Secure Inference of Transformers upon MPC [5.161569981377991]
We propose the framework named Ditto to enable more efficient quantization-aware secure Transformer inference.
We conduct extensive experiments on BERT and GPT2 models to evaluate the performance of Ditto.
The results demonstrate that Ditto is about $3.14\sim 4.40\times$ faster than MPCFormer and $1.44\sim 2.35\times$ faster than the state-of-the-art work PUMA.
arXiv Detail & Related papers (2024-05-09T03:28:16Z)
- Improved Communication-Privacy Trade-offs in $L_2$ Mean Estimation under Streaming Differential Privacy [47.997934291881414]
Existing mean estimation schemes are usually optimized for $L_\infty$ geometry and rely on random rotation or Kashin's representation to adapt to $L_2$ geometry.
We introduce a novel privacy accounting method for the sparsified Gaussian mechanism that incorporates the randomness inherent in sparsification into the DP.
Unlike previous approaches, our accounting algorithm directly operates in $L_2$ geometry, yielding MSEs that converge quickly to those of the Gaussian mechanism.
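The mechanism being accounted for can be sketched generically. This is a toy unbiased sparsified Gaussian estimator with made-up parameters, not the paper's accountant: sparsification and Gaussian noise are the two sources of randomness whose combined privacy the accounting method analyzes.

```python
# Toy sparsified Gaussian mechanism (generic sketch, not the paper's
# accountant): keep k of d coordinates at random, rescale by d/k for
# unbiasedness, then add Gaussian noise.
import numpy as np

def sparsified_gaussian(x, k, sigma, rng):
    d = x.shape[0]
    mask = np.zeros(d)
    mask[rng.choice(d, size=k, replace=False)] = 1.0
    # d/k rescaling makes the sparsified report an unbiased estimate of x.
    return (d / k) * mask * x + rng.normal(0.0, sigma, size=d)

rng = np.random.default_rng(4)
x = rng.standard_normal(100)
reports = [sparsified_gaussian(x, k=10, sigma=0.1, rng=rng) for _ in range(5000)]
est = np.mean(reports, axis=0)
assert np.max(np.abs(est - x)) < 0.5  # unbiased: the average tracks x
```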
arXiv Detail & Related papers (2024-05-02T03:48:47Z)
- Communication-Efficient Distributed Learning with Local Immediate Error Compensation [95.6828475028581]
We propose the Local Immediate Error Compensated SGD (LIEC-SGD) optimization algorithm.
LIEC-SGD improves on previous works in either convergence rate or communication cost.
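The error-compensation principle can be sketched with classic error-feedback compression, not LIEC-SGD's specific local/immediate variant: the part of each update dropped by compression is remembered and added back before compressing the next one, so nothing is permanently lost.

```python
# Generic error-feedback sketch (EF-SGD style, not LIEC-SGD itself):
# the compression residual of each step is carried over and added
# back before compressing the next update.
import numpy as np

def topk_compress(v, k):
    # Keep only the k largest-magnitude entries (a common compressor).
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(5)
grad_dim, k = 20, 4
residual = np.zeros(grad_dim)
sent_total = np.zeros(grad_dim)
grad_total = np.zeros(grad_dim)

for _ in range(200):
    g = rng.standard_normal(grad_dim)
    corrected = g + residual           # add back what was dropped before
    sent = topk_compress(corrected, k)
    residual = corrected - sent        # remember what was dropped now
    sent_total += sent
    grad_total += g

# The cumulative transmitted update tracks the cumulative true gradient
# up to exactly one residual's worth of error.
assert np.allclose(sent_total + residual, grad_total)
```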
arXiv Detail & Related papers (2024-02-19T05:59:09Z)
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one billion parameter model, HiRE applied to both the softmax as well as feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
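The two-stage idea can be sketched in NumPy. The shapes, the synthetic near-low-rank weight matrix, and the 5x over-selection factor are assumptions for illustration, not HiRE's actual compression scheme: score the full vocabulary cheaply with a rank-r sketch, then compute exact logits only for a small candidate set.

```python
# Two-stage approximate top-k sketch (hypothetical setup, not HiRE's
# exact scheme): cheap low-rank scoring over the full output, then
# exact computation restricted to an over-selected candidate set.
import numpy as np

rng = np.random.default_rng(6)
d, vocab, r, k = 128, 1000, 16, 10
# Synthetic weight matrix with near-low-rank structure, so a rank-r
# sketch is a good cheap predictor (an assumption of this demo).
W = rng.standard_normal((vocab, r)) @ rng.standard_normal((r, d)) \
    + 0.1 * rng.standard_normal((vocab, d))
x = rng.standard_normal(d)

# Offline: rank-r sketch of W for cheap approximate scoring.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_sketch = (U[:, :r] * s[:r]) @ Vt[:r]

approx = W_sketch @ x                      # cheap pass over full vocab
candidates = np.argsort(approx)[-5 * k:]   # over-select for high recall
exact = W[candidates] @ x                  # exact logits, candidates only

topk_pred = set(candidates[np.argsort(exact)[-k:]])
topk_true = set(np.argsort(W @ x)[-k:])
recall = len(topk_pred & topk_true) / k
assert recall >= 0.8  # high recall on near-low-rank weights
```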
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.