Privacy-Preserving Inference for Quantized BERT Models
- URL: http://arxiv.org/abs/2508.01636v1
- Date: Sun, 03 Aug 2025 07:52:08 GMT
- Title: Privacy-Preserving Inference for Quantized BERT Models
- Authors: Tianpei Lu, Bingsheng Zhang, Lekun Peng, Bowen Zheng, Lichun Li, Kui Ren
- Abstract summary: Quantization offers a promising solution by converting floating-point operations into lower-precision integer computations. We propose a fine-grained, layer-wise quantization scheme and support 1-bit weight fully connected layers in a secure setting.
- Score: 13.36359444231145
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the increasing deployment of generative machine learning models in privacy-sensitive domains such as healthcare and personalized services, ensuring secure inference has become a critical challenge. Secure multi-party computation (MPC) enables privacy-preserving model inference but suffers from high communication and computation overhead. The main bottleneck lies in the expensive secure evaluation of floating-point operations. Quantization offers a promising solution by converting floating-point operations into lower-precision integer computations, significantly reducing overhead. However, existing MPC-based quantized inference methods either rely on public quantization parameters, posing privacy risks, or suffer from inefficiencies, particularly in handling nonlinear functions such as activations and softmax. In this work, we propose a fine-grained, layer-wise quantization scheme and support 1-bit weight fully connected layers in a secure setting. We design a multi-input lookup table protocol to evaluate softmax efficiently and securely. Furthermore, we use dual secret sharing schemes and perform precision conversions via lookup tables, eliminating truncation overhead entirely. Experimental evaluation on BERT-base models demonstrates that our approach achieves up to $8\times$ speedup compared to Lu \emph{et al}. (NDSS 25), $9\times$ speedup compared to Gupta \emph{et al}. (PETS 24) and $22\times$ speedup compared to Knott \emph{et al}. (NeurIPS 21).
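Two of the ingredients above, per-layer quantization and 1-bit weight fully connected layers, are easy to illustrate in the clear. The NumPy sketch below is a plaintext mock-up of those building blocks only, not the MPC protocol; the symmetric scaling convention is an assumption.

```python
import numpy as np

def quantize_layer(w: np.ndarray, bits: int = 8):
    """Symmetric per-layer quantization: float weights -> integers plus one scale."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    q = np.round(w / scale).astype(np.int32)
    return q, scale

def binarize_layer(w: np.ndarray):
    """1-bit weights: a sign matrix in {-1, +1} plus one scalar scale."""
    scale = np.mean(np.abs(w))
    b = np.where(w >= 0, 1, -1).astype(np.int32)
    return b, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16))
x = rng.normal(size=(2, 16))

q, qs = quantize_layer(w)
print(np.max(np.abs(w - q * qs)))          # int8 reconstruction error (small)

b, bs = binarize_layer(w)
y_bin = (x @ b.T) * bs                      # 1-bit FC: integer matmul, one rescale
print(np.corrcoef((x @ w.T).ravel(), y_bin.ravel())[0, 1])  # high correlation
```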
Related papers
- Dual-Priv Pruning: Efficient Differential Private Fine-Tuning in Multimodal Large Language Models [21.598534853947676]
We propose a framework that employs two complementary pruning mechanisms for Differential Privacy (DP) fine-tuning in MLLMs. Our approach consistently utilizes less memory than standard DP-SGD. To the best of our knowledge, we are the first to explore DP fine-tuning in MLLMs.
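The blurb leaves the pruning mechanisms unspecified, so no attempt is made to reproduce them here; as context, a minimal per-example clip-and-noise DP-SGD step, the baseline whose memory cost the paper targets, looks roughly like this (hyperparameter values are illustrative):

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.0, lr=0.1, rng=None):
    """One DP-SGD update: clip each example's gradient, average, add Gaussian noise."""
    rng = rng or np.random.default_rng()
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(clipped),
                       size=mean_grad.shape)
    return -lr * (mean_grad + noise)  # parameter delta

grads = [np.array([3.0, 4.0]), np.array([0.1, 0.2])]
print(dp_sgd_step(grads, rng=np.random.default_rng(0)))
```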
arXiv Detail & Related papers (2025-06-08T10:33:01Z)
- Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding. PMPD achieves a $1.4\times$-$12.2\times$ speedup in matrix-vector multiplications over fp16 models. Our approach delivers a throughput gain of $3.8\times$-$8.0\times$ over fp16 models and up to $1.54\times$ over uniform quantization approaches.
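The core idea, spending fewer bits on later, more error-tolerant decoding steps, can be sketched as a simple precision schedule; the phase boundaries and bit-widths below are illustrative assumptions, not the paper's schedulers:

```python
def precision_schedule(step: int, prompt_len: int) -> int:
    """Pick a weight bit-width for the current decoding step.
    Early tokens get higher precision; later tokens tolerate lower precision.
    Thresholds are illustrative, not from the paper."""
    generated = step - prompt_len
    if generated < 32:
        return 8   # early phase: quality-critical tokens
    elif generated < 128:
        return 4
    else:
        return 2   # late phase: memory-bound, lowest precision

# A decode loop would select the weights quantized at this bit-width per step:
for step in range(100, 300, 50):
    print(step, precision_schedule(step, prompt_len=100))
```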
arXiv Detail & Related papers (2024-10-17T11:46:33Z)
- Ditto: Quantization-aware Secure Inference of Transformers upon MPC [5.161569981377991]
We propose the framework named Ditto to enable more efficient quantization-aware secure Transformer inference.
We conduct extensive experiments on BERT and GPT-2 models to evaluate the performance of Ditto.
The results demonstrate that Ditto is about $3.14\sim 4.40\times$ faster than MPCFormer and $1.44\sim 2.35\times$ faster than the state-of-the-art work PUMA.
arXiv Detail & Related papers (2024-05-09T03:28:16Z)
- Tiny Deep Ensemble: Uncertainty Estimation in Edge AI Accelerators via Ensembling Normalization Layers with Shared Weights [0.8233872344445676]
In AI-driven systems, uncertainty estimation allows the user to avoid overconfident predictions and achieve functional safety.
We propose Tiny-Deep Ensemble, a low-cost approach for uncertainty estimation on edge devices.
Our method does not compromise accuracy, with an increase in inference accuracy of up to $\sim 1\%$ and a reduction in RMSE of $17.17\%$ on various benchmark datasets.
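The title already names the mechanism: one shared backbone with an ensemble of normalization layers. A minimal NumPy sketch of that layout (the shapes and the layer-norm-style normalization are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 10))                 # shared backbone weights (stored once)
M = 5                                         # ensemble members
gammas = rng.normal(1.0, 0.1, size=(M, 10))   # per-member normalization scale
betas = rng.normal(0.0, 0.1, size=(M, 10))    # per-member normalization shift

def normalize(h, gamma, beta, eps=1e-5):
    mu = h.mean(-1, keepdims=True)
    var = h.var(-1, keepdims=True)
    return gamma * (h - mu) / np.sqrt(var + eps) + beta

x = rng.normal(size=(2, 16))
h = x @ W                                     # shared computation, done once
preds = np.stack([normalize(h, g, b) for g, b in zip(gammas, betas)])
print(preds.mean(0).shape, preds.var(0).mean())  # ensemble mean and disagreement
```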
arXiv Detail & Related papers (2024-05-07T22:54:17Z)
- Improved Communication-Privacy Trade-offs in $L_2$ Mean Estimation under Streaming Differential Privacy [47.997934291881414]
Existing mean estimation schemes are usually optimized for $L_\infty$ geometry and rely on random rotation or Kashin's representation to adapt to $L_2$ geometry.
We introduce a novel privacy accounting method for the sparsified Gaussian mechanism that incorporates the randomness inherent in sparsification into the DP analysis.
Unlike previous approaches, our accounting algorithm directly operates in $L_2$ geometry, yielding MSEs that quickly converge to those of the Gaussian mechanism.
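The mechanism being accounted for, sparsification composed with Gaussian noise, can be written down directly; the unbiased coordinate-sampling rule below is one natural choice and an assumption here:

```python
import numpy as np

def sparsified_gaussian(x: np.ndarray, keep_prob: float, sigma: float, rng=None):
    """Keep each coordinate with probability keep_prob (rescaled so the estimate
    stays unbiased), then add Gaussian noise -- the mechanism whose DP accounting
    the paper analyzes in L2 geometry."""
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) < keep_prob
    sparse = np.where(mask, x / keep_prob, 0.0)   # unbiased sparsification
    return sparse + rng.normal(0.0, sigma, size=x.shape)

x = np.ones(8)
print(sparsified_gaussian(x, keep_prob=0.25, sigma=0.5, rng=np.random.default_rng(0)))
```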
arXiv Detail & Related papers (2024-05-02T03:48:47Z)
- DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
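The blurb does not spell out the binarization itself; one common reading, and the assumption behind this sketch, is approximating each weight matrix as a sum of two scaled sign matrices fitted greedily on the residual:

```python
import numpy as np

def dual_binarize(w: np.ndarray):
    """Approximate w as a1*B1 + a2*B2 with B1, B2 in {-1, +1}.
    Greedy residual fit; DB-LLM's exact fitting rule may differ."""
    b1 = np.sign(w); b1[b1 == 0] = 1
    a1 = np.mean(np.abs(w))
    r = w - a1 * b1                      # residual after the first binary term
    b2 = np.sign(r); b2[b2 == 0] = 1
    a2 = np.mean(np.abs(r))
    return (a1, b1), (a2, b2)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
(a1, b1), (a2, b2) = dual_binarize(w)
approx = a1 * b1 + a2 * b2
print(np.linalg.norm(w - approx) / np.linalg.norm(w))  # relative error well below 1
```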
arXiv Detail & Related papers (2024-02-19T09:04:30Z)
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$, an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
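Component (i) is concrete enough to sketch: score all rows cheaply, keep an oversampled candidate set for high recall, then run exact computation only on that subset. Using an int8 copy of the weights as the cheap scorer is an illustrative choice, not the paper's compression scheme:

```python
import numpy as np

def hire_like_topk(x, W, k, oversample=2):
    """Score all rows cheaply (int8-quantized W), then compute exact logits
    only for the predicted candidates. Oversampling raises recall."""
    scale = np.max(np.abs(W)) / 127.0
    W_int8 = np.round(W / scale).astype(np.int8)
    cheap = (x @ W_int8.T.astype(np.int32)) * scale     # approximate scores
    cand = np.argsort(cheap)[-k * oversample:]          # candidate rows, high recall
    exact = x @ W[cand].T                               # full precision on the subset
    return cand[np.argsort(exact)[-k:]]

rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 64)); x = rng.normal(size=64)
print(hire_like_topk(x, W, k=8))
```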
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to $3.4\times$.
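The word "most" is doing real work here: a small set of outlier weights stays in higher precision while the rest drop to 4-bit. A sketch of that hybrid split (selecting outlier input features by calibration magnitude is an assumption, not QUIK's exact rule):

```python
import numpy as np

def hybrid_split(W: np.ndarray, x_cal: np.ndarray, n_outliers: int = 4, bits: int = 4):
    """W: (d_in, d_out). Keep weight rows for the largest-magnitude input
    features in full precision; quantize the rest to `bits`-bit integers."""
    feat_mag = np.abs(x_cal).mean(axis=0)                 # calibration activations
    out_idx = np.argsort(feat_mag)[-n_outliers:]
    rest_idx = np.setdiff1d(np.arange(W.shape[0]), out_idx)
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(W[rest_idx])) / qmax
    W_q = np.round(W[rest_idx] / scale).astype(np.int8)
    return W_q, scale, W[out_idx], out_idx, rest_idx

def hybrid_matmul(x, W_q, scale, W_out, out_idx, rest_idx):
    # low-bit path (dequantized here for clarity) + small full-precision path
    return x[:, rest_idx] @ (W_q.astype(np.float64) * scale) + x[:, out_idx] @ W_out

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 8)); x = rng.normal(size=(5, 32))
parts = hybrid_split(W, x)
print(np.abs(hybrid_matmul(x, *parts) - x @ W).max())     # small quantization error
```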
arXiv Detail & Related papers (2023-10-13T17:15:05Z) - ScionFL: Efficient and Robust Secure Quantized Aggregation [36.668162197302365]
We introduce ScionFL, the first secure aggregation framework for federated learning that operates efficiently on quantized inputs and simultaneously provides robustness against malicious clients.
We show that with no overhead for clients and moderate overhead for the server, we obtain comparable accuracy for standard FL benchmarks.
arXiv Detail & Related papers (2022-10-13T21:46:55Z)
- Integer-arithmetic-only Certified Robustness for Quantized Neural Networks [14.737638416823772]
A line of work on tackling adversarial examples is certified robustness via randomized smoothing.
Such a mechanism usually uses floating-point arithmetic for calculations in inference.
We show our approach can obtain comparable accuracy and a $4\times$-$5\times$ speedup over certified robust methods that use floating-point arithmetic.
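Randomized smoothing itself is standard: classify many noisy copies of the input and take a majority vote. A minimal floating-point version for reference; the paper's contribution is replacing this float pipeline with integer-only arithmetic:

```python
import numpy as np

def smoothed_predict(classify, x, sigma=0.25, n=100, n_classes=10, rng=None):
    """Majority vote of a base classifier over Gaussian perturbations of x."""
    rng = rng or np.random.default_rng()
    votes = np.zeros(n_classes, dtype=int)
    for _ in range(n):
        votes[classify(x + rng.normal(0.0, sigma, size=x.shape))] += 1
    return int(np.argmax(votes))

# Toy base classifier: nearest of two fixed prototypes.
protos = np.stack([np.zeros(4), np.ones(4)])
clf = lambda z: int(np.argmin(np.linalg.norm(protos - z, axis=1)))
print(smoothed_predict(clf, np.full(4, 0.9), rng=np.random.default_rng(0)))
```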
arXiv Detail & Related papers (2021-08-21T01:15:19Z)
- I-BERT: Integer-only BERT Quantization [78.43819756382103]
We propose I-BERT, a novel quantization scheme for Transformer based models.
I-BERT performs an end-to-end integer-only BERT inference without any floating point calculation.
We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy compared to the full-precision baseline.
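A representative piece of integer-only inference is replacing GELU with a low-order polynomial. The sketch below shows the polynomial approximation in float for clarity; the constants follow I-BERT's published fit as I recall them, and the real kernel additionally folds all scales into integer operations:

```python
import numpy as np
from math import erf

def poly_gelu(x):
    """GELU via a second-order polynomial fit to erf, in the spirit of i-GELU.
    Constants a, b are the I-BERT fit (quoted from memory)."""
    a, b = -0.2888, -1.769
    z = x / np.sqrt(2.0)
    erf_approx = np.sign(z) * (a * (np.minimum(np.abs(z), -b) + b) ** 2 + 1.0)
    return 0.5 * x * (1.0 + erf_approx)

xs = np.linspace(-4.0, 4.0, 9)
exact = 0.5 * xs * (1.0 + np.array([erf(v / np.sqrt(2.0)) for v in xs]))
print(np.max(np.abs(poly_gelu(xs) - exact)))  # small approximation error
```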
arXiv Detail & Related papers (2021-01-05T02:42:58Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)