Related papers: FastFHE: Packing-Scalable and Depthwise-Separable CNN Inference Over FHE

URL: http://arxiv.org/abs/2511.22434v1
Date: Thu, 27 Nov 2025 13:14:42 GMT
Title: FastFHE: Packing-Scalable and Depthwise-Separable CNN Inference Over FHE
Authors: Wenbo Song, Xinxin Fan, Quanliang Jing, Shaoye Luo, Wenqi Wei, Chi Lin, Yunfeng Lu, Ling Liu
Abstract summary: We propose FastFHE to accelerate model inference while retaining high inference accuracy over fully homomorphic encryption. First, we propose a new scalable ciphertext data-packing scheme to reduce time and storage consumption. Third, we derive a BN dot-product fusion matrix that merges the ciphertext convolutional layer with the batch-normalization layer without incurring extra multiplicative depth.
Score: 8.949311128871928
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: As deep learning (DL) penetrates daily life in many domains, keeping DL model inference secure and preserving sample privacy in an encrypted environment has become an urgent and increasingly important issue for security-critical applications. To date, several approaches have been proposed based on the Residue Number System variant of the Cheon-Kim-Kim-Song (RNS-CKKS) scheme. However, they all suffer from high latency, which severely limits their use in real-world tasks. Research on encrypted inference in deep CNNs currently faces three main bottlenecks: i) the time and storage costs of convolution; ii) the time overhead of heavy bootstrapping operations; and iii) the consumption of multiplicative circuit depth. To address these three challenges, we propose FastFHE, an efficient and effective mechanism that accelerates model inference while retaining high inference accuracy over fully homomorphic encryption. Concretely, our work makes four contributions. First, we propose a new scalable ciphertext data-packing scheme to reduce time and storage consumption. Second, we devise a depthwise-separable convolution scheme to reduce the computational load of convolution. Third, we derive a BN dot-product fusion matrix that merges the ciphertext convolutional layer with the batch-normalization layer without incurring extra multiplicative depth. Last but not least, we adopt a low-degree Legendre polynomial to approximate the smooth nonlinear activation function SiLU while keeping the accuracy error before and after encrypted inference tiny. Finally, we conduct multi-faceted experiments to verify the efficiency and effectiveness of the proposed approach.
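The Legendre approximation of SiLU mentioned in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the degree (4), the interval [-4, 4], the least-squares fit on uniformly sampled points, and the helper names `silu`, `fit_silu_legendre` and `max_abs_error`. The point is only that a low-degree polynomial, which a CKKS-style scheme can evaluate with additions and multiplications, tracks SiLU closely on a bounded input range.

```python
import numpy as np
from numpy.polynomial import Legendre


def silu(x):
    """Reference SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def fit_silu_legendre(degree=4, bound=4.0, samples=2001):
    """Least-squares fit of SiLU in the Legendre basis on [-bound, bound].

    The degree and the interval are illustrative assumptions, not the
    paper's settings. Legendre.fit maps the interval onto [-1, 1] itself.
    """
    xs = np.linspace(-bound, bound, samples)
    return Legendre.fit(xs, silu(xs), deg=degree, domain=[-bound, bound])


def max_abs_error(poly, bound=4.0, samples=801):
    """Largest absolute deviation from SiLU on a uniform grid."""
    xs = np.linspace(-bound, bound, samples)
    return float(np.max(np.abs(poly(xs) - silu(xs))))


approx = fit_silu_legendre(degree=4)
err_deg4 = max_abs_error(approx)
err_deg2 = max_abs_error(fit_silu_legendre(degree=2))
err_deg8 = max_abs_error(fit_silu_legendre(degree=8))
print("max |error| by degree:", {2: err_deg2, 4: err_deg4, 8: err_deg8})
```

The approximation is only meaningful inside the fitted interval, so in an encrypted pipeline the activations feeding the polynomial would have to be kept within that range; how the paper handles this is not stated in the abstract.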

Related papers

Volley Revolver: A Novel Matrix-Encoding Method for Privacy-Preserving Deep Learning (Inference++) [0.0]
Homomorphic encryption has emerged as a promising approach for enabling secure machine learning in untrusted environments. In this paper, we propose an improved encoding and computation framework that removes the requirement that a single encrypted ciphertext must fully contain one input image. Our method reformulates the data layout and homomorphic operations to partition high-resolution inputs across multiple ciphertexts.
arXiv Detail & Related papers (2025-12-21T08:40:31Z)
Efficient Decoding Methods for Language Models on Encrypted Data [32.58944595512403]
Homomorphic encryption (HE) enables computation on encrypted data for secure inference. Neural text generation requires decoding methods like argmax and sampling, which are non-polynomial and thus computationally expensive under encryption. We introduce cutmax, an HE-friendly argmax algorithm that reduces cipher operations compared to prior methods, enabling practical greedy decoding under encryption.
arXiv Detail & Related papers (2025-09-10T08:23:14Z)
EARN: Efficient Inference Acceleration for LLM-based Generative Recommendation by Register Tokens [47.60523011706102]
Large Language Model-based generative recommendation (LLMRec) has achieved notable success, but it suffers from high inference latency. We propose EARN, an efficient inference framework that leverages the early layers to compress information into register tokens placed at the input sequence boundaries.
arXiv Detail & Related papers (2025-07-01T12:42:06Z)
MOFHEI: Model Optimizing Framework for Fast and Efficient Homomorphically Encrypted Neural Network Inference [0.8388591755871735]
Homomorphic Encryption (HE) enables us to perform machine learning tasks over encrypted data. We propose MOFHEI, a framework that optimizes the model to make HE-based neural network inference fast and efficient. Our framework achieves up to a 98% pruning ratio on LeNet, eliminating up to 93% of the HE operations required for performing private inference (PI).
arXiv Detail & Related papers (2024-12-10T22:44:54Z)
Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding. PMPD achieves a 1.4-12.2$\times$ speedup in matrix-vector multiplications over fp16 models. Our approach delivers a throughput gain of 3.8-8.0$\times$ over fp16 models and up to 1.54$\times$ over uniform quantization approaches.
arXiv Detail & Related papers (2024-10-17T11:46:33Z)
Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass. In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
Optimized Layerwise Approximation for Efficient Private Inference on Fully Homomorphic Encryption [17.429712940497843]
This study proposes an optimized layerwise approximation (OLA) framework for privacy-preserving deep neural networks. For efficient approximation, we reflect the layerwise accuracy by considering the actual input distribution of each activation function. As a result, the OLA method reduces inference times for the ResNet-20 model and the ResNet-32 model by 3.02 times and 2.82 times, respectively.
arXiv Detail & Related papers (2023-10-16T12:34:47Z)
NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction [79.13750275141139]
This paper proposes a novel and fast self-supervised solution for sparse-view CBCT reconstruction. The desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network. A learning-based encoder entailing hash coding is adopted to help the network capture high-frequency details.
arXiv Detail & Related papers (2022-09-29T04:06:00Z)
Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep. We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
Efficient Pruning for Machine Learning Under Homomorphic Encryption [2.2817485071636376]
Privacy-preserving machine learning (PPML) solutions are gaining widespread popularity. Many rely on homomorphic encryption (HE) that offers confidentiality of the model and the data, but at the cost of large latency and memory requirements. We introduce a framework called HE-PEx that comprises new pruning methods, on top of a packing technique called tile tensors, for reducing the latency and memory of PPML inference.
arXiv Detail & Related papers (2022-07-07T15:49:24Z)
FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks. Current networks often have a large number of parameters and incur heavy computation costs. Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
A Variational Information Bottleneck Based Method to Compress Sequential Networks for Human Action Recognition [9.414818018857316]
We propose a method to effectively compress Recurrent Neural Networks (RNNs) used for Human Action Recognition (HAR). We use a Variational Information Bottleneck (VIB) theory-based pruning approach to limit the information flow through the sequential cells of RNNs to a small subset. We combine our pruning method with a specific group-lasso regularization technique that significantly improves compression. It is shown that our method achieves over 70 times greater compression than the nearest competitor with comparable accuracy for the task of action recognition on UCF11.
arXiv Detail & Related papers (2020-10-03T12:41:51Z)
Faster Secure Data Mining via Distributed Homomorphic Encryption [108.77460689459247]
Homomorphic Encryption (HE) has recently received growing attention for its ability to compute over encrypted data. We propose a novel general distributed HE-based data mining framework as a step towards solving the scaling problem. We verify the efficiency and effectiveness of our new framework by testing it on various data mining algorithms and benchmark datasets.
arXiv Detail & Related papers (2020-06-17T18:14:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.