Related papers: Cost-Effective Optimization and Implementation of the CRT-Paillier Decryption Algorithm for Enhanced Performance

URL: http://arxiv.org/abs/2506.17935v1
Date: Sun, 22 Jun 2025 08:06:36 GMT
Title: Cost-Effective Optimization and Implementation of the CRT-Paillier Decryption Algorithm for Enhanced Performance
Authors: Zhengwu Huang, Ding Deng, Pengyue Sun, Guangfu Sun, Xiaomei Tang
Abstract summary: We propose an eCRT-Paillier decryption algorithm that shortens the decryption computation chain. These two improvements reduce modular multiplications by 50% and judgment operations by 60% in the postprocessing of the CRT-Paillier decryption algorithm. A high-throughput and efficient Paillier accelerator named MESA was implemented on the Xilinx Virtex-7 FPGA for evaluation.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: To address the privacy protection problem in cloud computing, privacy-enhancing techniques such as the Paillier additively homomorphic algorithm are receiving widespread attention. The Paillier algorithm allows addition and scalar multiplication on encrypted data, which can effectively protect privacy. However, its computational efficiency is limited by complex modulo operations caused by the ciphertext expansion that follows encryption. To accelerate decryption, the Chinese Remainder Theorem (CRT) is often used to optimize these modulo operations, which in turn lengthens the decryption computation chain. To address this issue, we propose an eCRT-Paillier decryption algorithm that shortens the decryption computation chain by combining precomputed parameters and eliminating the extra judgment operations introduced by Montgomery modular multiplications. These two improvements reduce modular multiplications by 50% and judgment operations by 60% in the postprocessing of the CRT-Paillier decryption algorithm. Based on these improvements, we propose a highly parallel, fully pipelined architecture that eliminates the stalls caused by multiplier reuse in traditional modular exponentiation. The architecture also simplifies the modular exponentiation units by dividing the exponent into segments and parallelizes data flow by instantiating multiple cores. Finally, a high-throughput and efficient Paillier accelerator named MESA was implemented on the Xilinx Virtex-7 FPGA for evaluation; it completes a decryption with a 2048-bit key within 0.577 ms at a 100 MHz clock frequency. Compared with prior work, MESA improves throughput by a factor of 1.16 to 313.21 under identical conditions and improves area efficiency for LUT, DSP, and FF by factors of 3.32 to 117.55, 1.49 to 1.64, and 2.94 to 9.94, respectively.
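
For orientation, the sketch below shows the textbook CRT-Paillier decryption that the abstract takes as its starting point, next to plain Paillier decryption. It is not the eCRT-Paillier algorithm or the MESA design: it uses no Montgomery multiplication and none of the paper's merged precomputed parameters. The function names, the key layout, the choice g = n + 1, and the toy primes are all assumptions made for illustration, and the key size is far too small for real use.

```python
import math
import secrets


def L(x: int, d: int) -> int:
    """Paillier's L function, (x - 1) / d; the division is exact for the inputs used here."""
    return (x - 1) // d


def keygen(p: int, q: int) -> dict:
    """Build a toy key from two distinct primes (hypothetical layout, g = n + 1 assumed)."""
    n = p * q
    n2 = n * n
    g = n + 1
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    return {
        "p": p, "q": q, "n": n, "n2": n2, "g": g, "lam": lam,
        # Precomputed constant for plain decryption.
        "mu": pow(L(pow(g, lam, n2), n), -1, n),
        # Precomputed constants for CRT decryption.
        "h_p": pow(L(pow(g, p - 1, p * p), p), -1, p),
        "h_q": pow(L(pow(g, q - 1, q * q), q), -1, q),
        "q_inv": pow(q, -1, p),
    }


def encrypt(m: int, key: dict) -> int:
    """c = g^m * r^n mod n^2 for a random r coprime to n."""
    n, n2 = key["n"], key["n2"]
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return pow(key["g"], m, n2) * pow(r, n, n2) % n2


def decrypt_plain(c: int, key: dict) -> int:
    """Plain decryption: one exponentiation modulo n^2."""
    return L(pow(c, key["lam"], key["n2"]), key["n"]) * key["mu"] % key["n"]


def decrypt_crt(c: int, key: dict) -> int:
    """CRT decryption: two half-size exponentiations, then recombination."""
    p, q = key["p"], key["q"]
    m_p = L(pow(c, p - 1, p * p), p) * key["h_p"] % p
    m_q = L(pow(c, q - 1, q * q), q) * key["h_q"] % q
    # Postprocessing (Garner recombination): the step the paper aims to shorten.
    return m_q + q * ((m_p - m_q) * key["q_inv"] % p)


if __name__ == "__main__":
    key = keygen(10007, 10009)  # toy primes, for illustration only
    c = encrypt(1234, key)
    print(decrypt_plain(c, key), decrypt_crt(c, key))
```

The two exponentiations in decrypt_crt work modulo p^2 and q^2 with half-length exponents, which is where the CRT speed-up comes from; the multiplications by h_p, h_q, and q_inv afterwards form the longer postprocessing chain the abstract refers to.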

Related papers

Orthogonal Finetuning Made Scalable [87.49040247077389]
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. We propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance.
arXiv Detail & Related papers (2025-06-24T17:59:49Z)
Efficient Hardware Implementation of Modular Multiplier over GF(2^m) on FPGA [0.10241134756773226]
Elliptic curve cryptography (ECC) has emerged as the dominant public-key protocol. This work presents a hardware implementation of a hybrid multiplication technique for modular multiplication over the binary field GF(2^m). The design optimizes the combination of conventional multiplication (CM) and Karatsuba multiplication (KM) to enhance elliptic curve point multiplication (ECPM). Results show the hybrid technique significantly improves speed, hardware efficiency, and resource utilization for ECC cryptographic systems.
arXiv Detail & Related papers (2025-06-11T07:14:05Z)
Fast correlated decoding of transversal logical algorithms [67.01652927671279]
Quantum error correction (QEC) is required for large-scale computation, but incurs a significant resource overhead. Recent advances have shown that by jointly decoding logical qubits in algorithms composed of logical gates, the number of syndrome extraction rounds can be reduced. Here, we reform the problem of decoding circuits by directly decoding relevant logical operator products as they propagate through the circuit.
arXiv Detail & Related papers (2025-05-19T18:00:00Z)
Learning Adaptive Parallel Reasoning with Language Models [70.1745752819628]
We propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures.
arXiv Detail & Related papers (2025-04-21T22:29:02Z)
Exploiting Unstructured Sparsity in Fully Homomorphic Encrypted DNNs [0.37570612254620583]
Deep neural networks (DNNs) in privacy-sensitive environments are constrained by computational overheads in fully homomorphic encryption (FHE). This paper explores unstructured sparsity in FHE matrix multiplication schemes as a means of reducing this burden while maintaining model accuracy requirements. We demonstrate that sparsity can be exploited in arbitrary matrix multiplication, providing runtime benefits compared to a baseline naive algorithm at all sparsity levels.
arXiv Detail & Related papers (2025-03-12T09:24:31Z)
CIPHERMATCH: Accelerating Homomorphic Encryption-Based String Matching via Memory-Efficient Data Packing and In-Flash Processing [8.114331115730021]
Homomorphic encryption (HE) allows secure computation on encrypted data without revealing the original data. Many cloud computing applications (e.g., DNA read mapping, biometric matching, web search) use exact string matching as a key operation. Prior string matching algorithms that use homomorphic encryption are limited by high computational latency.
arXiv Detail & Related papers (2025-03-12T00:25:58Z)
Leveraging ASIC AI Chips for Homomorphic Encryption [12.209134343914537]
Homomorphic encryption (HE) offers strong privacy guarantees, but it requires substantially more resources than computing on plaintext. Accelerators have emerged to mitigate this latency issue, but with the high cost of ASICs. We show that HE primitives can be converted to AI operators and accelerated on existing ASIC AI accelerators, like TPUs, which are already widely deployed in the cloud.
arXiv Detail & Related papers (2025-01-13T04:08:14Z)
UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.12010207132204]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings. We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm. UIO-LLMs successfully handle long context, such as extending the context window of Llama2-7b-chat from 4K to 100K tokens with minimal 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z)
Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask [74.64216073678617]
AMD performs parallel NAR inference within contiguous blocks of output labels concealed using attention masks. A beam search algorithm is designed to leverage a dynamic fusion of CTC, AR Decoder, and AMD probabilities. Experiments on the LibriSpeech-100hr corpus suggest the tripartite Decoder incorporating the AMD module produces a maximum decoding speed-up ratio of 1.73x.
arXiv Detail & Related papers (2024-06-14T13:42:38Z)
SOCI^+: An Enhanced Toolkit for Secure Outsourced Computation on Integers [50.608828039206365]
We propose SOCI+ which significantly improves the performance of SOCI. SOCI+ employs a novel (2, 2)-threshold Paillier cryptosystem with fast encryption and decryption as its cryptographic primitive. Compared with SOCI, our experimental evaluation shows that SOCI+ is up to 5.4 times more efficient in computation and 40% less in communication overhead.
arXiv Detail & Related papers (2023-09-27T05:19:32Z)
Efficient Additions and Montgomery Reductions of Large Integers for SIMD [2.362288417229025]
This paper presents efficient algorithms for performing Montgomery reductions and additions on integers larger than 512 bits. The new addition algorithm simulates the addition of large integers using a smaller addition, quickly producing the same set of carries. For Montgomery reductions, serial multiplications are replaced with precomputations that can be effectively calculated using SIMD extensions.
arXiv Detail & Related papers (2023-08-31T03:44:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.