Efficient Additions and Montgomery Reductions of Large Integers for SIMD
- URL: http://arxiv.org/abs/2308.16432v1
- Date: Thu, 31 Aug 2023 03:44:49 GMT
- Title: Efficient Additions and Montgomery Reductions of Large Integers for SIMD
- Authors: Pengchang Ren, Reiji Suda, Vorapong Suppakitpaisarn,
- Abstract summary: This paper presents efficient algorithms for performing Montgomery reductions and additions on integers larger than 512 bits.
New addition algorithm simulates the addition of large integers using a smaller addition, quickly producing the same set of carries.
For Montgomery reductions, serial multiplications are replaced with precomputations that can be effectively calculated using SIMD extensions.
- Score: 2.362288417229025
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents efficient algorithms, designed to leverage SIMD for performing Montgomery reductions and additions on integers larger than 512 bits. The existing algorithms encounter inefficiencies when parallelized using SIMD due to extensive dependencies in both operations, particularly noticeable in costly operations like ARM's SVE. To mitigate this problem, a novel addition algorithm is introduced that simulates the addition of large integers using a smaller addition, quickly producing the same set of carries. These carries are then utilized to perform parallel additions on large integers. For Montgomery reductions, serial multiplications are replaced with precomputations that can be effectively calculated using SIMD extensions. Experimental evidence demonstrates that these proposed algorithms substantially enhance the performance of state-of-the-art implementations of several post-quantum cryptography algorithms. Notably, they deliver a 30% speed-up from the latest CTIDH implementation, an 11% speed-up from the latest CSIDH implementation in AVX-512 processors, and a 7% speed-up from Microsoft's standard PQCrypto-SIDH for SIKEp503 on A64FX.
Related papers
- Efficient and Flexible Differet-Radix Montgomery Modular Multiplication for Hardware Implementation [14.516310806294433]
We propose an efficient parallel variant of iterative Montgomery modular multiplication, called DRMMM, that allows the quotient can be computed in multiple iterations.
Based on proposed variant, we also design high-performance hardware implementation architecture for faster operation.
arXiv Detail & Related papers (2024-07-17T16:24:15Z) - A High-Speed Hardware Algorithm for Modulus Operation and its Application in Prime Number Calculation [0.0]
The proposed algorithm use only addition, subtraction, logical, and bit shift operations.
It addresses scalability challenges in cryptographic applications.
The application of this algorithm in prime number calculation up to 500,000 shows its practical utility and performance advantages.
arXiv Detail & Related papers (2024-07-17T13:24:52Z) - BoostCom: Towards Efficient Universal Fully Homomorphic Encryption by Boosting the Word-wise Comparisons [14.399750086329345]
Fully Homomorphic Encryption (FHE) allows for the execution of computations on encrypted data without the need to decrypt it first.
In this paper, we introduce BoostCom, a scheme designed to speed up word-wise comparison operations.
We achieve an end-to-end performance improvement of more than an order of magnitude (11.1x faster) compared to the state-of-the-art CPU-based uFHE systems.
arXiv Detail & Related papers (2024-07-10T02:09:10Z) - SAGA: Synthesis Augmentation with Genetic Algorithms for In-Memory Sequence Optimization [0.0]
MAGIC, or Memristor Aided Logic, is an approach which uses memory circuits which physically perform computation through write operations to memory.
We detail the formation and implementation of these genetic algorithms and evaluate them over a number of open circuit implementations.
Over the 10 benchmark circuits evaluated, these modifications lead to an overall improvement in the efficiency of in-memory circuit evaluation of 128% in the best case and 27.5% on average.
arXiv Detail & Related papers (2024-06-14T03:00:42Z) - Fast, Scalable, Warm-Start Semidefinite Programming with Spectral
Bundling and Sketching [53.91395791840179]
We present Unified Spectral Bundling with Sketching (USBS), a provably correct, fast and scalable algorithm for solving massive SDPs.
USBS provides a 500x speed-up over the state-of-the-art scalable SDP solver on an instance with over 2 billion decision variables.
arXiv Detail & Related papers (2023-12-19T02:27:22Z) - An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z) - Decreasing the Computing Time of Bayesian Optimization using
Generalizable Memory Pruning [56.334116591082896]
We show a wrapper of memory pruning and bounded optimization capable of being used with any surrogate model and acquisition function.
Running BO on high-dimensional or massive data sets becomes intractable due to this time complexity.
All model implementations are run on the MIT Supercloud state-of-the-art computing hardware.
arXiv Detail & Related papers (2023-09-08T14:05:56Z) - Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
Person Re-Identification (ReID) matches pedestrians across disjoint cameras.
Existing ReID methods adopting real-value feature descriptors have achieved high accuracy, but they are low in efficiency due to the slow Euclidean distance computation.
We propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by 0.25$ times.
arXiv Detail & Related papers (2022-07-13T02:44:05Z) - Providing Meaningful Data Summarizations Using Examplar-based Clustering
in Industry 4.0 [67.80123919697971]
We show, that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z) - Iterative Algorithm Induced Deep-Unfolding Neural Networks: Precoding
Design for Multiuser MIMO Systems [59.804810122136345]
We propose a framework for deep-unfolding, where a general form of iterative algorithm induced deep-unfolding neural network (IAIDNN) is developed.
An efficient IAIDNN based on the structure of the classic weighted minimum mean-square error (WMMSE) iterative algorithm is developed.
We show that the proposed IAIDNN efficiently achieves the performance of the iterative WMMSE algorithm with reduced computational complexity.
arXiv Detail & Related papers (2020-06-15T02:57:57Z) - Fast Implementation of Morphological Filtering Using ARM NEON Extension [0.9135092203041721]
We consider speedup potential of morphological image filtering on ARM processors.
We propose fast implementation of erosion and dilation using ARM SIMD extension NEON.
Experiments showed 3 times efficiency increase for final implementation of erosion and dilation.
arXiv Detail & Related papers (2020-02-19T12:55:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.