Related papers: Efficient and Flexible Differet-Radix Montgomery Modular Multiplication for Hardware Implementation

URL: http://arxiv.org/abs/2407.12701v1
Date: Wed, 17 Jul 2024 16:24:15 GMT
Title: Efficient and Flexible Differet-Radix Montgomery Modular Multiplication for Hardware Implementation
Authors: Yuxuan Zhang, Hua Guo, Chen Chen, Yewei Guan, Xiyong Zhang, Zhenyu Guan
Abstract summary: We propose an efficient parallel variant of iterative Montgomery modular multiplication, called DRMMM, that allows the quotient to be computed over multiple iterations. Based on the proposed variant, we also design a high-performance hardware architecture for faster operation.
Score: 14.516310806294433
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Montgomery modular multiplication is widely used in public-key cryptosystems (PKC) and directly affects the efficiency of the systems built on top of it. However, moduli are growing larger as security demands increase, which results in a heavy computing cost. High-performance implementations of Montgomery modular multiplication are urgently needed to keep PKC operations efficient. However, existing high-speed implementations still need a large amount of redundant computation to simplify the intermediate result. Support for redundant representations in Montgomery modular multiplication is extremely limited. In this paper, we propose an efficient parallel variant of iterative Montgomery modular multiplication, called DRMMM, that allows the quotient to be computed over multiple iterations. In this variant, the terms of the intermediate result and the quotient in each iteration are computed in different radices so that the computation of the quotient can be pipelined. Based on the proposed variant, we also design a high-performance hardware architecture for faster operation. In the architecture, the intermediate result in every iteration is represented as three parts to avoid redundant computations. Finally, to support FPGA-based systems, we design operators based on the underlying FPGA architecture for better area-time performance. Implementation and experimental results show that our method reduces the output latency by 38.3% compared with the fastest design on FPGA.
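
For readers unfamiliar with the baseline, the following is a minimal Python sketch of the textbook iterative (radix-2^k) Montgomery modular multiplication that the abstract takes as its starting point. It is not the DRMMM variant and not the authors' hardware design; the function names and the parameters `k` and `num_words` are illustrative assumptions. It assumes an odd modulus `n`, operands smaller than `n`, `n < 2**(k * num_words)`, and Python 3.8 or later (for `pow(n, -1, base)`).

```python
def montgomery_multiply(a, b, n, k, num_words):
    """Return a * b * R^-1 mod n, where R = 2**(k * num_words).

    Textbook iterative Montgomery multiplication (illustrative sketch).
    Assumes n is odd, 0 <= a, b < n, and n < R.
    """
    base = 1 << k
    mask = base - 1
    # n_prime = -n^{-1} mod 2^k; it exists because n is odd.
    n_prime = (-pow(n, -1, base)) % base

    s = 0  # intermediate result, kept below 2n after every iteration
    for i in range(num_words):
        a_i = (a >> (k * i)) & mask        # i-th radix-2^k digit of a
        s += a_i * b                       # accumulate partial product
        q = ((s & mask) * n_prime) & mask  # quotient digit for this iteration
        s = (s + q * n) >> k               # low k bits are zero, so the shift is exact

    # One conditional subtraction brings the result into [0, n).
    if s >= n:
        s -= n
    return s


def modmul(a, b, n, k, num_words):
    """Compute a * b mod n by moving one operand into the Montgomery domain."""
    r = 1 << (k * num_words)
    a_mont = (a * r) % n  # a * R mod n
    # (a * R) * b * R^-1 mod n == a * b mod n
    return montgomery_multiply(a_mont, b, n, k, num_words)
```

In this textbook form, the quotient digit `q` is derived from the intermediate result updated in the same iteration. That dependency appears to be what the abstract refers to when it describes computing the quotient over multiple iterations so that its computation can be pipelined.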

Related papers

LaMoS: Enabling Efficient Large Number Modular Multiplication through SRAM-based CiM Acceleration [16.444656025445713]
We introduce LaMoS, an efficient SRAM-based Computing-in-Memory (CiM) design for large-number modular multiplication. LaMoS achieves a $7.02\times$ speedup and reduces high bit-width scaling costs compared to existing CiM designs.
arXiv Detail & Related papers (2025-11-05T10:20:26Z)
Towards a Functionally Complete and Parameterizable TFHE Processor [3.907410857035328]
TFHE is a fast torus-based fully homomorphic encryption scheme. It provides the fastest bootstrapping operation performance of any FHE scheme. It suffers from a considerably higher computational overhead for the evaluation of homomorphic circuits. We propose an FPGA-based hardware accelerator for the evaluation of homomorphic circuits.
arXiv Detail & Related papers (2025-10-27T16:16:40Z)
Scaling Probabilistic Circuits via Monarch Matrices [109.65822339230853]
Probabilistic Circuits (PCs) are tractable representations of probability distributions. We propose a novel sparse and structured parameterization for the sum blocks in PCs.
arXiv Detail & Related papers (2025-06-14T07:39:15Z)
EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE. Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
A High-Speed Hardware Algorithm for Modulus Operation and its Application in Prime Number Calculation [0.0]
The proposed algorithm uses only addition, subtraction, logical, and bit shift operations. It addresses scalability challenges in cryptographic applications. The application of this algorithm in prime number calculation up to 500,000 shows its practical utility and performance advantages.
arXiv Detail & Related papers (2024-07-17T13:24:52Z)
Fast, Scalable, Energy-Efficient Non-element-wise Matrix Multiplication on FPGA [10.630802853096462]
Modern Neural Network (NN) architectures rely heavily on vast numbers of multiply-accumulate arithmetic operations. This paper proposes a high-throughput, scalable and energy-efficient non-element-wise matrix multiplication unit on FPGAs. Our AMU achieves up to 9x higher throughput and 112x higher energy efficiency over the state-of-the-art solutions for FPGA-based Quantised Neural Network (QNN) accelerators.
arXiv Detail & Related papers (2024-07-02T15:28:10Z)
All-to-all reconfigurability with sparse and higher-order Ising machines [0.0]
We introduce a multiplexed architecture that emulates all-to-all network functionality. We show that running the adaptive parallel tempering algorithm demonstrates competitive algorithmic and prefactor advantages. Scaled magnetic versions of p-bit IMs could lead to orders of magnitude improvements over the state of the art for generic optimization.
arXiv Detail & Related papers (2023-11-21T20:27:02Z)
KyberMat: Efficient Accelerator for Matrix-Vector Polynomial Multiplication in CRYSTALS-Kyber Scheme via NTT and Polyphase Decomposition [20.592217626952507]
CRYSTALS-Kyber (Kyber) is one of the post-quantum cryptography (PQC) key-encapsulation mechanism (KEM) schemes selected during the standardization process. This paper addresses optimization for Kyber architecture with respect to latency and throughput constraints.
arXiv Detail & Related papers (2023-10-06T22:57:25Z)
An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks. The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions. We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
Multiplierless Design of High-Speed Very Large Constant Multiplications [3.5382618288815495]
In cryptographic algorithms, the constants to be multiplied by a variable can be very large due to security requirements. We introduce an electronic design automation tool, called LEIGER, which can automatically generate the realizations of very large constant multiplications.
arXiv Detail & Related papers (2023-09-11T15:35:02Z)
INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient. We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture. We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
High-performance symbolic-numerics via multiple dispatch [52.77024349608834]
Symbolics.jl is an extendable symbolic system which uses dynamic multiple dispatch to change behavior depending on the domain needs. We show that by formalizing a generic API on actions independent of implementation, we can retroactively add optimized data structures to our system. We demonstrate the ability to swap between classical term-rewriting simplifiers and e-graph-based term-rewriting simplifiers.
arXiv Detail & Related papers (2021-05-09T14:22:43Z)
Iterative Algorithm Induced Deep-Unfolding Neural Networks: Precoding Design for Multiuser MIMO Systems [59.804810122136345]
We propose a framework for deep-unfolding, where a general form of iterative algorithm induced deep-unfolding neural network (IAIDNN) is developed. An efficient IAIDNN based on the structure of the classic weighted minimum mean-square error (WMMSE) iterative algorithm is developed. We show that the proposed IAIDNN efficiently achieves the performance of the iterative WMMSE algorithm with reduced computational complexity.
arXiv Detail & Related papers (2020-06-15T02:57:57Z)
Minimal Filtering Algorithms for Convolutional Neural Networks [82.24592140096622]
We develop fully parallel hardware-oriented algorithms for implementing the basic filtering operation for M=3,5,7,9, and 11. A fully parallel hardware implementation of the proposed algorithms in each case gives approximately 30 percent savings in the number of embedded multipliers.
arXiv Detail & Related papers (2020-04-12T13:18:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.