HF-NTT: Hazard-Free Dataflow Accelerator for Number Theoretic Transform
- URL: http://arxiv.org/abs/2410.04805v1
- Date: Mon, 7 Oct 2024 07:31:38 GMT
- Title: HF-NTT: Hazard-Free Dataflow Accelerator for Number Theoretic Transform
- Authors: Xiangchen Meng, Zijun Jiang, Yangdi Lyu,
- Abstract summary: Polynomial multiplication is one of the fundamental operations in many applications, such as fully homomorphic encryption (FHE)
The Numberoretic Transform (NTT) has proven an effective tool in enhancing adaptable multiplication, but a fast method for generating NTT accelerators is lacking.
In this paper, we introduce HF-NTT, a novel NTT accelerator, and introduce a data movement strategy that eliminates the need for bit-reversal operations.
- Score: 2.4578723416255754
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Polynomial multiplication is one of the fundamental operations in many applications, such as fully homomorphic encryption (FHE). However, the computational inefficiency stemming from polynomials with many large-bit coefficients poses a significant challenge for the practical implementation of FHE. The Number Theoretic Transform (NTT) has proven an effective tool in enhancing polynomial multiplication, but a fast and adaptable method for generating NTT accelerators is lacking. In this paper, we introduce HF-NTT, a novel NTT accelerator. HF-NTT efficiently handles polynomials of varying degrees and moduli, allowing for a balance between performance and hardware resources by adjusting the number of Processing Elements (PEs). Meanwhile, we introduce a data movement strategy that eliminates the need for bit-reversal operations, resolves different hazards, and reduces the clock cycles. Furthermore, Our accelerator includes a hardware-friendly modular multiplication design and a configurable PE capable of adapting its data path, resulting in a universal architecture. We synthesized and implemented prototype using Vivado 2022.2, and evaluated it on the Xilinx Virtex-7 FPGA platform. The results demonstrate significant improvements in Area-Time-Product (ATP) and processing speed for different polynomial degrees. In scenarios involving multi-modulus polynomial multiplication, our prototype consistently outperforms other designs in both ATP and latency metrics.
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevailing fundamental backbone networks in the computer vision community.
We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm AdaLog (AdaLog) quantizer.
arXiv Detail & Related papers (2024-07-17T18:38:48Z) - MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for
Accelerating Vision-Language Transformer [66.71930982549028]
Vision-Language Transformers (VLTs) have shown great success recently, but are accompanied by heavy computation costs.
We propose a novel framework named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for accelerating various VLTs.
arXiv Detail & Related papers (2024-03-05T14:13:50Z) - KiD: A Hardware Design Framework Targeting Unified NTT Multiplication for CRYSTALS-Kyber and CRYSTALS-Dilithium on FPGA [1.134327592583549]
Large-degree standalone multiplication is an integral component of post-quantum secure lattice-based cryptographic algorithms like CRYSTALS-Kyber and Dilithium.
In this paper, we aim to develop a unified and shared NTT architecture that can support multiplication for both CRYSTALS-Kyber and Dilithium.
arXiv Detail & Related papers (2023-11-08T10:26:13Z) - KyberMat: Efficient Accelerator for Matrix-Vector Polynomial Multiplication in CRYSTALS-Kyber Scheme via NTT and Polyphase Decomposition [20.592217626952507]
CRYSTAL-Kyber (Kyber) is one of the post-quantum cryptography (PQC) key-encapsulation mechanism (KEM) schemes selected during the standardization process.
This paper addresses optimization for Kyber architecture with respect to latency and throughput constraints.
arXiv Detail & Related papers (2023-10-06T22:57:25Z) - PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA
LUT-based Inference [3.1999570171901786]
We show that by using building blocks, we can achieve the same accuracy using fewer layers of soft logic than by using linear functions.
We demonstrate the effectiveness of this approach in three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and handwritten digit recognition using the MNIST dataset.
arXiv Detail & Related papers (2023-09-05T15:54:09Z) - TPU as Cryptographic Accelerator [13.44836928672667]
Cryptographic schemes like Fully Homomorphic Encryption (FHE) and Zero-Knowledge Proofs (ZKPs) are often hindered by their computational complexity.
This paper explores the potential of leveraging TPUs/NPUs to accelerate cryptographic multiplication, thereby enhancing the performance of FHE and ZKP schemes.
arXiv Detail & Related papers (2023-07-13T04:38:32Z) - Transform Once: Efficient Operator Learning in Frequency Domain [69.74509540521397]
We study deep neural networks designed to harness the structure in frequency domain for efficient learning of long-range correlations in space or time.
This work introduces a blueprint for frequency domain learning through a single transform: transform once (T1)
arXiv Detail & Related papers (2022-11-26T01:56:05Z) - Batch Processing and Data Streaming Fourier-based Convolutional Neural
Network Accelerator [4.7257913147626995]
Decision-making by artificial neural networks with minimal latency is paramount for numerous applications such as navigation, tracking, and real-time machine action systems.
This requires the machine learning hardware to handle multidimensional data with a high throughput.
We demonstrate a non-van Neuman-based machine learning acceleration with a Fourier Convolutional Neural Network (FCNN) accelerator.
arXiv Detail & Related papers (2021-12-23T01:06:17Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional
Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE)
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT)
arXiv Detail & Related papers (2021-06-23T17:51:26Z) - DiffPD: Differentiable Projective Dynamics with Contact [65.88720481593118]
We present DiffPD, an efficient differentiable soft-body simulator with implicit time integration.
We evaluate the performance of DiffPD and observe a speedup of 4-19 times compared to the standard Newton's method in various applications.
arXiv Detail & Related papers (2021-01-15T00:13:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.