HF-NTT: Hazard-Free Dataflow Accelerator for Number Theoretic Transform
- URL: http://arxiv.org/abs/2410.04805v1
- Date: Mon, 7 Oct 2024 07:31:38 GMT
- Title: HF-NTT: Hazard-Free Dataflow Accelerator for Number Theoretic Transform
- Authors: Xiangchen Meng, Zijun Jiang, Yangdi Lyu
- Abstract summary: Polynomial multiplication is one of the fundamental operations in many applications, such as fully homomorphic encryption (FHE).
The Number Theoretic Transform (NTT) has proven an effective tool for enhancing polynomial multiplication, but a fast and adaptable method for generating NTT accelerators is lacking.
In this paper, we introduce HF-NTT, a novel NTT accelerator, together with a data movement strategy that eliminates the need for bit-reversal operations.
- Score: 2.4578723416255754
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Polynomial multiplication is one of the fundamental operations in many applications, such as fully homomorphic encryption (FHE). However, the computational inefficiency stemming from polynomials with many large-bit coefficients poses a significant challenge for the practical implementation of FHE. The Number Theoretic Transform (NTT) has proven an effective tool in enhancing polynomial multiplication, but a fast and adaptable method for generating NTT accelerators is lacking. In this paper, we introduce HF-NTT, a novel NTT accelerator. HF-NTT efficiently handles polynomials of varying degrees and moduli, allowing for a balance between performance and hardware resources by adjusting the number of Processing Elements (PEs). Meanwhile, we introduce a data movement strategy that eliminates the need for bit-reversal operations, resolves different hazards, and reduces clock cycles. Furthermore, our accelerator includes a hardware-friendly modular multiplication design and a configurable PE capable of adapting its data path, resulting in a universal architecture. We synthesized and implemented a prototype using Vivado 2022.2 and evaluated it on the Xilinx Virtex-7 FPGA platform. The results demonstrate significant improvements in Area-Time Product (ATP) and processing speed for different polynomial degrees. In scenarios involving multi-modulus polynomial multiplication, our prototype consistently outperforms other designs in both ATP and latency.
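For context, the sketch below walks through the textbook flow that HF-NTT accelerates: NTT-based multiplication of two polynomials over Z_q, including the bit-reversal permutation that the paper's data movement strategy eliminates. The modulus q = 17, the root of unity, and the plain `%` reduction are illustrative stand-ins, not the paper's parameters or its hardware-friendly modular multiplier.

```python
# Textbook NTT-based polynomial multiplication over Z_q (illustrative only).

def bit_reverse(a):
    """In-place bit-reversal permutation: the reordering HF-NTT's dataflow avoids."""
    n, j = len(a), 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

def ntt(a, q, root):
    """Iterative radix-2 Cooley-Tukey NTT; root is a primitive n-th root of unity mod q."""
    n = len(a)
    bit_reverse(a)
    length = 2
    while length <= n:
        w_len = pow(root, n // length, q)
        for start in range(0, n, length):
            w = 1
            for i in range(start, start + length // 2):
                u, v = a[i], a[i + length // 2] * w % q
                a[i], a[i + length // 2] = (u + v) % q, (u - v) % q
                w = w * w_len % q
        length <<= 1
    return a

def intt(a, q, root):
    """Inverse NTT: run the NTT with root^{-1}, then scale by n^{-1} (q prime)."""
    n = len(a)
    ntt(a, q, pow(root, q - 2, q))
    n_inv = pow(n, q - 2, q)
    return [x * n_inv % q for x in a]

def poly_mul(f, g, q, root):
    """Multiply f and g modulo (x^n - 1) and q; exact if deg(f) + deg(g) < n."""
    pointwise = [x * y % q for x, y in zip(ntt(f[:], q, root), ntt(g[:], q, root))]
    return intt(pointwise, q, root)

q, root = 17, 9                  # 9 is a primitive 8th root of unity mod 17
f = [1, 2, 3, 4, 0, 0, 0, 0]     # 1 + 2x + 3x^2 + 4x^3
g = [5, 6, 7, 8, 0, 0, 0, 0]
print(poly_mul(f, g, q, root))   # [5, 16, 0, 9, 10, 1, 15, 0]
```

Software implementations usually pay for the bit-reversal with an extra permutation pass or awkward addressing; HF-NTT's data movement strategy removes the step entirely while also resolving the hazards the abstract mentions.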
Related papers
- Joint Transmit and Pinching Beamforming for PASS: Optimization-Based or Learning-Based? [89.05848771674773]
A novel pinching antenna system (PASS)-enabled downlink multi-user multiple-input single-output (MISO) framework is proposed.
It consists of multiple waveguides equipped with numerous low-cost antennas, named pinching antennas (PAs).
The positions of the PAs can be reconfigured to span both large-scale paths and the spatial domain.
arXiv Detail & Related papers (2025-02-12T18:54:10Z)
- PolyLUT: Ultra-low Latency Polynomial Inference with Hardware-Aware Structured Pruning [8.791770352147989]
We propose a novel approach to training DNNs for FPGA deployment using multivariate polynomials as the basic building block.
Our method takes advantage of the flexibility offered by soft logic, hiding the evaluation inside the LUTs with minimal overhead.
We demonstrate the effectiveness of PolyLUT on three tasks: network intrusion detection, jet identification at the Large Hadron Collider, and MNIST.
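To make "hiding the evaluation inside the LUTs" concrete, here is a generic sketch of table-based function evaluation, not PolyLUT's actual training pipeline: a multivariate polynomial over low-bit-width inputs is tabulated offline, so inference costs a single lookup. The polynomial, bit-widths, and names below are hypothetical.

```python
from itertools import product

def poly(x0, x1):
    """Made-up multivariate polynomial standing in for a trained PolyLUT neuron."""
    return x0 * x1 + 2 * x0 - x1

BITS = 2  # bits per input; 2 inputs -> a 16-entry table, i.e. one small FPGA LUT

# Offline: evaluate the polynomial for every quantized input combination.
table = {bits: poly(*bits) for bits in product(range(2 ** BITS), repeat=2)}

# Inference: one lookup replaces the whole arithmetic circuit, whatever the degree.
print(table[(3, 1)])  # 8, the same value as poly(3, 1)
```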
arXiv Detail & Related papers (2025-01-14T11:51:57Z)
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE inference that surpasses existing parallelism schemes.
Our results demonstrate up to a 52.4% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevalent backbone networks in the computer vision community.
We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm (AdaLog) quantizer.
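For intuition, below is a minimal fixed-base logarithmic quantizer, the non-uniform family AdaLog belongs to. AdaLog's actual contribution is adapting the logarithm base to each layer's distribution, which this sketch deliberately omits; all parameters are illustrative.

```python
import math

def log_quant(x, n_bits=4, scale=1.0):
    """Map a positive activation to an integer exponent code (base 2 for simplicity)."""
    code = round(-math.log2(max(x, 1e-12) / scale))
    return min(max(code, 0), 2 ** n_bits - 1)

def log_dequant(code, scale=1.0):
    return scale * 2.0 ** (-code)

x = 0.23
c = log_quant(x)            # step sizes shrink toward zero, where small
print(c, log_dequant(c))    # activation values concentrate: prints 2 0.25
```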
arXiv Detail & Related papers (2024-07-17T18:38:48Z)
- KyberMat: Efficient Accelerator for Matrix-Vector Polynomial Multiplication in CRYSTALS-Kyber Scheme via NTT and Polyphase Decomposition [20.592217626952507]
CRYSTALS-Kyber (Kyber) is one of the post-quantum cryptography (PQC) key-encapsulation mechanism (KEM) schemes selected during the standardization process.
This paper addresses optimization of the Kyber architecture with respect to latency and throughput constraints.
arXiv Detail & Related papers (2023-10-06T22:57:25Z)
- PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA LUT-based Inference [3.1999570171901786]
We show that by using polynomial building blocks, we can achieve the same accuracy using fewer layers of soft logic than by using linear functions.
We demonstrate the effectiveness of this approach in three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and handwritten digit recognition using the MNIST dataset.
arXiv Detail & Related papers (2023-09-05T15:54:09Z)
- TPU as Cryptographic Accelerator [13.44836928672667]
Cryptographic schemes like Fully Homomorphic Encryption (FHE) and Zero-Knowledge Proofs (ZKPs) are often hindered by their computational complexity.
This paper explores the potential of leveraging TPUs/NPUs to accelerate cryptographic multiplication, thereby enhancing the performance of FHE and ZKP schemes.
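Why a matrix engine helps is easiest to see in a sketch (my own illustration, not the paper's implementation): a large-integer product decomposes into small limbs whose pairwise products form exactly the dense multiply-accumulate pattern a TPU computes natively. The 8-bit limb width and the operands are arbitrary.

```python
import numpy as np

LIMB = 8  # bits per limb, small enough for low-precision MAC units

def to_limbs(x, n):
    """Split integer x into n little-endian LIMB-bit limbs."""
    return np.array([(x >> (LIMB * i)) & ((1 << LIMB) - 1) for i in range(n)],
                    dtype=np.int64)

def bigint_mul(a, b, n):
    """Schoolbook big-integer multiply expressed as an outer product plus carries."""
    outer = np.outer(to_limbs(a, n), to_limbs(b, n))  # the TPU-friendly step
    conv = np.zeros(2 * n - 1, dtype=np.int64)
    for i in range(n):                                # sum anti-diagonals
        for j in range(n):
            conv[i + j] += outer[i, j]
    result, carry = 0, 0
    for k, c in enumerate(conv):                      # propagate carries
        c = int(c) + carry
        result |= (c & ((1 << LIMB) - 1)) << (LIMB * k)
        carry = c >> LIMB
    return result | (carry << (LIMB * (2 * n - 1)))

a, b = 123456789, 987654321
assert bigint_mul(a, b, 4) == a * b
```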
arXiv Detail & Related papers (2023-07-13T04:38:32Z)
- Transform Once: Efficient Operator Learning in Frequency Domain [69.74509540521397]
We study deep neural networks designed to harness structure in the frequency domain for efficient learning of long-range correlations in space or time.
This work introduces a blueprint for frequency-domain learning through a single transform: transform once (T1).
arXiv Detail & Related papers (2022-11-26T01:56:05Z)
- Batch Processing and Data Streaming Fourier-based Convolutional Neural Network Accelerator [4.7257913147626995]
Decision-making by artificial neural networks with minimal latency is paramount for numerous applications such as navigation, tracking, and real-time machine action systems.
This requires the machine learning hardware to handle multidimensional data at high throughput.
We demonstrate non-von Neumann machine learning acceleration with a Fourier Convolutional Neural Network (FCNN) accelerator.
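The principle such an accelerator exploits is the convolution theorem: convolution in the spatial domain becomes an elementwise product in the frequency domain, so the sliding-window operation reduces to pointwise multiplies between two transforms. The sketch below shows the generic mathematics, not this paper's batch-processing or streaming hardware design.

```python
import numpy as np

def fft_conv2d(image, kernel):
    """Circular 2-D convolution via the convolution theorem."""
    k_pad = np.zeros_like(image)
    k_pad[:kernel.shape[0], :kernel.shape[1]] = kernel  # zero-pad to image size
    spec = np.fft.fft2(image) * np.fft.fft2(k_pad)      # pointwise product
    return np.real(np.fft.ifft2(spec))

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
ker = rng.standard_normal((3, 3))

# Cross-check one output pixel against direct circular convolution.
direct = sum(img[(2 - i) % 8, (3 - j) % 8] * ker[i, j]
             for i in range(3) for j in range(3))
assert np.isclose(fft_conv2d(img, ker)[2, 3], direct)
```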
arXiv Detail & Related papers (2021-12-23T01:06:17Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
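The reduction can be checked numerically in a few lines: a Toeplitz matrix-vector product (the structure relative positional biases have, since entries depend only on i - j) embeds in a circulant matrix of twice the size, whose action is diagonalized by the FFT. The sketch below is a generic illustration, not the paper's attention kernel; sizes and values are arbitrary.

```python
import numpy as np

def toeplitz_matvec(col, row, x):
    """T @ x in O(n log n); col is T's first column, row its first row (row[0] == col[0])."""
    n = len(x)
    # First column of the 2n x 2n circulant matrix that embeds T.
    c = np.concatenate([col, [0.0], row[:0:-1]])
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(np.concatenate([x, np.zeros(n)])))
    return np.real(y[:n])

n = 6
rng = np.random.default_rng(1)
col, row, x = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(n)
row[0] = col[0]

# Dense reference Toeplitz matrix: entry (i, j) depends only on i - j.
T = np.array([[col[i - j] if i >= j else row[j - i] for j in range(n)]
              for i in range(n)])
assert np.allclose(T @ x, toeplitz_matvec(col, row, x))
```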
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- DiffPD: Differentiable Projective Dynamics with Contact [65.88720481593118]
We present DiffPD, an efficient differentiable soft-body simulator with implicit time integration.
We evaluate the performance of DiffPD and observe a speedup of 4-19 times compared to the standard Newton's method in various applications.
arXiv Detail & Related papers (2021-01-15T00:13:33Z)