KiD: A Hardware Design Framework Targeting Unified NTT Multiplication for CRYSTALS-Kyber and CRYSTALS-Dilithium on FPGA
- URL: http://arxiv.org/abs/2311.04581v1
- Date: Wed, 8 Nov 2023 10:26:13 GMT
- Title: KiD: A Hardware Design Framework Targeting Unified NTT Multiplication for CRYSTALS-Kyber and CRYSTALS-Dilithium on FPGA
- Authors: Suraj Mandal, Debapriya Basu Roy
- Abstract summary: Large-degree polynomial multiplication is an integral component of post-quantum secure lattice-based cryptographic algorithms like CRYSTALS-Kyber and Dilithium.
In this paper, we aim to develop a unified and shared NTT architecture that can support polynomial multiplication for both CRYSTALS-Kyber and Dilithium.
- Score: 1.134327592583549
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-degree polynomial multiplication is an integral component of post-quantum secure lattice-based cryptographic algorithms like CRYSTALS-Kyber and Dilithium. The computational complexity of large-degree polynomial multiplication can be reduced significantly through the Number Theoretic Transform (NTT). In this paper, we aim to develop a unified and shared NTT architecture that can support polynomial multiplication for both CRYSTALS-Kyber and Dilithium. More specifically, we propose three different unified architectures for NTT multiplication in CRYSTALS-Kyber and Dilithium with varying numbers of configurable radix-2 butterfly units. Additionally, the developed implementation is coupled with a conflict-free memory mapping scheme that allows the architecture to be fully pipelined. We have validated our implementation on Artix-7, Zynq-7000 and Zynq UltraScale+ FPGAs. Our standalone implementations of NTT multiplication for CRYSTALS-Kyber and Dilithium perform better than existing works, and our unified architecture shows excellent area and timing performance compared to both standalone and existing unified implementations. This architecture can potentially be used for compact and efficient implementations of CRYSTALS-Kyber and Dilithium.
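The paper's contribution is a hardware datapath, but the arithmetic it accelerates is standard and worth pinning down. Below is a minimal Python sketch, not the paper's architecture, of NTT-based multiplication in Z_q[X]/(X^256 + 1) using Dilithium's public parameters (q = 8380417 and the 512th root of unity ζ = 1753 from the Dilithium specification); each inner-loop iteration is one radix-2 Cooley-Tukey or Gentleman-Sande butterfly, the operation that the paper's configurable butterfly units implement. Kyber follows the same pattern with q = 3329 and an incomplete 7-layer NTT, which is what makes a shared butterfly datapath plausible.
```python
# Minimal sketch of negacyclic NTT multiplication with Dilithium parameters.
# This mirrors the reference software algorithm, not the KiD hardware design.
Q = 8380417      # Dilithium modulus (q - 1 is divisible by 2^13)
N = 256          # polynomial degree; ring is Z_q[X]/(X^N + 1)
ZETA = 1753      # primitive 512th root of unity mod Q (Dilithium spec)

def brv8(k: int) -> int:
    """Reverse the low 8 bits of k; twiddles are stored in bit-reversed order."""
    return int(f"{k:08b}"[::-1], 2)

ZETAS = [pow(ZETA, brv8(k), Q) for k in range(N)]

def ntt(a):
    """Forward NTT via radix-2 Cooley-Tukey butterflies (output bit-reversed)."""
    a, k = a[:], 0
    length = N // 2
    while length >= 1:
        for start in range(0, N, 2 * length):
            k += 1
            zeta = ZETAS[k]
            for j in range(start, start + length):
                t = zeta * a[j + length] % Q
                a[j + length] = (a[j] - t) % Q   # butterfly "lower" output
                a[j] = (a[j] + t) % Q            # butterfly "upper" output
        length //= 2
    return a

def invntt(a_hat):
    """Inverse NTT via Gentleman-Sande butterflies, then scale by N^-1."""
    a, k = a_hat[:], N
    length = 1
    while length < N:
        for start in range(0, N, 2 * length):
            k -= 1
            zeta = Q - ZETAS[k]                  # -zeta mod Q
            for j in range(start, start + length):
                t = a[j]
                a[j] = (t + a[j + length]) % Q
                a[j + length] = zeta * (t - a[j + length]) % Q
        length *= 2
    n_inv = pow(N, Q - 2, Q)                     # N^-1 mod Q (Q is prime)
    return [x * n_inv % Q for x in a]

def poly_mul(a, b):
    """a * b in Z_q[X]/(X^N + 1): forward NTTs, pointwise product, inverse NTT."""
    c_hat = [x * y % Q for x, y in zip(ntt(a), ntt(b))]
    return invntt(c_hat)
```
As a sanity check, poly_mul agrees with schoolbook reduction modulo X^N + 1:
```python
import random
random.seed(1)
a = [random.randrange(Q) for _ in range(N)]
b = [random.randrange(Q) for _ in range(N)]
c = [0] * N
for i in range(N):                    # schoolbook negacyclic convolution
    for j in range(N):
        s = 1 if i + j < N else -1    # X^N = -1 in the quotient ring
        c[(i + j) % N] = (c[(i + j) % N] + s * a[i] * b[j]) % Q
assert poly_mul(a, b) == c
```
The NTT reduces this product from O(N^2) coefficient multiplications to O(N log N) butterflies, and the strided a[j] / a[j + length] accesses above are exactly the memory traffic that the paper's conflict-free mapping scheme arranges so the pipeline never stalls on bank conflicts.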
Related papers
- AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation [48.82264764771652]
We introduce AsCAN -- a hybrid architecture, combining both convolutional and transformer blocks.
AsCAN supports a variety of tasks: recognition, segmentation, class-conditional image generation.
We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance.
arXiv Detail & Related papers (2024-11-07T18:43:17Z) - Trinity: A General Purpose FHE Accelerator [17.213234642867537]
We present the first multi-modal FHE accelerator based on a unified architecture, which efficiently supports CKKS, TFHE, and their conversion scheme within a single accelerator.
We propose a novel FHE accelerator named Trinity, which incorporates algorithm optimizations, hardware component reuse, and dynamic workload scheduling.
arXiv Detail & Related papers (2024-10-17T10:02:38Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - HF-NTT: Hazard-Free Dataflow Accelerator for Number Theoretic Transform [2.4578723416255754]
Polynomial multiplication is one of the fundamental operations in many applications, such as fully homomorphic encryption (FHE).
The Number Theoretic Transform (NTT) has proven an effective tool for accelerating polynomial multiplication, but a fast method for generating NTT accelerators is lacking.
In this paper, we introduce HF-NTT, a novel NTT accelerator, and introduce a data movement strategy that eliminates the need for bit-reversal operations.
arXiv Detail & Related papers (2024-10-07T07:31:38Z) - PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler [1.6673953344957533]
We introduce a new polyhedral scheduler, PolyTOPS, that can be adjusted to various scenarios with straightforward, high-level configurations.
PolyTOPS has been used with isl and CLooG as code generators and has been integrated into the MindSpore deep learning compiler.
arXiv Detail & Related papers (2024-01-12T16:11:27Z) - KyberMat: Efficient Accelerator for Matrix-Vector Polynomial Multiplication in CRYSTALS-Kyber Scheme via NTT and Polyphase Decomposition [20.592217626952507]
CRYSTALS-Kyber (Kyber) is one of the post-quantum cryptography (PQC) key-encapsulation mechanism (KEM) schemes selected during the standardization process.
This paper addresses optimization for Kyber architecture with respect to latency and throughput constraints.
arXiv Detail & Related papers (2023-10-06T22:57:25Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Squeezeformer: An Efficient Transformer for Automatic Speech Recognition [99.349598600887]
With its hybrid attention-convolution architecture, Conformer is the de facto backbone model for various downstream speech tasks.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
arXiv Detail & Related papers (2022-06-02T06:06:29Z) - Reconfigurable co-processor architecture with limited numerical precision to accelerate deep convolutional neural networks [0.38848561367220275]
Convolutional Neural Networks (CNNs) are widely used in deep learning applications, e.g., visual systems, robotics, etc.
Here, we present a model-independent reconfigurable co-processing architecture to accelerate CNNs.
In contrast to existing solutions, we introduce limited-precision 32-bit Q-format fixed-point quantization for arithmetic representations and operations (a sketch of Q-format arithmetic appears after this list).
arXiv Detail & Related papers (2021-08-21T09:50:54Z) - Towards Accurate and Compact Architectures via Neural Architecture Transformer [95.4514639013144]
It is necessary to optimize the operations inside an architecture to improve the performance without introducing extra computational cost.
We have proposed a Neural Architecture Transformer (NAT) method which casts the optimization problem into a Markov Decision Process (MDP).
We propose a Neural Architecture Transformer++ (NAT++) method which further enlarges the set of candidate transitions to improve the performance of architecture optimization.
arXiv Detail & Related papers (2021-02-20T09:38:10Z) - PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
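The reconfigurable co-processor entry above names 32-bit Q-format fixed-point arithmetic as its numerical representation. As a minimal illustration of what that representation means, here is a sketch assuming a hypothetical Q16.16 split; the summary does not state the paper's actual integer/fraction allocation.
```python
# Hypothetical Q16.16 illustration of 32-bit Q-format fixed-point arithmetic;
# the co-processor paper's actual integer/fraction split is not given above.
FRAC_BITS = 16                      # fraction bits in Qm.n with m + n = 32

def to_fixed(x: float) -> int:
    """Quantize a float to Q16.16 (round to the nearest representable value)."""
    return int(round(x * (1 << FRAC_BITS)))

def to_float(q: int) -> float:
    """Recover the real value a Q16.16 integer encodes."""
    return q / (1 << FRAC_BITS)

def fixed_mul(a: int, b: int) -> int:
    """Q16.16 multiply: the raw product carries 32 fraction bits, so rescale."""
    return (a * b) >> FRAC_BITS

# Example: a weight-activation product stays within one LSB of the float result.
w, x = to_fixed(0.75), to_fixed(-2.5)
assert abs(to_float(fixed_mul(w, x)) - (0.75 * -2.5)) < 2 ** -FRAC_BITS
```
Trading float32 for such a fixed-point format replaces floating-point units with integer multipliers and shifters, which is the usual source of the area and energy savings such co-processors report.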