KiD: A Hardware Design Framework Targeting Unified NTT Multiplication for CRYSTALS-Kyber and CRYSTALS-Dilithium on FPGA
- URL: http://arxiv.org/abs/2311.04581v1
- Date: Wed, 8 Nov 2023 10:26:13 GMT
- Title: KiD: A Hardware Design Framework Targeting Unified NTT Multiplication for CRYSTALS-Kyber and CRYSTALS-Dilithium on FPGA
- Authors: Suraj Mandal, Debapriya Basu Roy
- Abstract summary: Large-degree polynomial multiplication is an integral component of post-quantum secure lattice-based cryptographic algorithms like CRYSTALS-Kyber and Dilithium.
In this paper, we aim to develop a unified and shared NTT architecture that can support polynomial multiplication for both CRYSTALS-Kyber and Dilithium.
- Score: 1.134327592583549
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-degree polynomial multiplication is an integral component of post-quantum secure lattice-based cryptographic algorithms like CRYSTALS-Kyber and Dilithium. The computational complexity of large-degree polynomial multiplication can be reduced significantly through the Number Theoretic Transform (NTT). In this paper, we aim to develop a unified and shared NTT architecture that can support polynomial multiplication for both CRYSTALS-Kyber and Dilithium. More specifically, we propose three different unified architectures for NTT multiplication in CRYSTALS-Kyber and Dilithium with varying numbers of configurable radix-2 butterfly units. Additionally, the developed implementation is coupled with a conflict-free memory mapping scheme that allows the architecture to be fully pipelined. We have validated our implementation on Artix-7, Zynq-7000 and Zynq UltraScale+ FPGAs. Our standalone implementations of NTT multiplication for CRYSTALS-Kyber and Dilithium perform better than existing works, and our unified architecture shows excellent area and timing performance compared to both standalone and existing unified implementations. This architecture can potentially be used for compact and efficient implementations of CRYSTALS-Kyber and Dilithium.
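The paper's contribution is a hardware datapath, but the arithmetic it accelerates is standard and worth pinning down. Below is a minimal Python sketch, not the paper's architecture, of NTT-based multiplication in Z_q[X]/(X^256 + 1) using Dilithium's public parameters (q = 8380417 and the 512th root of unity ζ = 1753 from the Dilithium specification); each inner-loop iteration is one radix-2 Cooley-Tukey or Gentleman-Sande butterfly, the operation that the paper's configurable butterfly units implement. Kyber follows the same pattern with q = 3329 and an incomplete 7-layer NTT, which is what makes a shared butterfly datapath plausible.
```python
# Minimal sketch of negacyclic NTT multiplication with Dilithium parameters.
# This mirrors the reference software algorithm, not the KiD hardware design.
Q = 8380417      # Dilithium modulus (q - 1 is divisible by 2^13)
N = 256          # polynomial degree; ring is Z_q[X]/(X^N + 1)
ZETA = 1753      # primitive 512th root of unity mod Q (Dilithium spec)

def brv8(k: int) -> int:
    """Reverse the low 8 bits of k; twiddles are stored in bit-reversed order."""
    return int(f"{k:08b}"[::-1], 2)

ZETAS = [pow(ZETA, brv8(k), Q) for k in range(N)]

def ntt(a):
    """Forward NTT via radix-2 Cooley-Tukey butterflies (output bit-reversed)."""
    a, k = a[:], 0
    length = N // 2
    while length >= 1:
        for start in range(0, N, 2 * length):
            k += 1
            zeta = ZETAS[k]
            for j in range(start, start + length):
                t = zeta * a[j + length] % Q
                a[j + length] = (a[j] - t) % Q   # butterfly "lower" output
                a[j] = (a[j] + t) % Q            # butterfly "upper" output
        length //= 2
    return a

def invntt(a_hat):
    """Inverse NTT via Gentleman-Sande butterflies, then scale by N^-1."""
    a, k = a_hat[:], N
    length = 1
    while length < N:
        for start in range(0, N, 2 * length):
            k -= 1
            zeta = Q - ZETAS[k]                  # -zeta mod Q
            for j in range(start, start + length):
                t = a[j]
                a[j] = (t + a[j + length]) % Q
                a[j + length] = zeta * (t - a[j + length]) % Q
        length *= 2
    n_inv = pow(N, Q - 2, Q)                     # N^-1 mod Q (Q is prime)
    return [x * n_inv % Q for x in a]

def poly_mul(a, b):
    """a * b in Z_q[X]/(X^N + 1): forward NTTs, pointwise product, inverse NTT."""
    c_hat = [x * y % Q for x, y in zip(ntt(a), ntt(b))]
    return invntt(c_hat)
```
As a sanity check, poly_mul agrees with schoolbook reduction modulo X^N + 1:
```python
import random
random.seed(1)
a = [random.randrange(Q) for _ in range(N)]
b = [random.randrange(Q) for _ in range(N)]
c = [0] * N
for i in range(N):                    # schoolbook negacyclic convolution
    for j in range(N):
        s = 1 if i + j < N else -1    # X^N = -1 in the quotient ring
        c[(i + j) % N] = (c[(i + j) % N] + s * a[i] * b[j]) % Q
assert poly_mul(a, b) == c
```
The NTT reduces this product from O(N^2) coefficient multiplications to O(N log N) butterflies, and the strided a[j] / a[j + length] accesses above are exactly the memory traffic that the paper's conflict-free mapping scheme arranges so the pipeline never stalls on bank conflicts.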
Related papers
- AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation [48.82264764771652]
We introduce AsCAN -- a hybrid architecture, combining both convolutional and transformer blocks.
AsCAN supports a variety of tasks: recognition, segmentation, class-conditional image generation.
We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance.
arXiv Detail & Related papers (2024-11-07T18:43:17Z) - Trinity: A General Purpose FHE Accelerator [17.213234642867537]
We present the first multi-modal FHE accelerator based on a unified architecture, which efficiently supports CKKS, TFHE, and their conversion scheme within a single accelerator.
We propose a novel FHE accelerator named Trinity, which incorporates algorithm optimizations, hardware component reuse, and dynamic workload scheduling.
arXiv Detail & Related papers (2024-10-17T10:02:38Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - HF-NTT: Hazard-Free Dataflow Accelerator for Number Theoretic Transform [2.4578723416255754]
Polynomial multiplication is one of the fundamental operations in many applications, such as fully homomorphic encryption (FHE).
The Number Theoretic Transform (NTT) has proven an effective tool for accelerating polynomial multiplication, but a fast method for generating NTT accelerators is lacking.
In this paper, we introduce HF-NTT, a novel NTT accelerator, and introduce a data movement strategy that eliminates the need for bit-reversal operations.
arXiv Detail & Related papers (2024-10-07T07:31:38Z) - PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler [1.6673953344957533]
We introduce a new polyhedral scheduler, PolyTOPS, that can be adjusted to various scenarios with straightforward, high-level configurations.
PolyTOPS has been used with isl and CLooG as code generators and has been integrated into the MindSpore deep learning compiler.
arXiv Detail & Related papers (2024-01-12T16:11:27Z) - KyberMat: Efficient Accelerator for Matrix-Vector Polynomial Multiplication in CRYSTALS-Kyber Scheme via NTT and Polyphase Decomposition [20.592217626952507]
CRYSTALS-Kyber (Kyber) is one of the post-quantum cryptography (PQC) key-encapsulation mechanism (KEM) schemes selected during the standardization process.
This paper addresses optimization for Kyber architecture with respect to latency and throughput constraints.
arXiv Detail & Related papers (2023-10-06T22:57:25Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Squeezeformer: An Efficient Transformer for Automatic Speech Recognition [99.349598600887]
With its hybrid attention-convolution architecture, Conformer is the de facto backbone model for various downstream speech tasks.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
arXiv Detail & Related papers (2022-06-02T06:06:29Z) - Reconfigurable co-processor architecture with limited numerical precision to accelerate deep convolutional neural networks [0.38848561367220275]
Convolutional Neural Networks (CNNs) are widely used in deep learning applications, e.g., visual systems, robotics, etc.
Here, we present a model-independent reconfigurable co-processing architecture to accelerate CNNs.
In contrast to existing solutions, we introduce limited-precision 32-bit Q-format fixed-point quantization for arithmetic representations and operations (a sketch of Q-format arithmetic appears after this list).
arXiv Detail & Related papers (2021-08-21T09:50:54Z) - Towards Accurate and Compact Architectures via Neural Architecture Transformer [95.4514639013144]
It is necessary to optimize the operations inside an architecture to improve the performance without introducing extra computational cost.
We have proposed a Neural Architecture Transformer (NAT) method which casts the optimization problem into a Markov Decision Process (MDP).
We propose a Neural Architecture Transformer++ (NAT++) method which further enlarges the set of candidate transitions to improve the performance of architecture optimization.
arXiv Detail & Related papers (2021-02-20T09:38:10Z) - PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
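The reconfigurable co-processor entry above names 32-bit Q-format fixed-point arithmetic as its numerical representation. As a minimal illustration of what that representation means, here is a sketch assuming a hypothetical Q16.16 split; the summary does not state the paper's actual integer/fraction allocation.
```python
# Hypothetical Q16.16 illustration of 32-bit Q-format fixed-point arithmetic;
# the co-processor paper's actual integer/fraction split is not given above.
FRAC_BITS = 16                      # fraction bits in Qm.n with m + n = 32

def to_fixed(x: float) -> int:
    """Quantize a float to Q16.16 (round to the nearest representable value)."""
    return int(round(x * (1 << FRAC_BITS)))

def to_float(q: int) -> float:
    """Recover the real value a Q16.16 integer encodes."""
    return q / (1 << FRAC_BITS)

def fixed_mul(a: int, b: int) -> int:
    """Q16.16 multiply: the raw product carries 32 fraction bits, so rescale."""
    return (a * b) >> FRAC_BITS

# Example: a weight-activation product stays within one LSB of the float result.
w, x = to_fixed(0.75), to_fixed(-2.5)
assert abs(to_float(fixed_mul(w, x)) - (0.75 * -2.5)) < 2 ** -FRAC_BITS
```
Trading float32 for such a fixed-point format replaces floating-point units with integer multipliers and shifters, which is the usual source of the area and energy savings such co-processors report.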