A High Performance and Efficient Post-Quantum Crypto-Processor for FrodoKEM
- URL: http://arxiv.org/abs/2601.16500v1
- Date: Fri, 23 Jan 2026 07:05:42 GMT
- Title: A High Performance and Efficient Post-Quantum Crypto-Processor for FrodoKEM
- Authors: Kai Li, Jiahao Lu, Fu Yao, Guang Zeng, Dongsheng Liu, Shengfei Gu, Zhengpeng Zhao, Jiachen Wang,
- Abstract summary: FrodoKEM is a lattice-based post-quantum key encapsulation mechanism (KEM). It has been considered for standardization by the International Organization for Standardization (ISO). This paper presents a high-performance and efficient crypto-processor for FrodoKEM.
- Score: 24.961829196441887
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: FrodoKEM is a lattice-based post-quantum key encapsulation mechanism (KEM). It has been considered for standardization by the International Organization for Standardization (ISO) due to its robust security profile. However, its hardware implementation exhibits a weakness of high latency and heavy resource burden, hindering its practical application. Moreover, diverse usage scenarios call for comprehensive functionality. To address these challenges, this paper presents a high-performance and efficient crypto-processor for FrodoKEM. A multiple-instruction overlapped execution scheme is introduced to enable efficient multi-module scheduling and minimize operational latency. Furthermore, a high-speed, reconfigurable parallel multiplier array is integrated to handle intensive matrix computations under diverse computation patterns, significantly enhancing hardware efficiency. In addition, a compact memory scheduling strategy shortens the lifespan of intermediate matrices, thereby reducing overall storage requirements. The proposed design provides full support for all FrodoKEM security levels and protocol phases. It consumes 13467 LUTs, 6042 FFs, and 14 BRAMs on an Artix-7 FPGA and achieves the fastest reported execution time. Compared with state-of-the-art hardware implementations, our design improves the area-time product (ATP) by 1.75-2.00 times.
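The "intensive matrix computations" mentioned in the abstract are, in FrodoKEM's public specification, products of the form A·S + E reduced modulo a power-of-two q. The following is a minimal software sketch of that operation only; it does not represent the paper's multiplier array or scheduling. The function names, the toy dimensions, and the uniform small-integer sampling of S and E are illustrative assumptions, not FrodoKEM's actual parameters or error distribution.

```python
import random


def matmul_add_mod(A, S, E, q):
    """Return (A*S + E) mod q for matrices given as lists of rows."""
    n, k, m = len(A), len(S), len(S[0])
    assert all(len(row) == k for row in A), "inner dimensions must match"
    assert len(E) == n and all(len(row) == m for row in E), "E must be n x m"
    B = []
    for i in range(n):
        row = []
        for j in range(m):
            acc = E[i][j]
            for t in range(k):
                acc += A[i][t] * S[t][j]
            # Python's % always yields a value in [0, q) for positive q.
            row.append(acc % q)
        B.append(row)
    return B


def toy_instance(n=8, nbar=2, q=2**15, seed=0):
    """Build a toy (A, S, E, q) instance.

    Assumptions: the sizes are far smaller than any real parameter set,
    A comes from a seeded PRNG instead of a hash-based expansion, and
    S/E are drawn uniformly from [-2, 2] instead of the real distribution.
    """
    rng = random.Random(seed)
    A = [[rng.randrange(q) for _ in range(n)] for _ in range(n)]
    S = [[rng.randint(-2, 2) for _ in range(nbar)] for _ in range(n)]
    E = [[rng.randint(-2, 2) for _ in range(nbar)] for _ in range(n)]
    return A, S, E, q


if __name__ == "__main__":
    A, S, E, q = toy_instance()
    B = matmul_add_mod(A, S, E, q)
    print(len(B), "x", len(B[0]), "result matrix")
```

Because q is a power of two, a hardware design can implement the reduction by truncating the accumulator, which is one reason the multiply-accumulate step dominates the cost.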
Related papers
- ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling [56.88966608455977]
ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters. ZipMoE achieves up to 72.77% inference latency reduction and up to $6.76\times$ higher throughput than the state-of-the-art systems.
arXiv Detail & Related papers (2026-01-29T02:51:59Z) - Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z) - InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation [56.694702609077495]
Long-sequence processing is a critical capability for modern large language models. InfLLM-V2 is a trainable sparse attention framework that seamlessly adapts models from short to long sequences. In experiments, InfLLM-V2 is $4\times$ faster than dense attention while retaining 98.1% and 99.7% of the performance.
arXiv Detail & Related papers (2025-09-29T12:08:33Z) - A Scalable Architecture for Efficient Multi-bit Fully Homomorphic Encryption [1.4174227043241145]
We introduce Taurus, a hardware accelerator designed to enhance the efficiency of multi-bit TFHE computations. Taurus supports ciphertexts up to 10 bits by leveraging novel FFT units and optimizing memory bandwidth through key reuse strategies. Our experiment results demonstrate that Taurus achieves up to 2600x speedup over a CPU, 1200x speedup over a GPU, and up to 7x faster compared to the previous state-of-the-art accelerator.
arXiv Detail & Related papers (2025-09-16T05:00:57Z) - APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration [5.075697428779204]
Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. This is primarily due to the limited support for the GPU Cores, inefficient memory management, and inflexible kernel optimizations. We propose a comprehensive acceleration scheme for arbitrary precision LLMs, namely APT-LLM.
arXiv Detail & Related papers (2025-08-26T14:48:29Z) - EFFACT: A Highly Efficient Full-Stack FHE Acceleration Platform [15.3973190088728]
EFFACT is a highly efficient full-stack FHE acceleration platform with a compiler that provides comprehensive optimizations and vector-friendly hardware. For generality, EFFACT is also equipped with an ISA and a compiler backend that can support several FHE schemes like CKKS, BGV, and BFV.
arXiv Detail & Related papers (2025-04-22T12:01:20Z) - Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores [3.6385567224218556]
Large language models (LLMs) have been widely applied but face challenges in efficient inference.
We introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization.
We implement an arbitrary precision matrix multiplication scheme that decomposes and recovers at the bit level, enabling flexible precision.
arXiv Detail & Related papers (2024-09-26T14:17:58Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR).
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z) - KyberMat: Efficient Accelerator for Matrix-Vector Polynomial Multiplication in CRYSTALS-Kyber Scheme via NTT and Polyphase Decomposition [20.592217626952507]
CRYSTALS-Kyber (Kyber) is one of the post-quantum cryptography (PQC) key-encapsulation mechanism (KEM) schemes selected during the standardization process.
This paper addresses optimization for Kyber architecture with respect to latency and throughput constraints.
arXiv Detail & Related papers (2023-10-06T22:57:25Z) - REED: Chiplet-Based Accelerator for Fully Homomorphic Encryption [4.713756093611972]
We present the first-of-its-kind multi-chiplet-based FHE accelerator REED for overcoming the limitations of prior monolithic designs. Results demonstrate that REED 2.5D microprocessor consumes 96.7 mm$^2$ chip area and 49.4 W average power in 7nm technology.
arXiv Detail & Related papers (2023-08-05T14:04:39Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)