Yet another Improvement of Plantard Arithmetic for Faster Kyber on
Low-end 32-bit IoT Devices
- URL: http://arxiv.org/abs/2309.00440v3
- Date: Sun, 18 Feb 2024 05:49:49 GMT
- Title: Yet another Improvement of Plantard Arithmetic for Faster Kyber on
Low-end 32-bit IoT Devices
- Authors: Junhao Huang, Haosong Zhao, Jipeng Zhang, Wangchen Dai, Lu Zhou, Ray
C.C. Cheung, Cetin Kaya Koc, Donglong Chen
- Abstract summary: We show that the input range of the Plantard multiplication by a constant is at least 2.14 times larger than that of the original design in TCHES2022.
We propose various optimization strategies for NTT/INTT.
Our NTT/INTT implementation shows considerable speedups compared to the state-of-the-art work.
- Score: 14.32828779824487
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents another improved version of Plantard arithmetic that
could speed up Kyber implementations on two low-end 32-bit IoT platforms (ARM
Cortex-M3 and RISC-V) without SIMD extensions. Specifically, we further enlarge
the input range of the Plantard arithmetic without modifying its computation
steps. After tailoring the Plantard arithmetic for Kyber's modulus, we show
that the input range of the Plantard multiplication by a constant is at least
2.14 times larger than that of the original design in TCHES2022. Then, two optimization
techniques for efficient Plantard arithmetic on Cortex-M3 and RISC-V are
presented. We show that the Plantard arithmetic supersedes both Montgomery and
Barrett arithmetic on low-end 32-bit platforms. With the enlarged input range
and the efficient implementation of the Plantard arithmetic on these platforms,
we propose various optimization strategies for NTT/INTT. We minimize or
entirely eliminate the modular reduction of coefficients in NTT/INTT by taking
advantage of the larger input range of the proposed Plantard arithmetic on
low-end 32-bit platforms. Furthermore, we propose two memory optimization
strategies that reduce the stack usage of the speed-version Kyber implementation
by 23.50% to 28.31% compared to its counterpart on Cortex-M4. The proposed
optimizations make the speed-version implementation more feasible on low-end
IoT devices. Thanks to the aforementioned optimizations, our NTT/INTT
implementation shows considerable speedups compared to the state-of-the-art
work. Overall, we demonstrate the applicability of the speed-version Kyber
implementation on memory-constrained IoT platforms and set new speed records
for Kyber on these platforms.
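As a concrete illustration of the arithmetic the abstract refers to, below is a minimal C sketch of signed Plantard multiplication by a constant for Kyber's modulus q = 3329, following the general TCHES2022 formulation that this paper builds on. The parameter choices (l = 16, alpha = 3), the constant precomputation, and the function name plantard_mul_const are assumptions made for illustration; the enlarged input range claimed by the paper is a property of its analysis and is not reproduced by these few lines.

```c
#include <stdint.h>

#define KYBER_Q 3329
/* q^{-1} mod 2^32 for q = 3329; folded into the precomputed constant below. */
#define KYBER_QINV32 0x6BA8F301u

/* Signed Plantard multiplication by a constant (sketch, l = 16, alpha = 3).
 * b_plant is assumed to be precomputed as (b * q^{-1}) mod 2^32 for a fixed
 * constant b, e.g. an NTT twiddle factor. The return value is congruent to
 * a * b * (-2^{-32}) mod q; in an NTT the factor -2^{32} is typically absorbed
 * into the precomputed twiddle table. Exact input/output ranges follow the
 * paper's analysis and are not re-derived here. An arithmetic right shift on
 * signed values is assumed, as is standard on the targeted platforms. */
static inline int16_t plantard_mul_const(int16_t a, uint32_t b_plant)
{
    /* t = a * (b * q^{-1}) mod 2^32, reinterpreted as a signed 32-bit value */
    int32_t t = (int32_t)((uint32_t)(int32_t)a * b_plant);
    /* r = (((t >> 16) + 2^alpha) * q) >> 16, with alpha = 3 since q < 2^12 */
    return (int16_t)((((t >> 16) + 8) * KYBER_Q) >> 16);
}
```

For example, multiplying by the constant 17 (the base root of unity in Kyber's NTT) would use b_plant = 17 * KYBER_QINV32 (mod 2^32) as the precomputed operand; a table of such constants takes the place of the usual Montgomery-domain twiddle factors.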
Related papers
- Orthogonal Finetuning Made Scalable [87.49040247077389]
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. We propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance.
arXiv Detail & Related papers (2025-06-24T17:59:49Z) - Cost-Effective Optimization and Implementation of the CRT-Paillier Decryption Algorithm for Enhanced Performance [0.0]
We propose an eCRT-Paillier decryption algorithm that shortens the decryption computation chain. These improvements reduce modular multiplications by 50% and judgment operations by 60% in the postprocessing of the CRT-Paillier decryption algorithm. A high-throughput and efficient Paillier accelerator named MESA was implemented on the Xilinx Virtex-7 FPGA for evaluation.
arXiv Detail & Related papers (2025-06-22T08:06:36Z) - The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference [0.9954176833299684]
Deep learning (DL) has led to a shift from traditional 64-bit floating point (FP64) computations toward reduced-precision formats. This paper revisits traditional high-performance gemm and describes strategies for adapting it to mixed-precision integer arithmetic.
arXiv Detail & Related papers (2025-06-13T12:40:16Z) - Efficient Hardware Implementation of Modular Multiplier over GF(2^m) on FPGA [0.10241134756773226]
Elliptic curve cryptography (ECC) has emerged as the dominant public-key protocol. This work presents a hardware implementation of a hybrid multiplication technique for modular multiplication over the binary field GF(2^m). The design optimizes the combination of conventional multiplication (CM) and Karatsuba multiplication (KM) to enhance elliptic curve point multiplication (ECPM). Results show the hybrid technique significantly improves speed, hardware efficiency, and resource utilization for ECC cryptographic systems.
arXiv Detail & Related papers (2025-06-11T07:14:05Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework.
MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions.
Our experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z) - Optimizing the Deployment of Tiny Transformers on Low-Power MCUs [12.905978154498499]
This work aims to enable and optimize the flexible, multi-platform deployment of encoder Tiny Transformers on commercial MCUs.
Our framework provides an optimized library of kernels to maximize data reuse and avoid data marshaling operations into the crucial attention block.
We show that our MHSA depth-first tiling scheme reduces the memory peak by up to 6.19x, while the fused-weight attention can reduce the runtime by 1.53x and the number of parameters by 25%.
arXiv Detail & Related papers (2024-04-03T14:14:08Z) - Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z) - KyberMat: Efficient Accelerator for Matrix-Vector Polynomial Multiplication in CRYSTALS-Kyber Scheme via NTT and Polyphase Decomposition [20.592217626952507]
CRYSTALS-Kyber (Kyber) is one of the post-quantum cryptography (PQC) key-encapsulation mechanism (KEM) schemes selected during the standardization process.
This paper addresses optimization for Kyber architecture with respect to latency and throughput constraints.
arXiv Detail & Related papers (2023-10-06T22:57:25Z) - Reduced Precision Floating-Point Optimization for Deep Neural Network
On-Device Learning on MicroControllers [15.37318446043671]
This paper introduces a novel reduced precision optimization technique for On-Device Learning (ODL) primitives on MCU-class devices.
Our approach is more than two orders of magnitude faster than existing ODL software frameworks for single-core MCUs.
arXiv Detail & Related papers (2023-05-30T16:14:16Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures
using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - Practical Conformer: Optimizing size, speed and flops of Conformer for
on-Device and cloud ASR [67.63332492134332]
We design an optimized conformer that is small enough to meet on-device restrictions and has fast inference on TPUs.
Our proposed encoder can double as a strong standalone on-device encoder and as the first part of a high-performance ASR pipeline.
arXiv Detail & Related papers (2023-03-31T23:30:48Z) - 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z) - Easy and Efficient Transformer: Scalable Inference Solution For large
NLP model [14.321889138798072]
This paper introduces a series of ultra-large-scale pre-training model optimization methods.
An inference engine -- Easy and Efficient Transformer (EET) is proposed.
EET achieves a 1.5-15x state-of-the-art speedup varying with context length.
arXiv Detail & Related papers (2021-04-26T11:00:56Z) - Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration [14.958793135751149]
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM).
Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead.
We address a key architectural challenge with structural sparsity: how to provide support for a range of sparsity levels while maintaining high utilization of the hardware.
arXiv Detail & Related papers (2020-09-04T20:17:42Z)