Related papers: Area Efficient Modular Reduction in Hardware for Arbitrary Static Moduli

URL: http://arxiv.org/abs/2308.15079v1
Date: Tue, 29 Aug 2023 07:26:20 GMT
Title: Area Efficient Modular Reduction in Hardware for Arbitrary Static Moduli
Authors: Robin Müller, Willi Meier, Christoph F. Wildfeuer
Abstract summary: We propose a novel approach for computing modular reduction efficiently in hardware for arbitrary static moduli. Our method can be executed in constant time, which is essential for cryptography applications.
Score: 3.217374402111224
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modular reduction is a crucial operation in many post-quantum cryptographic schemes, including the Kyber key exchange method and the Dilithium signature scheme. However, it can be computationally expensive and become a performance bottleneck in hardware implementations. To address this issue, we propose a novel approach for computing modular reduction efficiently in hardware for arbitrary static moduli. Unlike other commonly used methods such as Barrett or Montgomery reduction, the method does not require any multiplications. It does not depend on properties of any particular choice of modulus for good performance and low area consumption. Its major strength lies in its low area consumption, which for Kyber and Dilithium was reduced by 60% compared with optimized Barrett implementations and by up to 90% compared with generic ones. Additionally, it is well suited for parallelization and pipelining and scales linearly in hardware resource consumption with increasing operation width. All operations can be performed in the bit-width of the modulus, rather than the size of the number being reduced. This shortens carry chains and allows for faster clocking. Moreover, our method can be executed in constant time, which is essential for cryptographic applications where timing attacks can be used to obtain information about the secret key.
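For context, the Barrett baseline that the abstract compares against can be sketched in a few lines of Python. This is a minimal illustration of textbook Barrett reduction, not the method proposed in the paper, whose details are not given on this page. The modulus 3329 is Kyber's; the names `Q`, `K`, `M`, and `barrett_reduce` are hypothetical and chosen only for this example.

```python
# Textbook Barrett reduction (the baseline named in the abstract),
# NOT the paper's multiplication-free method.
Q = 3329                  # Kyber modulus
K = Q.bit_length()        # 12 bits
M = (1 << (2 * K)) // Q   # precomputed constant floor(4^K / Q)


def barrett_reduce(x: int) -> int:
    """Return x mod Q for 0 <= x < 4^K using two multiplications and a shift."""
    if not 0 <= x < (1 << (2 * K)):
        raise ValueError("input out of range for this Barrett constant")
    t = (x * M) >> (2 * K)  # estimate of floor(x / Q), too small by at most 1
    r = x - t * Q           # therefore 0 <= r < 2*Q
    # Branch-free final correction, similar in spirit to a hardware mux.
    # Python integers give no real timing guarantees; this is illustrative only.
    mask = -int(r >= Q)     # -1 (all ones) if r >= Q, else 0
    return r - (Q & mask)


if __name__ == "__main__":
    product = 3328 * 3328   # largest product of two reduced Kyber coefficients
    print(barrett_reduce(product), product % Q)
```

The two multiplications by `M` and `Q` operate on values wider than the modulus, which is the cost the abstract says the proposed method avoids.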

Related papers

LaMoS: Enabling Efficient Large Number Modular Multiplication through SRAM-based CiM Acceleration [16.444656025445713]
We introduce LaMoS, an efficient SRAM-based Computing-in-Memory (CiM) design for large-number modular multiplication. LaMoS achieves a $7.02\times$ speedup and reduces high bit-width scaling costs compared to existing CiM designs.
arXiv Detail & Related papers (2025-11-05T10:20:26Z)
Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression [57.54335545892155]
We introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook. Our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines.
arXiv Detail & Related papers (2025-10-23T20:19:48Z)
A Modular, Adaptive, and Scalable Quantum Factoring Algorithm [0.5729426778193398]
Shor's algorithm for integer factorization offers an exponential speedup over classical methods. It remains impractical on Noisy Intermediate Scale Quantum (NISQ) hardware due to the need for many coherent qubits and very deep circuits. We have developed a modular, windowed formulation of Shor's algorithm that mitigates these limitations.
arXiv Detail & Related papers (2025-09-05T11:21:10Z)
Orthogonal Finetuning Made Scalable [87.49040247077389]
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. We propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance.
arXiv Detail & Related papers (2025-06-24T17:59:49Z)
CommVQ: Commutative Vector Quantization for KV Cache Compression [50.37946553931796]
We propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook.
arXiv Detail & Related papers (2025-06-23T17:50:11Z)
ALLMod: Exploring Area-Efficiency of LUT-based Large Number Modular Reduction via Hybrid Workloads [18.634794494170617]
High-bit-width operations are crucial for enhancing security. They are computationally intensive due to the large number of modular operations required. ALLMod is a novel approach that improves the area efficiency of LUT-based large-number modular reduction.
arXiv Detail & Related papers (2025-03-20T07:47:34Z)
Leveraging ASIC AI Chips for Homomorphic Encryption [12.209134343914537]
Homomorphic encryption (HE) offers a strong privacy guarantee, but it requires substantially more resources than computing on plaintext. Accelerators have emerged to mitigate this latency issue, but at the high cost of ASICs. We show that HE primitives can be converted to AI operators and accelerated on existing ASIC AI accelerators, like TPUs, which are already widely deployed in the cloud.
arXiv Detail & Related papers (2025-01-13T04:08:14Z)
gECC: A GPU-based high-throughput framework for Elliptic Curve Cryptography [15.39096542261856]
Elliptic Curve Cryptography (ECC) is an encryption method that provides security comparable to traditional techniques like Rivest-Shamir-Adleman (RSA). ECC is still hindered by the significant performance overhead associated with elliptic curve (EC) operations. This paper presents gECC, a versatile framework for ECC optimized for GPU architectures.
arXiv Detail & Related papers (2024-12-22T01:50:50Z)
BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments [53.71158537264695]
Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. We introduce BitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance.
arXiv Detail & Related papers (2024-10-31T13:26:11Z)
Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores [3.6385567224218556]
Large language models (LLMs) have been widely applied but face challenges in efficient inference. We introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization. We implement an arbitrary precision matrix multiplication scheme that decomposes and recovers at the bit level, enabling flexible precision.
arXiv Detail & Related papers (2024-09-26T14:17:58Z)
Efficient and Flexible Differet-Radix Montgomery Modular Multiplication for Hardware Implementation [14.516310806294433]
We propose an efficient parallel variant of iterative Montgomery modular multiplication, called DRMMM, that allows the quotient to be computed in multiple iterations. Based on the proposed variant, we also design a high-performance hardware implementation architecture for faster operation.
arXiv Detail & Related papers (2024-07-17T16:24:15Z)
Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models. HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks. A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
Accurate Block Quantization in LLMs with Outliers [0.6138671548064355]
The demand for inference on extremely large-scale LLMs has seen enormous growth in recent months. The problem is aggravated by the explosive rise in the lengths of the sequences being processed. Various quantization techniques have been proposed that allow accurate quantization for both weights and activations.
arXiv Detail & Related papers (2024-03-29T12:15:06Z)
ModSRAM: Algorithm-Hardware Co-Design for Large Number Modular Multiplication in SRAM [7.949839381468341]
Elliptic curve cryptography (ECC) is widely used in security applications such as public key cryptography (PKC) and zero-knowledge proofs (ZKP).
arXiv Detail & Related papers (2024-02-21T22:26:44Z)
ReLU and Addition-based Gated RNN [1.484528358552186]
We replace the multiplication and sigmoid function of the conventional recurrent gate with addition and ReLU activation. This mechanism is designed to maintain long-term memory for sequence processing but at a reduced computational cost.
arXiv Detail & Related papers (2023-08-10T15:18:16Z)
Constant Memory Attention Block [74.38724530521277]
Constant Memory Attention Block (CMAB) is a novel general-purpose attention block that computes its output in constant memory and performs updates in constant computation. We show our proposed methods achieve results competitive with state-of-the-art while being significantly more memory efficient.
arXiv Detail & Related papers (2023-06-21T22:41:58Z)
DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware. Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
Modular decoding: parallelizable real-time decoding for quantum computers [55.41644538483948]
Real-time quantum computation will require decoding algorithms capable of extracting logical outcomes from a stream of data generated by noisy quantum hardware. We propose modular decoding, an approach capable of addressing this challenge with minimal additional communication and without sacrificing decoding accuracy. We introduce the edge-vertex decomposition, a concrete instance of modular decoding for lattice-surgery style fault-tolerant blocks.
arXiv Detail & Related papers (2023-03-08T19:26:10Z)
Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers. Mesa uses exact activations during the forward pass while storing a low-precision version of activations to reduce memory consumption during training. Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.