@NTT: Algorithm-Targeted NTT hardware acceleration via Design-Time Constant Optimization
- URL: http://arxiv.org/abs/2601.17806v1
- Date: Sun, 25 Jan 2026 11:48:24 GMT
- Title: @NTT: Algorithm-Targeted NTT hardware acceleration via Design-Time Constant Optimization
- Authors: Mohammed Nabeel, Mahmoud Hafez, Michail Maniatakos
- Abstract summary: @NTT exploits the fact that the ring parameters in these algorithms are fixed, enabling design-time constant optimization. Our case study on the Dilithium NTT, implemented using the TSMC 28 nm library, operates at a clock frequency of 1.0 GHz. On FPGA, the design achieves a throughput-per-LUT that is 5.2x higher than the state-of-the-art implementation.
- Score: 4.080796345570048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Number Theoretic Transform (NTT) is a critical computational bottleneck in many lattice-based post-quantum cryptographic (PQC) algorithms. By leveraging the Fast Fourier Transform (FFT) algorithm, the NTT of a polynomial of degree N - 1 can be computed with a time complexity of O(N log N). Hardware implementations of NTT are generally preferred over software ones, as the latter are significantly slower due to complex memory access patterns and modular arithmetic operations. Achieving maximum throughput in hardware, however, typically demands a prohibitively large number of butterfly unit instantiations. In this work, we propose @NTT, which exploits the fact that the ring parameters in these algorithms are fixed, enabling design-time constant optimization and achieving the maximum throughput of N-point NTT per clock cycle with a compact hardware footprint. Our case study on the Dilithium NTT, implemented using the TSMC 28 nm library, operates at a clock frequency of 1.0 GHz with an area of 1.45 mm^2. On FPGA, the design achieves a throughput-per-LUT that is 5.2x higher than the state-of-the-art implementation.
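The O(N log N) NTT computation the abstract describes can be sketched in software. The constants below (modulus q = 2^23 - 2^13 + 1 = 8380417, N = 256, and the 512th root of unity 1753) are the fixed Dilithium ring parameters the paper exploits at design time; the recursive Cooley-Tukey structure is a minimal illustration, not the paper's hardware architecture.

```python
# Illustrative radix-2 Cooley-Tukey NTT over the Dilithium ring parameters.
# Because q and the root of unity are fixed constants of the algorithm,
# a hardware NTT can bake them into the datapath at design time.
Q = 8380417                  # Dilithium modulus, 2^23 - 2^13 + 1
N = 256                      # Dilithium polynomial length
ROOT = pow(1753, 2, Q)       # square of the 512th root -> primitive 256th
                             # root of unity mod Q, for a cyclic NTT

def ntt(a, root=ROOT, q=Q):
    """O(N log N) decimation-in-time NTT of list a (len(a) a power of 2)."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], root * root % q, q)
    odd = ntt(a[1::2], root * root % q, q)
    out, w = [0] * n, 1
    for k in range(n // 2):          # one "butterfly" per iteration
        t = w * odd[k] % q
        out[k] = (even[k] + t) % q
        out[k + n // 2] = (even[k] - t) % q
        w = w * root % q
    return out

def intt(a, q=Q):
    """Inverse NTT: forward transform with root^-1, then scale by n^-1."""
    inv_n = pow(len(a), -1, q)
    return [x * inv_n % q for x in ntt(a, pow(ROOT, -1, q), q)]
```

With these functions, multiplying two polynomials reduces to two forward NTTs, a pointwise product, and one inverse NTT, which is where the O(N log N) advantage over schoolbook multiplication comes from.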
Related papers
- Fast attention mechanisms: a tale of parallelism [52.7657529272906]
We introduce an efficient attention mechanism called Approximate Nearest Neighbor Attention (ANNA) with sub-quadratic time complexity. We prove that ANNA-transformers retain the expressive power previously established for standard attention in terms of matching the capabilities of MPC algorithms.
arXiv Detail & Related papers (2025-09-10T20:59:44Z) - SCE-NTT: A Hardware Accelerator for Number Theoretic Transform Using Superconductor Electronics [12.616265554244313]
This research explores the use of superconductor electronics (SCE) for accelerating fully homomorphic encryption (FHE). We present SCE-NTT, a dedicated hardware accelerator based on superconductive single flux quantum (SFQ) logic and memory. We show the NTT-128 unit achieves 531 million NTT/sec at 34 GHz, over 100x faster than state-of-the-art CMOS equivalents.
arXiv Detail & Related papers (2025-08-28T23:37:51Z) - Generalized tensor transforms and their applications in classical and quantum computing [0.0]
We introduce a novel framework for Generalized Tensor Transforms (GTTs), constructed through an $n$-fold tensor product of an arbitrary $b \times b$ unitary matrix $W$. For quantum applications, our GTT-based algorithm achieves both gate complexity and circuit depth of $O(\log_b N)$, where $N = b^n$ denotes the length of the input vector. We explore diverse applications of GTTs in quantum computing, including quantum state compression and transmission, function encoding, and quantum digital signal processing.
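The $n$-fold tensor product structure above can be sketched classically: applying $W^{\otimes n}$ to a length-$b^n$ vector never requires forming the full matrix, since $W$ can be contracted along one tensor axis at a time. This is a generic illustration of the tensor-product trick, not the paper's algorithm; the function name is ours.

```python
import numpy as np

def tensor_transform(W, x):
    """Apply the n-fold tensor product W (x) W (x) ... (x) W to a vector x
    of length N = b^n without forming the N x N matrix: contract W along
    each of the n tensor axes in turn (O(N) work per axis, n axes total)."""
    b = W.shape[0]
    n = round(np.log(len(x)) / np.log(b))
    assert b ** n == len(x), "input length must be a power of b"
    y = np.asarray(x, dtype=complex).reshape([b] * n)
    for axis in range(n):
        y = np.tensordot(W, y, axes=([1], [axis]))  # apply W along one axis
        y = np.moveaxis(y, 0, axis)                 # restore axis ordering
    return y.reshape(-1)
```

For $b = 2$ and $W$ the 2x2 Hadamard matrix, this reduces to the Walsh-Hadamard transform, which is the special case the GTT framework generalizes.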
arXiv Detail & Related papers (2025-07-03T08:28:00Z) - GDNTT: an Area-Efficient Parallel NTT Accelerator Using Glitch-Driven Near-Memory Computing and Reconfigurable 10T SRAM [14.319119105134309]
This paper proposes an area-efficient, highly parallel NTT accelerator with glitch-driven near-memory computing (GDNTT). The design integrates a 10T SRAM for data storage, enabling flexible row/column data access and streamlining circuit mapping strategies. Evaluation results show that the proposed NTT accelerator achieves a 1.528x improvement in throughput-per-area compared to the state-of-the-art.
arXiv Detail & Related papers (2025-05-13T01:53:07Z) - A Unified Hardware Accelerator for Fast Fourier Transform and Number Theoretic Transform [0.0]
The Number Theoretic Transform (NTT) is an indispensable tool for computing efficient multiplications in post-quantum lattice-based cryptography. We demonstrate a unified hardware accelerator supporting both 512-point complex FFT and 256-point NTT. Our implementation achieves performance comparable to state-of-the-art ML-KEM / ML-DSA NTT accelerators on FPGA.
arXiv Detail & Related papers (2025-04-15T12:13:05Z) - HF-NTT: Hazard-Free Dataflow Accelerator for Number Theoretic Transform [2.4578723416255754]
Polynomial multiplication is one of the fundamental operations in many applications, such as fully homomorphic encryption (FHE).
The Number Theoretic Transform (NTT) has proven to be an effective tool for accelerating polynomial multiplication, but a fast method for generating NTT accelerators is lacking.
In this paper, we introduce HF-NTT, a novel NTT accelerator, together with a data movement strategy that eliminates the need for bit-reversal operations.
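The bit-reversal permutation that HF-NTT's dataflow avoids can be sketched as follows: a standard in-place radix-2 NTT either consumes or produces its coefficients in an order where index i is mapped to the value of i with its log2(n) bits reversed, and undoing this reshuffle is a known source of memory hazards in hardware.

```python
def bit_reverse_indices(n):
    """Index permutation a standard in-place radix-2 NTT applies to its
    data: element i maps to the index obtained by reversing the log2(n)
    bits of i (n must be a power of two)."""
    bits = n.bit_length() - 1
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]
```

For n = 8 the permutation is [0, 4, 2, 6, 1, 5, 3, 7]; note it is its own inverse, which is why in-place implementations can apply it with pairwise swaps.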
arXiv Detail & Related papers (2024-10-07T07:31:38Z) - Quantum Circuit Optimization with AlphaTensor [47.9303833600197]
We develop AlphaTensor-Quantum, a method to minimize the number of T gates that are needed to implement a given circuit.
Unlike existing methods for T-count optimization, AlphaTensor-Quantum can incorporate domain-specific knowledge about quantum computation and leverage gadgets.
Remarkably, it discovers an efficient algorithm akin to Karatsuba's method for multiplication in finite fields.
arXiv Detail & Related papers (2024-02-22T09:20:54Z) - End-to-end codesign of Hessian-aware quantized neural networks for FPGAs and ASICs [49.358119307844035]
We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs).
This makes efficient NN implementations in hardware accessible to nonexperts, in a single open-sourced workflow.
We demonstrate the workflow in a particle physics application involving trigger decisions that must operate at the 40 MHz collision rate of the Large Hadron Collider (LHC).
We implement an optimized mixed-precision NN for high-momentum particle jets in simulated LHC proton-proton collisions.
arXiv Detail & Related papers (2023-04-13T18:00:01Z) - LL-GNN: Low Latency Graph Neural Networks on FPGAs for High Energy Physics [45.666822327616046]
This work presents a novel reconfigurable architecture for Low Latency Graph Neural Network (LL-GNN) designs for particle detectors.
The LL-GNN design advances the next generation of trigger systems by enabling sophisticated algorithms to process experimental data efficiently.
arXiv Detail & Related papers (2022-09-28T12:55:35Z) - Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z) - Decomposition of Matrix Product States into Shallow Quantum Circuits [62.5210028594015]
Tensor network (TN) algorithms can be mapped to parametrized quantum circuits (PQCs).
We propose a new protocol for approximating TN states using realistic quantum circuits.
Our results reveal one particular protocol, involving sequential growth and optimization of the quantum circuit, to outperform all other methods.
arXiv Detail & Related papers (2022-09-01T17:08:41Z) - EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.