SCE-NTT: A Hardware Accelerator for Number Theoretic Transform Using Superconductor Electronics
- URL: http://arxiv.org/abs/2508.21265v1
- Date: Thu, 28 Aug 2025 23:37:51 GMT
- Title: SCE-NTT: A Hardware Accelerator for Number Theoretic Transform Using Superconductor Electronics
- Authors: Sasan Razmkhah, Mingye Li, Zeming Cheng, Robert S. Aviles, Kyle Jackman, Joey Delport, Lieze Schindler, Wenhui Luo, Takuya Suzuki, Mehdi Kamal, Christopher L. Ayala, Coenrad J. Fourie, Nobuyuki Yoshikawa, Peter A. Beerel, Sandeep Gupta, Massoud Pedram
- Abstract summary: This research explores the use of superconductor electronics (SCE) for accelerating fully homomorphic encryption (FHE). We present SCE-NTT, a dedicated hardware accelerator based on superconductive single flux quantum (SFQ) logic and memory. We show the NTT-128 unit achieves 531 million NTT/sec at 34 GHz, over 100x faster than state-of-the-art CMOS equivalents.
- Score: 12.616265554244313
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This research explores the use of superconductor electronics (SCE) for accelerating fully homomorphic encryption (FHE), focusing on the Number-Theoretic Transform (NTT), a key computational bottleneck in FHE schemes. We present SCE-NTT, a dedicated hardware accelerator based on superconductive single flux quantum (SFQ) logic and memory, targeting high performance and energy efficiency beyond the limits of conventional CMOS. To address SFQ constraints such as limited dense RAM and restricted fanin/fanout, we propose a deeply pipelined NTT-128 architecture using shift register memory (SRM). Designed for N=128 32-bit coefficients, NTT-128 comprises log2(N)=7 processing elements (PEs), each featuring a butterfly unit (BU), dual coefficient memories operating in ping-pong mode via FIFO-based SRM queues, and twiddle factor buffers. The BU integrates a Shoup modular multiplier optimized for a small area, leveraging precomputed twiddle factors. A new RSFQ cell library with over 50 parameterized cells, including compound logic units, was developed for implementation. Functional and timing correctness were validated using JoSIM analog simulations and Verilog models. A multiphase clocking scheme was employed to enhance robustness and reduce path-balancing overhead, improving circuit reliability. Fabricated results show the NTT-128 unit achieves 531 million NTT/sec at 34 GHz, over 100x faster than state-of-the-art CMOS equivalents. We also project that the architecture can scale to larger sizes, such as a 2^14-point NTT in approximately 482 ns. Key-switch throughput is estimated at 1.63 million operations/sec, significantly exceeding existing hardware. These results demonstrate the strong potential of SCE-based accelerators for scalable, energy-efficient secure computation in the post-quantum era, with further gains anticipated through advances in fabrication.
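The butterfly unit described in the abstract pairs a Cooley-Tukey butterfly with a Shoup modular multiplier that exploits precomputed twiddle factors. A minimal software reference model of that arithmetic is sketched below; this is an illustrative Python sketch, not the SFQ hardware, and the 32-bit NTT-friendly modulus `Q = 2013265921` is an assumed example rather than the paper's parameter set.

```python
def shoup_precompute(w, q, beta=32):
    """Precompute the Shoup constant floor(w * 2^beta / q) for a fixed
    twiddle factor w -- done once, offline, per twiddle."""
    return (w << beta) // q

def shoup_mulmod(a, w, w_shoup, q, beta=32):
    """Compute a*w mod q with no runtime division: one approximate-quotient
    multiply, one low multiply, and a single conditional subtraction.
    Requires a < 2^beta and w < q."""
    t = (a * w_shoup) >> beta              # approximate quotient of a*w/q
    r = (a * w - t * q) & ((1 << beta) - 1)  # remainder lies in [0, 2q)
    return r - q if r >= q else r

def ct_butterfly(u, v, w, w_shoup, q):
    """Cooley-Tukey butterfly, the core operation of each processing
    element: (u, v) -> (u + w*v, u - w*v) mod q."""
    t = shoup_mulmod(v, w, w_shoup, q)
    return (u + t) % q, (u - t) % q
```

In hardware, the conditional subtraction and the precomputed `w_shoup` are what keep the multiplier's area small: the datapath never performs a general modular division.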
Related papers
- @NTT: Algorithm-Targeted NTT hardware acceleration via Design-Time Constant Optimization [4.080796345570048]
@NTT exploits the fact that the ring parameters in these algorithms are fixed, enabling design-time constant optimization. Our case study on the Dilithium NTT, implemented using the TSMC 28 nm library, operates at a clock frequency of 1.0 GHz. On FPGA, the design achieves a throughput-per-LUT that is 5.2x higher than the state-of-the-art implementation.
arXiv Detail & Related papers (2026-01-25T11:48:24Z)
- A Constant-Time Hardware Architecture for the CSIDH Key-Exchange Protocol [0.6597195879147555]
This paper presents the first comprehensive hardware study of CSIDH on both FPGA and ASIC platforms. The constant-time CSIDH-512 design requires $1.03\times10^8$ clock cycles per key generation. For ASIC implementation in a 180 nm process, the design requires $1.065\times10^8$ clock cycles and achieves a frequency of approximately 180 MHz, resulting in a key-generation latency of 591 ms.
arXiv Detail & Related papers (2025-08-14T21:37:29Z)
- Resource Analysis of Low-Overhead Transversal Architectures for Reconfigurable Atom Arrays [38.6948808036416]
We present a low-overhead architecture that supports the layout and resource estimation of large-scale fault-tolerant quantum algorithms. We find that 2048-bit RSA factoring can be executed with 19 million qubits in 5.6 days, assuming 1 ms QEC cycle times.
arXiv Detail & Related papers (2025-05-21T18:00:18Z) - GDNTT: an Area-Efficient Parallel NTT Accelerator Using Glitch-Driven Near-Memory Computing and Reconfigurable 10T SRAM [14.319119105134309]
This paper proposes an area-efficient highly parallel NTT accelerator with glitch-driven near-memory computing (GDNTT)<n>The design integrates a 10T for data storage, enabling flexible row/column data access and streamlining circuit mapping strategies.<n> Evaluation results show that the proposed NTT accelerator achieves a 1.528* improvement in throughput-per-area compared to the state-of-the-art.
arXiv Detail & Related papers (2025-05-13T01:53:07Z) - A Unified Hardware Accelerator for Fast Fourier Transform and Number Theoretic Transform [0.0]
Number Theoretic Transform (NTT) is indispensable tool for computing efficient multiplications in post-quantum lattice-based cryptography.<n>We demonstrate a unified hardware accelerator supporting both 512-point complex FFT and 256-point NTT.<n>Our implementation achieves performance comparable to state-of-the-art ML-KEM / ML-DSA NTT accelerators on FPGA.
arXiv Detail & Related papers (2025-04-15T12:13:05Z) - Neural Signal Compression using RAMAN tinyML Accelerator for BCI Applications [2.036583412151438]
Large-scale brain recordings produce vast amounts of data that must be wirelessly transmitted for offline analysis and decoding.<n>We propose a neural signal compression scheme utilizing Convolutional Autoencoders (CAEs)<n>CAEs achieves a compression ratio of up to 150 for compressing local field potentials (LFPs)<n> RAMAN is an energy-efficient tinyML accelerator designed for edge computing.
arXiv Detail & Related papers (2025-04-09T16:09:00Z) - Performance Characterization of a Multi-Module Quantum Processor with Static Inter-Chip Couplers [63.42120407991982]
Three-dimensional integration technologies such as flip-chip bonding are a key prerequisite to realize large-scale superconducting quantum processors.<n>We present a design for a multi-chip module comprising one carrier chip and four qubit modules.<n>Measuring two of the qubits, we analyze the readout performance, finding a mean three-level state-assignment error of $9 times 10-3$ in 200 ns.<n>We demonstrate a controlled-Z two-qubit gate in 100 ns with an error of $7 times 10-3$ extracted from interleaved randomized benchmarking.
arXiv Detail & Related papers (2025-03-16T18:32:44Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes.<n>Our results demonstrate at most 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - Analog Spiking Neuron in CMOS 28 nm Towards Large-Scale Neuromorphic Processors [0.8426358786287627]
In this work, we present a low-power Leaky Integrate-and-Fire neuron design fabricated in TSMC's 28 nm CMOS technology.
The fabricated neuron consumes 1.61 fJ/spike and occupies an active area of 34 $\mu m^2$, leading to a maximum spiking frequency of 300 kHz at a 250 mV power supply.
arXiv Detail & Related papers (2024-08-14T17:51:20Z) - A Heterogeneous RISC-V based SoC for Secure Nano-UAV Navigation [40.8381466360025]
Nano-UAVs face significant power and payload constraints while requiring advanced computing capabilities.
We present Shaheen, a 9 mm$^2$, 200 mW system-on-a-chip (SoC).
It integrates a Linux-capable RV64 core, compliant with the v1.0 ratified Hypervisor extension, along with a low-cost and low-power memory controller.
At the same time, it integrates a fully programmable energy- and area-efficient multi-core cluster of RV32 cores optimized for general-purpose DSP.
arXiv Detail & Related papers (2024-01-07T16:03:47Z) - SupeRBNN: Randomized Binary Neural Network Using Adiabatic
Superconductor Josephson Devices [44.440915387556544]
AQFP devices serve as excellent carriers for binary neural network (BNN) computations.
We propose SupeRBNN, an AQFP-based randomized BNN acceleration framework.
We show that our design achieves an energy efficiency approximately $7.8\times10^4$ times higher than that of the ReRAM-based BNN framework.
arXiv Detail & Related papers (2023-09-21T16:14:42Z) - Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and
Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z) - EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware
Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z) - Low Latency CMOS Hardware Acceleration for Fully Connected Layers in
Deep Neural Networks [1.9036571490366496]
The FC accelerator, FC-ACCL, is based on 128 8x8 or 16x16 processing elements for matrix-vector multiplication.
The design can reduce latency for the large FC6 layer by 60 % in AlexNet and by 3 % in VGG16 when compared to an alternative EIE solution.
arXiv Detail & Related papers (2020-11-25T15:49:38Z)
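As an aside on the design-time constant optimization highlighted in the @NTT entry above: when the ring parameters are fixed at design time (for example the Dilithium modulus $q = 2^{23} - 2^{13} + 1 = 8380417$), modular reduction can be specialized into pure shift-and-add logic with no general divider. The sketch below is an illustrative Python model of that idea, assumed from the structure of the Dilithium modulus rather than taken from the paper's actual circuit.

```python
Q = 8380417  # Dilithium modulus, fixed at design time: 2^23 - 2^13 + 1

def reduce_q(x):
    """Reduce a nonnegative x (e.g. a ~46-bit product of two coefficients)
    mod Q using only shifts, adds, and subtracts, by exploiting the
    design-time identity 2^23 ≡ 2^13 - 1 (mod Q)."""
    while x >> 23:                        # fold the high bits down
        hi, lo = x >> 23, x & ((1 << 23) - 1)
        x = lo + (hi << 13) - hi          # hi*2^23 ≡ hi*(2^13 - 1) (mod Q)
    while x >= Q:                         # x < 2^23 < 2Q here, so at most
        x -= Q                            # one subtraction fires
    return x
```

Because every shift amount and the final comparison constant are known at design time, a hardware implementation folds them into fixed wiring and a single comparator, which is the kind of saving the @NTT summary refers to.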
This list is automatically generated from the titles and abstracts of the papers in this site.