Towards Closing the Performance Gap for Cryptographic Kernels Between CPUs and Specialized Hardware
- URL: http://arxiv.org/abs/2509.12494v1
- Date: Mon, 15 Sep 2025 22:35:00 GMT
- Title: Towards Closing the Performance Gap for Cryptographic Kernels Between CPUs and Specialized Hardware
- Authors: Naifeng Zhang, Sophia Fu, Franz Franchetti
- Abstract summary: We develop an optimized implementation of cryptographic kernels for x86 CPUs at the per-core level. We propose a small AVX-512 extension, dubbed multi-word extension (MQX). MQX cuts the slowdown relative to ASICs to as low as 35 times on a single CPU core.
- Score: 0.07646713951724009
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Specialized hardware like application-specific integrated circuits (ASICs) remains the primary accelerator type for cryptographic kernels based on large integer arithmetic. Prior work has shown that commodity and server-class GPUs can achieve near-ASIC performance for these workloads. However, achieving comparable performance on CPUs remains an open challenge. This work investigates the following question: How can we narrow the performance gap between CPUs and specialized hardware for key cryptographic kernels like basic linear algebra subprograms (BLAS) operations and the number theoretic transform (NTT)? To this end, we develop an optimized scalar implementation of these kernels for x86 CPUs at the per-core level. We utilize SIMD instructions (specifically AVX2 and AVX-512) to further improve performance, achieving an average speedup of 38 times and 62 times over state-of-the-art CPU baselines for NTTs and BLAS operations, respectively. To narrow the gap further, we propose a small AVX-512 extension, dubbed multi-word extension (MQX), which delivers substantial speedup with only three new instructions and minimal proposed hardware modifications. MQX cuts the slowdown relative to ASICs to as low as 35 times on a single CPU core. Finally, we perform a roofline analysis to evaluate the peak performance achievable with MQX when scaled across an entire multi-core CPU. Our results show that, with MQX, top-tier server-grade CPUs can approach the performance of state-of-the-art ASICs for cryptographic workloads.
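The NTT discussed in the abstract is a discrete Fourier transform over a finite field, built from the modular multiply-add operations that kernels like these spend most of their time in. As a minimal illustrative sketch (not the paper's optimized scalar or AVX-512 implementation), an iterative radix-2 NTT over a toy prime field can be written as follows; the modulus 17, primitive root 3, and transform length 8 are illustrative choices, not parameters from the paper:

```python
def ntt(a, root, mod):
    """Iterative radix-2 Cooley-Tukey NTT. len(a) must be a power of two
    dividing mod - 1, and root must be a primitive root modulo the prime mod."""
    n = len(a)
    a = a[:]
    # Bit-reversal permutation so the butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly stages: each stage uses a primitive length-th root of unity.
    length = 2
    while length <= n:
        w_len = pow(root, (mod - 1) // length, mod)
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u = a[k]
                v = a[k + length // 2] * w % mod
                a[k] = (u + v) % mod
                a[k + length // 2] = (u - v) % mod
                w = w * w_len % mod
        length <<= 1
    return a

def intt(a, root, mod):
    """Inverse NTT: forward transform with the inverse root, then scale by 1/n."""
    n = len(a)
    res = ntt(a, pow(root, mod - 2, mod), mod)
    inv_n = pow(n, mod - 2, mod)
    return [x * inv_n % mod for x in res]
```

Pointwise products in the transformed domain correspond to cyclic convolutions, which is why NTTs underpin fast large-integer and polynomial multiplication; the per-butterfly modular multiply-accumulate is the kind of operation the proposed MQX instructions aim to speed up.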
Related papers
- CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation [51.72529978689561]
CUDA Agent is a large-scale agentic reinforcement learning system that develops kernel expertise through three components. It delivers 100%, 100%, and 92% faster rates over torch.compile on KernelBench.
arXiv Detail & Related papers (2026-02-27T18:58:05Z)
- KScaNN: Scalable Approximate Nearest Neighbor Search on Kunpeng [46.35664429179457]
A naive port of existing x86 ANNS algorithms to ARM platforms results in a substantial performance deficit. We introduce KScaNN, a novel ANNS algorithm co-designed for the Kunpeng 920 ARM architecture.
arXiv Detail & Related papers (2025-11-05T09:01:32Z)
- Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution [1.3356260369011272]
We propose a lightweight synchronization mechanism based on OpenCL fine-grained shared virtual memory (SVM) and machine learning models that accurately predict execution times. A comprehensive evaluation on four mobile platforms shows that our approach can quickly select CPU-GPU co-execution strategies, achieving up to 1.89x speedup for linear layers and 1.75x speedup for convolutional layers.
arXiv Detail & Related papers (2025-10-24T01:41:43Z)
- Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures [3.2645124275315163]
Large language model (LLM)-based inference workloads increasingly dominate data center costs and resource utilization. This paper presents an in-depth analysis of inference behavior on loosely-coupled (PCIe A100/H100) and closely-coupled (GH200) systems.
arXiv Detail & Related papers (2025-04-16T04:02:39Z)
- Q-GEAR: Improving quantum simulation framework [0.28402080392117757]
We introduce Q-Gear, a software framework that transforms Qiskit quantum circuits into CUDA-Q kernels. Q-Gear accelerates CPU-based and GPU-based simulations by two orders of magnitude and ten times, respectively, with minimal coding effort.
arXiv Detail & Related papers (2025-04-04T22:17:51Z)
- gECC: A GPU-based high-throughput framework for Elliptic Curve Cryptography [15.39096542261856]
Elliptic Curve Cryptography (ECC) is an encryption method that provides security comparable to traditional techniques like Rivest-Shamir-Adleman (RSA). ECC is still hindered by the significant performance overhead associated with elliptic curve (EC) operations. This paper presents gECC, a versatile framework for ECC optimized for GPU architectures.
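The EC operations that dominate ECC's cost reduce to modular field arithmetic. As a toy sketch (not gECC's GPU implementation, and far below cryptographic parameter sizes), affine point addition on a short Weierstrass curve can be illustrated in Python; the curve y^2 = x^3 + 2x + 3 over F_97 and the point (3, 6) are made-up illustrative choices:

```python
P = 97                 # illustrative small prime; real curves use ~256-bit primes
A_COEF, B_COEF = 2, 3  # curve y^2 = x^3 + 2x + 3 over F_P

def inv_mod(x):
    """Modular inverse via Fermat's little theorem (P is prime)."""
    return pow(x, P - 2, P)

def ec_add(p1, p2):
    """Affine point addition; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None  # vertical line: result is the point at infinity
    if p1 == p2:
        s = (3 * x1 * x1 + A_COEF) * inv_mod(2 * y1) % P  # tangent slope
    else:
        s = (y2 - y1) * inv_mod((x2 - x1) % P) % P        # chord slope
    x3 = (s * s - x1 - x2) % P
    y3 = (s * (x1 - x3) - y1) % P
    return (x3, y3)

def on_curve(pt):
    """Check that a point satisfies the curve equation (infinity counts)."""
    if pt is None:
        return True
    x, y = pt
    return (y * y - (x ** 3 + A_COEF * x + B_COEF)) % P == 0
```

Production implementations typically use projective coordinates to avoid the per-addition modular inversion, which is by far the most expensive step in this affine form.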
arXiv Detail & Related papers (2024-12-22T01:50:50Z)
- Hybrid quantum programming with PennyLane Lightning on HPC platforms [0.0]
PennyLane's Lightning suite is a collection of high-performance state-vector simulators targeting CPU, GPU, and HPC-native architectures and workloads.
Quantum applications such as QAOA, VQE, and synthetic workloads are implemented to demonstrate the supported classical computing architectures.
arXiv Detail & Related papers (2024-03-04T22:01:03Z)
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- Combining processing throughput, low latency and timing accuracy in experiment control [0.0]
We ported the firmware of the ARTIQ experiment control infrastructure to an embedded system based on a commercial Xilinx Zynq-7000 system-on-chip.
It contains high-performance hardwired CPU cores integrated with FPGA fabric.
arXiv Detail & Related papers (2021-11-30T11:11:02Z)
- DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks [58.48833325238537]
Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z)
- Optimizing Deep Learning Recommender Systems' Training On CPU Cluster Architectures [56.69373580921888]
We focus on Recommender Systems which account for most of the AI cycles in cloud computing centers.
By enabling it to run on latest CPU hardware and software tailored for HPC, we are able to achieve more than two-orders of magnitude improvement in performance.
arXiv Detail & Related papers (2020-05-10T14:40:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.