Ramp Up NTT in Record Time using GPU-Accelerated Algorithms and LLM-based Code Generation
- URL: http://arxiv.org/abs/2502.11110v1
- Date: Sun, 16 Feb 2025 12:53:23 GMT
- Title: Ramp Up NTT in Record Time using GPU-Accelerated Algorithms and LLM-based Code Generation
- Authors: Yu Cui, Hang Fu, Licheng Wang, Haibin Zhang
- Abstract summary: Homomorphic encryption (HE) is a core building block in privacy-preserving machine learning (PPML). Many GPU-accelerated cryptographic schemes have been proposed to improve the performance of HE. Given the powerful code generation capabilities of large language models (LLMs), we aim to explore their potential to automatically generate practical GPU-friendly algorithm code.
- Score: 11.120838175165986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Homomorphic encryption (HE) is a core building block in privacy-preserving machine learning (PPML), but HE is also widely known to be its efficiency bottleneck. Therefore, many GPU-accelerated cryptographic schemes have been proposed to improve the performance of HE. However, these methods often require complex modifications tailored to specific algorithms and are tightly coupled with specific GPUs and operating systems. It is therefore natural to ask how to offer more practical, general GPU-accelerated cryptographic algorithm implementations. Given the powerful code generation capabilities of large language models (LLMs), we aim to explore their potential to automatically generate practical GPU-friendly algorithm code from CPU-friendly code. In this paper, we focus on the number theoretic transform (NTT) -- the core mechanism of HE. We first develop and optimize a GPU-friendly NTT (GNTT) family that exploits PyTorch's fast matrix computation and precomputation, achieving an approximately 62x speedup -- a significant boost over existing implementations. Then we explore GPU-friendly code generation using various LLMs, including DeepSeek-R1, OpenAI o1, and o3-mini. We discover many interesting findings throughout the process. For instance, somewhat surprisingly, our experiments demonstrate that DeepSeek-R1 significantly outperforms OpenAI o3-mini and o1, but still cannot beat our optimized protocol. The findings provide valuable insights for turbocharging PPML and enhancing the code generation capabilities of LLMs. Codes are available at: https://github.com/LMPC-Lab/GenGPUCrypto.
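To make the matrix-computation idea concrete, below is a minimal, illustrative PyTorch sketch of a matrix-form NTT: an n x n matrix of powers of an n-th root of unity is precomputed once, and the forward transform then reduces to a single dense matrix multiplication followed by one modular reduction. The modulus Q = 65537, the primitive root, and the float64 matmul trick (to keep integer arithmetic exact on GPU) are assumptions made for this sketch, not the paper's actual GNTT configuration.

```python
import torch

# Illustrative sketch only: a matrix-form (O(n^2)) NTT in PyTorch.
# Assumptions (not from the paper): a small NTT-friendly prime Q = 2^16 + 1,
# so every partial sum of the matmul stays below 2^53 and can be computed
# exactly in float64 on the GPU before a single modular reduction.

Q = 65537        # NTT-friendly prime (2^16 + 1)
G = 3            # a primitive root modulo Q
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def ntt_matrix(n: int) -> torch.Tensor:
    """Precompute the n x n transform matrix W[j, k] = omega^(j*k) mod Q."""
    omega = pow(G, (Q - 1) // n, Q)                   # primitive n-th root of unity mod Q
    powers = torch.tensor([pow(omega, j, Q) for j in range(n)])
    idx = torch.arange(n)
    exps = (idx.unsqueeze(0) * idx.unsqueeze(1)) % n  # omega has multiplicative order n
    return powers[exps].to(DEVICE, dtype=torch.float64)

def ntt(x: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Forward NTT of a batch of residue vectors: one matmul plus one reduction."""
    y = x.to(DEVICE, dtype=torch.float64) @ W         # exact: all partial sums < 2^53
    return (y % Q).to(torch.int64)

if __name__ == "__main__":
    n = 1024
    W = ntt_matrix(n)                                   # precomputed once, reused across calls
    x = torch.randint(0, Q, (8, n), dtype=torch.int64)  # batch of 8 length-1024 polynomials
    print(ntt(x, W).shape)                              # torch.Size([8, 1024])
```

A matching inverse transform would use the inverse root omega^{-1} mod Q and a final multiplication by n^{-1} mod Q; for cryptographic parameter sizes (e.g. ~60-bit moduli), this float64 shortcut no longer applies, and techniques such as residue-number-system decomposition would be needed instead.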
Related papers
- A Parallel CPU-GPU Framework for Cost-Bounded DFS with Applications to IDA* and BTS [13.186524200050957]
We introduce a method for GPU computations in depth-first search. This is used to create algorithms like synchronous IDA*, an extension of the Iterative Deepening A* (IDA*) algorithm. We evaluate the approach on the 3x3 Rubik's Cube and the 4x4 sliding-tile puzzle (STP), showing that GPU operations can be efficiently batched in DFS.
arXiv Detail & Related papers (2025-07-16T05:07:33Z) - QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm [24.09018606185114]
We propose an LLM-friendly Thinking Language (LLM-TL) to help LLMs decouple the generation of high-level optimization logic from low-level implementation on GPU. Along with a 2-stage reasoning workflow, TL-Code generation and translation, the LLMs can automatically generate FlashAttention implementations on diverse GPUs.
arXiv Detail & Related papers (2025-06-14T05:38:19Z) - CUDA-LLM: LLMs Can Write Efficient CUDA Kernels [9.287036563375617]
Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. We propose a novel framework called Feature Search and Reinforcement (FSR), which jointly optimizes compilation and functional correctness.
arXiv Detail & Related papers (2025-06-10T10:51:03Z) - NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding the significant slowdown caused by beam search.
arXiv Detail & Related papers (2025-05-28T20:43:10Z) - Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance.
We propose a novel parallel prompt decoding that requires only 0.0002% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to a 2.49x speedup and maintains a minimal memory overhead of just 0.0004%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z) - Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z) - Efficient GPU implementation of randomized SVD and its applications [17.71779625877989]
Matrix decompositions are ubiquitous in machine learning, with applications in dimensionality reduction, data compression, and deep learning algorithms. Typical solutions for matrix decompositions have high computational complexity, which significantly increases their cost and runtime.
We leverage efficient processing operations that can be run in parallel on modern Graphical Processing Units (GPUs) to reduce the computational burden of computing matrix decompositions.
arXiv Detail & Related papers (2021-10-05T07:42:41Z) - Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show, that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how the resulting summaries help steer this specific process to cut costs and reduce the production of defective parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z) - CryptGPU: Fast Privacy-Preserving Machine Learning on the GPU [8.633428365391666]
CryptGPU is a system for privacy-preserving machine learning that implements all operations on the GPU.
We introduce a new interface to embed cryptographic operations over secret-shared values into floating-point operations.
We show that our protocols achieve a 2x to 8x improvement in private inference and a 6x to 36x improvement for private training.
arXiv Detail & Related papers (2021-04-22T09:21:40Z) - Bringing UMAP Closer to the Speed of Light with GPU Acceleration [28.64858826371568]
We show a number of techniques that can be used to make a faster and more faithful GPU version of UMAP.
Many of these design choices/lessons are general-purpose and may inform the conversion of other graph and manifold learning algorithms to the GPU.
arXiv Detail & Related papers (2020-08-01T19:35:56Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z) - PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z) - TFApprox: Towards a Fast Emulation of DNN Approximate Hardware Accelerators on GPU [0.4817429789586127]
Energy efficiency of hardware accelerators of deep neural networks (DNN) can be improved by introducing approximate arithmetic circuits.
A software emulation of the DNN accelerator is usually executed on CPU or GPU.
This emulation is typically two or three orders of magnitude slower than a standard software DNN implementation.
arXiv Detail & Related papers (2020-02-21T08:22:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information shown and is not responsible for any consequences of its use.