Related papers: CAT: A GPU-Accelerated FHE Framework with Its Application to High-Precision Private Dataset Query

CAT: A GPU-Accelerated FHE Framework with Its Application to High-Precision Private Dataset Query

URL: http://arxiv.org/abs/2503.22227v1
Date: Fri, 28 Mar 2025 08:20:18 GMT
Title: CAT: A GPU-Accelerated FHE Framework with Its Application to High-Precision Private Dataset Query
Authors: Qirui Li, Rui Zong,
Abstract summary: We introduce an open-source GPU-accelerated fully homomorphic encryption (FHE) framework CAT.<n>emphCAT features a three-layer architecture: a foundation of core math, a bridge of pre-computed elements and combined operations, and an API-accessible layer of FHE operators.<n>Based on our framework, we implement three widely used FHE schemes: CKKS, BFV, and BGV.
Score: 0.51795041186793
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce an open-source GPU-accelerated fully homomorphic encryption (FHE) framework CAT, which surpasses existing solutions in functionality and efficiency. \emph{CAT} features a three-layer architecture: a foundation of core math, a bridge of pre-computed elements and combined operations, and an API-accessible layer of FHE operators. It utilizes techniques such as parallel executed operations, well-defined layout patterns of cipher data, kernel fusion/segmentation, and dual GPU pools to enhance the overall execution efficiency. In addition, a memory management mechanism ensures server-side suitability and prevents data leakage. Based on our framework, we implement three widely used FHE schemes: CKKS, BFV, and BGV. The results show that our implementation on Nvidia 4090 can achieve up to 2173$\times$ speedup over CPU implementation and 1.25$\times$ over state-of-the-art GPU acceleration work for specific operations. What's more, we offer a scenario validation with CKKS-based Privacy Database Queries, achieving a 33$\times$ speedup over its CPU counterpart. All query tasks can handle datasets up to $10^3$ rows on a single GPU within 1 second, using 2-5 GB storage. Our implementation has undergone extensive stability testing and can be easily deployed on commercial GPUs. We hope that our work will significantly advance the integration of state-of-the-art FHE algorithms into diverse real-world systems by providing a robust, industry-ready, and open-source tool.

Related papers

Fastrack: Fast IO for Secure ML using GPU TEEs [7.758531952461963]
GPU-based Trusted Execution Environments (TEEs) offer secure, high-performance solutions. CPU-to-GPU communication overheads significantly hinder performance. This paper analyzes Nvidia H100 TEE protocols and identifies three key overheads. We propose Fastrack, optimizing with 1) direct GPU TEE communication, 2) parallelized authentication, and 3) overlapping decryption with PCI-e transmission.
arXiv Detail & Related papers (2024-10-20T01:00:33Z)
Advanced Techniques for High-Performance Fock Matrix Construction on GPU Clusters [0.0]
opt-UM and opt-Brc introduce significant enhancements to Hartree-Fock caculations up to $f$-type angular momentum functions. Opt-Brc excels for smaller systems and for highly contracted triple-$zeta$ basis sets, while opt-UM is advantageous for large molecular systems.
arXiv Detail & Related papers (2024-07-31T08:49:06Z)
Multi-GPU RI-HF Energies and Analytic Gradients $-$ Towards High Throughput Ab Initio Molecular Dynamics [0.0]
This article presents an optimized algorithm and implementation for calculating resolution-of-the-identity Hartree-Fock energies and analytic gradients using multiple Graphics Processing Units (GPUs) The algorithm is especially designed for high throughput emphab initio molecular dynamics simulations of small and medium size molecules (10-100 atoms)
arXiv Detail & Related papers (2024-07-29T00:14:10Z)
Cheddar: A Swift Fully Homomorphic Encryption Library for CUDA GPUs [2.613335121517245]
Fully homomorphic encryption (FHE) is a cryptographic technology capable of resolving security and privacy problems in cloud computing by encrypting data in use. FHE introduces tremendous computational overhead for processing encrypted data, causing FHE workloads to become 2-6 orders of magnitude slower than their unencrypted counterparts. We propose Cheddar, an FHE library for GPU, which demonstrates significantly faster performance compared to prior GPU implementations.
arXiv Detail & Related papers (2024-07-17T23:49:18Z)
GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption [33.87964584665433]
Homomorphic Encryption (FHE) enables the processing of encrypted data without decrypting it. FHE introduces a slowdown of up to five orders of magnitude as compared to the same computation using plaintext data. We propose GME, which combines three key microarchitectural extensions along with a compile-time optimization to the current AMD CDNA GPU architecture.
arXiv Detail & Related papers (2023-09-20T01:50:43Z)
FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential vast untapped consumer-level GPU. This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient. We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture. We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
EAutoDet: Efficient Architecture Search for Object Detection [110.99532343155073]
EAutoDet framework can discover practical backbone and FPN architectures for object detection in 1.4 GPU-days. We propose a kernel reusing technique by sharing the weights of candidate operations on one edge and consolidating them into one convolution. In particular, the discovered architectures surpass state-of-the-art object detection NAS methods and achieve 40.1 mAP with 120 FPS and 49.2 mAP with 41.3 FPS on COCO test-dev set.
arXiv Detail & Related papers (2022-03-21T05:56:12Z)
ASH: A Modern Framework for Parallel Spatial Hashing in 3D Perception [91.24236600199542]
ASH is a modern and high-performance framework for parallel spatial hashing on GPU. ASH achieves higher performance, supports richer functionality, and requires fewer lines of code. ASH and its example applications are open sourced in Open3D.
arXiv Detail & Related papers (2021-10-01T16:25:40Z)
Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show, that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms. We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters. It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions. We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
arXiv Detail & Related papers (2021-04-16T09:54:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.