Related papers: FHECore: Rethinking GPU Microarchitecture for Fully Homomorphic Encryption

URL: http://arxiv.org/abs/2602.22229v1
Date: Tue, 10 Feb 2026 02:55:10 GMT
Title: FHECore: Rethinking GPU Microarchitecture for Fully Homomorphic Encryption
Authors: Lohit Daksha, Seyda Guzelhan, Kaustubh Shivdikar, Carlos Agulló Domingo, Óscar Vera Lopez, Gilbert Jonatan, Hubert Dymarkowski, Aymane El Jerari, José Cano, José L. Abellán, John Kim, David Kaeli, Ajay Joshi
Abstract summary: Fully Homomorphic Encryption (FHE) enables computation directly on encrypted data but incurs massive computational and memory overheads. Custom accelerators can mitigate these costs, but their long time-to-market and the rapid evolution of FHE algorithms threaten their long-term relevance. We propose FHECore, a specialized functional unit integrated directly into the GPU's Streaming Multiprocessor.
Score: 2.7777199166440827
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Fully Homomorphic Encryption (FHE) enables computation directly on encrypted data but incurs massive computational and memory overheads, often exceeding plaintext execution by several orders of magnitude. While custom ASIC accelerators can mitigate these costs, their long time-to-market and the rapid evolution of FHE algorithms threaten their long-term relevance. GPUs, by contrast, offer scalability, programmability, and widespread availability, making them an attractive platform for FHE. However, modern GPUs are increasingly specialized for machine learning workloads, emphasizing low-precision datatypes (e.g., INT8, FP8) that are fundamentally mismatched to the wide-precision modulo arithmetic required by FHE. Essentially, while GPUs offer ample parallelism, their functional units, like Tensor Cores, are not suited for the wide-integer modulo arithmetic required by FHE schemes such as CKKS. Despite this constraint, researchers have attempted to map FHE primitives onto Tensor Cores by segmenting wide integers into low-precision (INT8) chunks. To overcome these bottlenecks, we propose FHECore, a specialized functional unit integrated directly into the GPU's Streaming Multiprocessor. Our design is motivated by a key insight: the two dominant contributors to latency, the Number Theoretic Transform and Base Conversion, can be formulated as modulo-linear transformations. This allows them to be mapped onto a common hardware unit that natively supports wide-precision modulo-multiply-accumulate operations. Our simulations demonstrate that FHECore reduces dynamic instruction count by a geometric mean of $2.41\times$ for CKKS primitives and $1.96\times$ for end-to-end workloads. These reductions translate to performance speedups of $1.57\times$ and $2.12\times$, respectively, including a $50\%$ reduction in bootstrapping latency, all while incurring a modest $2.4\%$ area overhead.
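
To make the abstract's "modulo-linear transformation" claim concrete, here is a minimal Python sketch. It is not the paper's implementation or hardware mapping, which this listing does not describe. It only shows that a naive Number Theoretic Transform and the matrix step of a textbook fast base conversion can both be written as one matrix-vector product built from modulo-multiply-accumulate steps. All function names (`mod_linear`, `ntt_matrix`, `base_conv_matrix`, `base_convert`) and the toy parameters are assumptions chosen for illustration; production code would use O(n log n) butterflies and much wider moduli.

```python
def mod_linear(matrix, vec, row_moduli):
    """Matrix-vector product where row i is reduced modulo row_moduli[i].

    Each inner step is one modulo-multiply-accumulate (hypothetical helper).
    """
    out = []
    for row, m in zip(matrix, row_moduli):
        acc = 0
        for a, x in zip(row, vec):
            acc = (acc + a * x) % m  # modular multiply-accumulate
        out.append(acc)
    return out


def ntt_matrix(n, omega, q):
    """Naive n x n NTT matrix with entries omega^(i*j) mod q."""
    return [[pow(omega, i * j, q) for j in range(n)] for i in range(n)]


def base_conv_matrix(src, dst):
    """Matrix with entries (Q / q_i) mod p_j, where Q is the product of src."""
    Q = 1
    for qi in src:
        Q *= qi
    return [[(Q // qi) % pj for qi in src] for pj in dst]


def base_convert(residues, src, dst):
    """Textbook fast base conversion; the result may be off by a small multiple of Q."""
    Q = 1
    for qi in src:
        Q *= qi
    # Element-wise pre-scaling by (Q / q_i)^(-1) mod q_i (needs Python 3.8+).
    scaled = [(r * pow(Q // qi, -1, qi)) % qi for r, qi in zip(residues, src)]
    # The remaining work is a single modulo-linear transformation.
    return mod_linear(base_conv_matrix(src, dst), scaled, dst)


# Toy NTT: q = 17, n = 4, omega = 4 is a primitive 4th root of unity mod 17.
q, n, omega = 17, 4, 4
x = [1, 2, 3, 4]
X = mod_linear(ntt_matrix(n, omega, q), x, [q] * n)

# Toy base conversion: move the value 30 from moduli {7, 11} to modulus 13.
src, dst = [7, 11], [13]
value = 30
converted = base_convert([value % m for m in src], src, dst)
```

Both transforms go through the same `mod_linear` routine and differ only in the matrix and the per-row modulus. That shared shape is what the abstract points to when it says the two operations can share one unit with native modulo-multiply-accumulate support.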

Related papers

Memory-Efficient Acceleration of Block Low-Rank Foundation Models on Resource Constrained GPUs [11.45717904490388]
Recent advances in transformer-based foundation models have made them the default choice for many tasks. Their rapidly growing size makes fitting a full model on a single GPU increasingly difficult and their computational cost prohibitive. Block low-rank (BLR) compression techniques address this challenge by learning compact representations of weight matrices.
arXiv Detail & Related papers (2025-12-24T00:41:13Z)
Evolution Strategies at the Hyperscale [57.75314521465674]
We introduce EGGROLL, an evolution strategies (ES) algorithm designed to scale backprop-free optimization to large population sizes. ES is a set of powerful blackbox optimisation methods that can handle non-differentiable or noisy objectives. EGGROLL overcomes these bottlenecks by generating random matrices $A \in \mathbb{R}^{m \times r}, B \in \mathbb{R}^{n \times r}$ with $r \ll \min(m,n)$ to form a low-rank matrix perturbation $A B^\top$.
arXiv Detail & Related papers (2025-11-20T18:56:05Z)
ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels [40.94392896555992]
Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical bandwidth across workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small set of simple, reusable principles can guide the optimal performance of workloads. ParallelKittens (PK) kernels achieve up to a $2.33\times$ speedup on parallel workloads.
arXiv Detail & Related papers (2025-11-17T21:48:33Z)
Tilus: A Tile-Level GPGPU Programming Language for Low-Precision Computation [10.605380159381776]
We introduce Tilus, a domain-specific language for General-Purpose GPU computing. It supports low-precision data types with arbitrary bit widths from 1 to 8. Our experiments demonstrate that Tilus efficiently supports a full spectrum of low-precision data types.
arXiv Detail & Related papers (2025-04-17T14:45:03Z)
Chameleon: An Efficient FHE Scheme Switching Acceleration on GPUs [17.536473118470774]
Fully homomorphic encryption (FHE) enables direct computation on encrypted data. Existing efforts primarily focus on single-class FHE schemes, which fail to meet the diverse requirements of data types and functions. We present an efficient GPU-based FHE switching acceleration scheme named Chameleon.
arXiv Detail & Related papers (2024-10-08T11:37:49Z)
Cheddar: A Swift Fully Homomorphic Encryption Library Designed for GPU Architectures [2.613335121517245]
Fully homomorphic encryption (FHE) frees cloud computing from privacy concerns by enabling secure computation on encrypted data. We present Cheddar, a high-performance FHE library for GPU, achieving substantial speedups over previous GPU implementations.
arXiv Detail & Related papers (2024-07-17T23:49:18Z)
Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [23.633481089469836]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. We propose a novel parallel prompt decoding that requires only $0.0002\%$ trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Our approach demonstrates up to $2.49\times$ speedup and maintains a minimal memory overhead of just $0.0004\%$.
arXiv Detail & Related papers (2024-05-28T22:19:30Z)
GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption [33.87964584665433]
Fully Homomorphic Encryption (FHE) enables the processing of encrypted data without decrypting it. FHE introduces a slowdown of up to five orders of magnitude as compared to the same computation using plaintext data. We propose GME, which combines three key microarchitectural extensions along with a compile-time optimization to the current AMD CDNA GPU architecture.
arXiv Detail & Related papers (2023-09-20T01:50:43Z)
INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient. We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture. We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels. We decompose the kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion. We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications. We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose VersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator. VersaGNN achieves on average $3712\times$ speedup with $1301.25\times$ energy reduction on CPU, and $35.4\times$ speedup with $17.66\times$ energy reduction on GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z)
Hybrid Models for Learning to Branch [81.93868699246214]
We propose a new hybrid architecture for efficient branching on CPU machines. The proposed architecture combines the expressive power of GNNs with computationally inexpensive multi-layer perceptrons (MLP) for branching.
arXiv Detail & Related papers (2020-06-26T21:03:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
