A Scalable Multi-GPU Framework for Encrypted Large-Model Inference
- URL: http://arxiv.org/abs/2512.11269v1
- Date: Fri, 12 Dec 2025 04:15:38 GMT
- Title: A Scalable Multi-GPU Framework for Encrypted Large-Model Inference
- Authors: Siddharth Jayashankar, Joshua Kim, Michael B. Sullivan, Wenting Zheng, Dimitrios Skarlatos
- Abstract summary: Encrypted AI using fully homomorphic encryption (FHE) provides strong privacy guarantees. Recent works proposed ASICs to accelerate FHE, but these require expensive advanced manufacturing processes that constrain their accessibility. This paper presents Cerium, a multi-GPU framework for FHE inference on large models.
- Score: 5.966282323502589
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Encrypted AI using fully homomorphic encryption (FHE) provides strong privacy guarantees; but its slow performance has limited practical deployment. Recent works proposed ASICs to accelerate FHE, but require expensive advanced manufacturing processes that constrain their accessibility. GPUs are a far more accessible platform, but achieving ASIC-level performance using GPUs has remained elusive. Furthermore, state-of-the-art approaches primarily focus on small models that fit comfortably within a single device. Supporting large models such as LLMs in FHE introduces a dramatic increase in computational complexity that requires optimized GPU kernels, along with managing terabyte-scale memory footprints that far exceed the capacity of a single GPU. This paper presents Cerium, a multi-GPU framework for FHE inference on large models. Cerium integrates a domain-specific language, an optimizing compiler, and a runtime system to automatically generate high-performance GPU kernels, manage terabyte-scale memory footprints, and parallelize computation across multiple GPUs. It introduces new IR constructs, compiler passes, sparse polynomial representations, memory-efficient data layouts, and communication-aware parallelization techniques that together enable encrypted inference for models ranging from small CNNs to Llama3-8B. We build Cerium on NVIDIA GPUs and demonstrate significant performance gains. For small models, Cerium outperforms expert-written hand-optimized GPU libraries by up to 2.25 times. Cerium achieves performance competitive with state-of-the-art FHE ASICs, outright matching prior FHE ASIC CraterLake. It is the first GPU system to execute bootstrapping in under 10 milliseconds, achieving 7.5 milliseconds, and is the first to demonstrate encrypted inference for BERT-Base and Llama3-8B in 8 seconds and 134 seconds, respectively.
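The dominant cost in CKKS-style FHE workloads like those Cerium targets is polynomial arithmetic in the ring Z_q[X]/(X^N + 1), which the compiler lowers to optimized GPU kernels (typically via the NTT). The sketch below is an illustrative, schoolbook version of that core primitive only; it is not the paper's implementation, and the parameters `N` and `Q` are toy values chosen for readability (real deployments use N on the order of 2^15 and chains of large primes).

```python
# Negacyclic polynomial multiplication in Z_q[X]/(X^N + 1): the basic
# arithmetic step that FHE inference performs millions of times per model.
# Schoolbook O(N^2) sketch; production systems use the NTT on GPUs/ASICs.

N = 8            # ring dimension (toy value; real FHE uses N >= 2^15)
Q = 1_000_003    # coefficient modulus (toy value; real FHE uses prime chains)

def negacyclic_mul(a, b, n=N, q=Q):
    """Multiply coefficient lists a, b modulo X^n + 1 and modulo q."""
    res = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                res[k] = (res[k] + ai * bj) % q
            else:
                # X^n == -1 in this ring, so terms that wrap flip sign
                res[k - n] = (res[k - n] - ai * bj) % q
    return res

# Sanity check: X * X^(N-1) = X^N = -1 in the ring, i.e. Q-1 mod Q.
x = [0, 1] + [0] * (N - 2)
x_top = [0] * (N - 1) + [1]
print(negacyclic_mul(x, x_top))  # [1000002, 0, 0, 0, 0, 0, 0, 0]
```

The negacyclic wrap-around (the `else` branch) is what distinguishes this ring from ordinary polynomial multiplication and is why FHE libraries use a twisted NTT rather than a plain FFT.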
Related papers
- Horizon-LM: A RAM-Centric Architecture for LLM Training [26.927410607740025]
Horizon-LM is a memory-centric training system that redefines the roles of CPU and GPU for large-model optimization. On a single H200 GPU with 1.5 TB host RAM, Horizon-LM reliably trains models up to 120B parameters. On a standard single A100 machine, Horizon-LM achieves up to 12.2x higher training throughput than DeepSpeed ZeRO-3 with CPU offloading.
arXiv Detail & Related papers (2026-02-04T18:04:46Z) - Multi-GPU Quantum Circuit Simulation and the Impact of Network Performance [0.7340017786387767]
We present the introduction of MPI into the QED-C Application-Oriented Benchmarks to facilitate benchmarking on HPC systems. We benchmark using a variety of interconnect paths, including the recent NVIDIA Grace Blackwell NVL72 architecture. We show that while improvements to GPU architecture have led to speedups of over 4.5X, advances in interconnect performance have had a larger impact, with over 16X performance improvements in time to solution.
arXiv Detail & Related papers (2025-11-18T17:04:28Z) - Minute-Long Videos with Dual Parallelisms [57.22737565366549]
Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. We propose a novel distributed inference strategy, termed DualParal. Instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs.
arXiv Detail & Related papers (2025-05-27T11:55:22Z) - FED: Fast and Efficient Dataset Deduplication Framework with GPU Acceleration [4.499466939042501]
Recently, NVIDIA introduced a GPU-based MinHash LSH deduplication method, but it remains suboptimal. This paper proposes a GPU-accelerated deduplication framework, FED, that optimizes MinHash LSH for GPU clusters. In large-scale experiments, deduplication of 1.2 trillion tokens is completed in just 6 hours in a four-node, 16-GPU environment.
arXiv Detail & Related papers (2025-01-02T04:11:23Z) - MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
arXiv Detail & Related papers (2024-11-18T01:06:12Z) - Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading [2.8231000588510757]
Transformers and large language models(LLMs) have seen rapid adoption in all domains.
Training of transformers is very expensive and often hits a "memory wall".
We propose a novel technique to split the LLM into subgroups, whose update phase is scheduled on either the CPU or the GPU.
arXiv Detail & Related papers (2024-10-26T00:43:59Z) - Cheddar: A Swift Fully Homomorphic Encryption Library Designed for GPU Architectures [2.613335121517245]
Fully homomorphic encryption (FHE) frees cloud computing from privacy concerns by enabling secure computation on encrypted data. We present Cheddar, a high-performance FHE library for GPUs, achieving substantial speedups over previous GPU implementations.
arXiv Detail & Related papers (2024-07-17T23:49:18Z) - Scaling Tractable Probabilistic Circuits: A Systems Perspective [53.76194929291088]
PyJuice is a general implementation design for PCs that improves prior art in several regards. It is 1-2 orders of magnitude faster than existing systems at training large-scale PCs. PyJuice consumes 2-5x less memory, which enables us to train larger models.
arXiv Detail & Related papers (2024-06-02T14:57:00Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z) - Cramming: Training a Language Model on a Single GPU in One Day [64.18297923419627]
Recent trends in language modeling have focused on increasing performance through scaling.
We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU.
We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings.
arXiv Detail & Related papers (2022-12-28T18:59:28Z) - PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z) - RTGPU: Real-Time GPU Scheduling of Hard Deadline Parallel Tasks with Fine-Grain Utilization [5.02836935036198]
We propose RTGPU, which can schedule the execution of multiple GPU applications in real-time to meet hard deadlines.
Our approach provides superior schedulability compared with previous work, and gives real-time guarantees to meet hard deadlines for multiple GPU applications.
arXiv Detail & Related papers (2021-01-25T22:34:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.