Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs
- URL: http://arxiv.org/abs/2509.18886v1
- Date: Tue, 23 Sep 2025 10:36:47 GMT
- Title: Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs
- Authors: Marcin Chrapek, Marcin Copik, Etienne Mettaz, Torsten Hoefler
- Abstract summary: Large Language Models (LLMs) are increasingly deployed on converged Cloud and High-Performance Computing infrastructure. As LLMs handle confidential inputs, their heightened security requirements slow adoption in privacy-sensitive sectors such as healthcare and finance. We propose Trusted Execution Environments (TEEs) as a solution for securing end-to-end LLM inference.
- Score: 16.49726695421423
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly deployed on converged Cloud and High-Performance Computing (HPC) infrastructure. However, as LLMs handle confidential inputs and are fine-tuned on costly, proprietary datasets, their heightened security requirements slow adoption in privacy-sensitive sectors such as healthcare and finance. We investigate methods to address this gap and propose Trusted Execution Environments (TEEs) as a solution for securing end-to-end LLM inference. We validate their practicality by evaluating these compute-intensive workloads entirely within CPU and GPU TEEs. On the CPU side, we conduct an in-depth study running full Llama2 inference pipelines (7B, 13B, 70B) inside Intel's TDX and SGX, accelerated by Advanced Matrix Extensions (AMX). We derive 12 insights, including that across various data types, batch sizes, and input lengths, CPU TEEs impose under 10% throughput and 20% latency overheads, further reduced by AMX. We run LLM inference on NVIDIA H100 Confidential Compute GPUs, contextualizing our CPU findings and observing throughput penalties of 4-8% that diminish as batch and input sizes grow. By comparing performance, cost, and security trade-offs, we show how CPU TEEs can be more cost-effective or secure than their GPU counterparts. To our knowledge, our work is the first to comprehensively demonstrate the performance and practicality of modern TEEs across both CPUs and GPUs for enabling confidential LLMs (cLLMs).
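The reported overheads are relative measurements against unprotected baselines. A minimal sketch of how such numbers can be derived from paired timing runs (the harness and the `run_inference` stand-in below are illustrative assumptions, not the authors' benchmark code):

```python
import time

def time_workload(fn, warmup=3, iters=10):
    """Return mean wall-clock seconds per call after warmup runs."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

def overhead_pct(baseline_s, protected_s):
    """Relative slowdown of the TEE-protected run versus the plain run."""
    return 100.0 * (protected_s - baseline_s) / baseline_s

# Hypothetical stand-in for one Llama2 generation call; a real study would
# run the identical model, batch size, and input length inside and outside
# the TEE (TDX/SGX on CPU, H100 confidential-compute mode on GPU).
def run_inference():
    sum(i * i for i in range(50_000))

t_plain = time_workload(run_inference)   # measured outside the TEE
t_tee = time_workload(run_inference)     # would be measured inside the TEE
print(f"latency overhead: {overhead_pct(t_plain, t_tee):+.1f}%")
```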
Related papers
- Pushing the Envelope of LLM Inference on AI-PC [45.081663877447816]
Ultra-low-bit models (1/1.58/2-bit) match the perplexity and end-task performance of their full-precision counterparts at the same model size.
The computational efficiency of the state-of-the-art inference runtimes (e.g., bitnet) used to deploy them remains underexplored.
We take a bottom-up approach: we first design and implement 1-bit and 2-bit microkernels optimized for modern CPUs, achieving peak computational efficiency.
We present end-to-end inference results with 2-bit models that outperform the current SOTA runtime bitnet.
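For intuition about the storage side of such models, here is a hedged NumPy sketch of 2-bit weight packing and unpacking; it illustrates the representation only and is unrelated to the paper's optimized microkernels:

```python
import numpy as np

def pack_2bit(q):
    """Pack integer values in [0, 3] into bytes, four weights per byte."""
    q = np.asarray(q, dtype=np.uint8).reshape(-1, 4)
    return (q[:, 0] | (q[:, 1] << 2) | (q[:, 2] << 4) | (q[:, 3] << 6)).astype(np.uint8)

def unpack_2bit(packed):
    """Inverse of pack_2bit: recover the four 2-bit values in each byte."""
    p = packed[:, None]
    return np.concatenate(
        [(p >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1
    ).reshape(-1).astype(np.uint8)

w = np.random.randint(0, 4, size=16)
assert np.array_equal(w.astype(np.uint8), unpack_2bit(pack_2bit(w)))
print(f"{w.size} weights stored in {pack_2bit(w).nbytes} bytes")
```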
arXiv Detail & Related papers (2025-08-08T23:33:38Z)
- HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration [13.53425131505526]
Deep learning has driven exponential increases in model parameters and computational demands.
NVIDIA GPUs and their CUDA-based software ecosystem provide robust support for parallel computing, and this ecosystem has established a dominant position in the field of parallel software.
However, translating CUDA code to other platforms poses significant challenges due to differences in parallel programming paradigms and hardware.
arXiv Detail & Related papers (2025-06-12T06:48:33Z)
- Characterization of GPU TEE Overheads in Distributed Data Parallel ML Training [7.236249885667945]
Confidential computing (CC) via trusted execution environments (TEEs) is now the most common approach to enable secure computing in the cloud.
The recent introduction of GPU TEEs by NVIDIA enables machine learning (ML) models to be trained without leaking model weights or data to the cloud provider.
We present an in-depth characterization study of the performance overhead associated with running distributed data parallel (DDP) ML training with GPU TEEs.
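On H100-class GPU TEEs, CPU-GPU traffic is additionally encrypted, so a first-order model of transfer cost adds a crypto stage to the PCIe stage. The sketch below is an illustrative back-of-the-envelope model; the bandwidth and model-size numbers are assumptions, not measurements from the paper:

```python
def transfer_time_s(bytes_moved, pcie_gbps=25.0, crypto_gbps=20.0, plain=False):
    """First-order cost of moving data between CPU and GPU.

    In confidential-compute mode the payload is also encrypted/decrypted,
    modeled here as a second serial stage; real systems can overlap the two.
    """
    t = bytes_moved / (pcie_gbps * 1e9)
    if not plain:
        t += bytes_moved / (crypto_gbps * 1e9)
    return t

grad_bytes = 7e9 * 2  # e.g., fp16 gradients of a 7B-parameter model (assumption)
t_plain = transfer_time_s(grad_bytes, plain=True)
t_cc = transfer_time_s(grad_bytes)
print(f"modeled transfer slowdown: {100 * (t_cc / t_plain - 1):.0f}%")
```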
arXiv Detail & Related papers (2025-01-20T22:23:50Z)
- Fastrack: Fast IO for Secure ML using GPU TEEs [7.758531952461963]
GPU-based Trusted Execution Environments (TEEs) offer secure, high-performance solutions.
However, CPU-to-GPU communication overheads significantly hinder performance.
This paper analyzes the NVIDIA H100 TEE protocols and identifies three key overheads.
We propose Fastrack, which optimizes the IO path with 1) direct GPU TEE communication, 2) parallelized authentication, and 3) decryption overlapped with PCIe transmission.
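The third optimization is essentially a two-stage producer-consumer pipeline over data chunks. A toy Python sketch of that overlap idea follows; the `transfer` and `decrypt` stages merely simulate work with sleeps and are not Fastrack's implementation:

```python
import threading, queue, time

def transfer(chunk):   # stage 1: simulated PCIe copy of one chunk
    time.sleep(0.01)
    return chunk

def decrypt(chunk):    # stage 2: simulated decryption of one chunk
    time.sleep(0.01)
    return chunk

def pipelined(chunks):
    """While chunk i is being decrypted, chunk i+1 is already in flight."""
    q = queue.Queue(maxsize=1)

    def producer():
        for c in chunks:
            q.put(transfer(c))
        q.put(None)  # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    out = []
    while (c := q.get()) is not None:
        out.append(decrypt(c))
    return out

start = time.perf_counter()
pipelined(list(range(16)))
print(f"pipelined: {time.perf_counter() - start:.2f}s vs ~{16 * 0.02:.2f}s serial")
```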
arXiv Detail & Related papers (2024-10-20T01:00:33Z)
- Skipping Computations in Multimodal LLMs [63.29737699997859]
This study investigates redundancy in Multimodal Large Language Models (MLLMs) during inference.
We propose different methods to skip computations, such as skipping entire blocks or individual FFN and self-attention layers.
Our findings validate that a significant amount of computation can be avoided at inference time.
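Block-level skipping is easy to picture in code: because transformer blocks are residual, bypassing one leaves a valid (if degraded) computation. A minimal sketch with stand-in layers and a hypothetical fixed skip mask, not the paper's policy:

```python
def forward(x, blocks, skip_mask):
    """Run a stack of residual blocks, bypassing those marked in skip_mask.

    Since each block computes x + f(x), skipping a block is an identity
    step that simply saves the block's FLOPs.
    """
    for block, skip in zip(blocks, skip_mask):
        if skip:
            continue          # reuse x unchanged
        x = x + block(x)      # normal residual update
    return x

blocks = [lambda v, k=k: 0.01 * k * v for k in range(8)]  # stand-in layers
skip_every_other = [i % 2 == 1 for i in range(8)]
print(forward(1.0, blocks, skip_every_other))
```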
arXiv Detail & Related papers (2024-10-12T09:21:45Z)
- TensorTEE: Unifying Heterogeneous TEE Granularity for Efficient Secure Collaborative Tensor Computing [13.983627699836376]
Existing heterogeneous TEE designs are inefficient for collaborative computing because the CPU and NPU use fine-grained and mismatched memory protection granularities.
We propose a unified tensor-granularity heterogeneous TEE for efficient secure collaborative computing.
The results show that TensorTEE improves the performance of Large Language Model (LLM) training workloads by 4.0x compared to existing work.
arXiv Detail & Related papers (2024-07-12T00:35:18Z)
- MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development.
This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource-constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z)
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization (MFU) against that of a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
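MFU has a standard definition: achieved training FLOP/s divided by the hardware peak, commonly using the ~6N FLOPs-per-token approximation for dense transformers. A small sketch of the arithmetic (all device and throughput numbers below are made-up placeholders):

```python
def mfu(n_params, tokens_per_s, peak_flops):
    """Model FLOP utilization: achieved training FLOP/s over hardware peak.

    Uses the standard ~6 * n_params FLOPs per trained token approximation
    (forward + backward) for dense transformers.
    """
    return 6 * n_params * tokens_per_s / peak_flops

# Illustrative edge accelerator vs. data-center GPU comparison
# (throughput and peak figures are hypothetical, not from the paper).
print(f"edge: {mfu(1.1e9, tokens_per_s=300, peak_flops=30e12):.1%}")
print(f"data center: {mfu(1.1e9, tokens_per_s=40_000, peak_flops=989e12):.1%}")
```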
arXiv Detail & Related papers (2023-10-04T20:27:20Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability introduced by peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around the TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
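The two-step decomposition can be mimicked in plain Python: a small primitive that owns all the dense math on one tile, and outer logical loops that only schedule tiles. This NumPy sketch illustrates the separation of concerns only; the actual framework JIT-compiles such loops rather than interpreting them:

```python
import numpy as np

def gemm_tile(C, A, B):
    """The computational core: a dense micro-GEMM on one tile (TPP-like primitive)."""
    C += A @ B  # updates the view of C in place

def tiled_matmul(A, B, tile=64):
    """The logical loops: iterate over tiles, delegating all math to the primitive.

    The loop order and blocking can change freely without touching gemm_tile,
    mirroring the separation the framework advocates.
    """
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                gemm_tile(C[i:i+tile, j:j+tile],
                          A[i:i+tile, k:k+tile],
                          B[k:k+tile, j:j+tile])
    return C

A, B = np.random.rand(128, 128), np.random.rand(128, 128)
assert np.allclose(tiled_matmul(A, B), A @ B)
```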
arXiv Detail & Related papers (2023-04-25T05:04:44Z)