PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption
- URL: http://arxiv.org/abs/2411.03357v1
- Date: Mon, 04 Nov 2024 19:58:53 GMT
- Title: PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption
- Authors: Yifan Tan, Cheng Tan, Zeyu Mi, Haibo Chen
- Abstract summary: Data encryption in GPU confidential computing incurs a significant performance overhead.
We introduce PipeLLM, a user-transparent runtime system that hides this overhead.
We propose speculative pipelined encryption to predict the data requiring encryption.
- Score: 5.667756833450548
- License:
- Abstract: Confidential computing on GPUs, like NVIDIA H100, mitigates the security risks of outsourced Large Language Models (LLMs) by implementing strong isolation and data encryption. Nonetheless, this encryption incurs a significant performance overhead, causing throughput drops of up to 52.8 percent and 88.2 percent when serving OPT-30B and OPT-66B, respectively. To address this challenge, we introduce PipeLLM, a user-transparent runtime system. PipeLLM removes the overhead by overlapping encryption and GPU computation through pipelining - an idea inspired by CPU instruction pipelining - thereby effectively concealing the latency increase caused by encryption. The primary technical challenge is that, unlike CPUs, the encryption module lacks prior knowledge of the specific data needing encryption until it is requested by the GPUs. To this end, we propose speculative pipelined encryption to predict the data requiring encryption by analyzing the serving patterns of LLMs. Further, we have developed an efficient, low-cost pipeline relinquishing approach for instances of incorrect predictions. Our experiments on an NVIDIA H100 GPU show that compared with vanilla systems without confidential computing (e.g., vLLM, PEFT, and FlexGen), PipeLLM incurs modest overhead (less than 19.6 percent in throughput) across various LLM sizes, from 13B to 175B.
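The control flow of speculative pipelined encryption can be illustrated with a minimal sketch based only on the abstract above. Everything in it is hypothetical: the chunk layout, the sequential next-chunk predictor, and the XOR-based encrypt() stand-in are placeholders, not PipeLLM's implementation, which operates on CPU-GPU DMA buffers inside a confidential-computing stack.

```python
# Toy sketch of speculative pipelined encryption (assumed names, not PipeLLM's code).
import hashlib
import queue
import threading


def encrypt(chunk_id, data):
    """Stand-in for real authenticated encryption (e.g., AES-GCM)."""
    key = hashlib.sha256(str(chunk_id).encode()).digest()
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


class SpeculativeEncryptor:
    """Encrypts the chunk it predicts the GPU will request next, so that
    encryption overlaps with GPU computation on the current chunk."""

    def __init__(self, chunks):
        self.chunks = chunks           # chunk_id -> plaintext bytes
        self.speculated = {}           # chunk_id -> ciphertext prepared ahead of time
        self.requests = queue.Queue()  # chunk ids the "GPU" actually asks for
        self.results = queue.Queue()   # (chunk_id, ciphertext) handed back
        threading.Thread(target=self._worker, daemon=True).start()

    @staticmethod
    def _predict_next(chunk_id):
        # LLM serving tends to stream weights/KV blocks in a regular order,
        # so a simple sequential predictor illustrates the idea.
        return chunk_id + 1

    def _worker(self):
        predicted = self._predict_next(-1)  # speculate the first chunk
        while True:
            # Pre-encrypt the predicted chunk before the request arrives;
            # in the real system this work overlaps with GPU computation.
            if predicted in self.chunks and predicted not in self.speculated:
                self.speculated[predicted] = encrypt(predicted, self.chunks[predicted])
            wanted = self.requests.get()
            if wanted in self.speculated:
                # Prediction hit: the ciphertext is already prepared.
                self.results.put((wanted, self.speculated.pop(wanted)))
            else:
                # Misprediction: relinquish speculative work and encrypt on demand.
                self.speculated.clear()
                self.results.put((wanted, encrypt(wanted, self.chunks[wanted])))
            predicted = self._predict_next(wanted)


if __name__ == "__main__":
    enc = SpeculativeEncryptor({i: bytes([i]) * 16 for i in range(8)})
    for chunk_id in (0, 1, 2, 5, 6):  # the jump to 5 exercises the relinquish path
        enc.requests.put(chunk_id)
        _, ciphertext = enc.results.get()
        print(chunk_id, ciphertext.hex())
```

In this toy run, the in-order requests are served from speculatively prepared ciphertext, while the out-of-order request for chunk 5 discards the stale speculation and falls back to on-demand encryption.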
Related papers
- Fastrack: Fast IO for Secure ML using GPU TEEs [7.758531952461963]
GPU-based Trusted Execution Environments (TEEs) offer secure, high-performance solutions.
CPU-to-GPU communication overheads significantly hinder performance.
This paper analyzes Nvidia H100 TEE protocols and identifies three key overheads.
We propose Fastrack, which optimizes performance with 1) direct GPU TEE communication, 2) parallelized authentication, and 3) overlapping decryption with PCI-e transmission.
arXiv Detail & Related papers (2024-10-20T01:00:33Z)
- Cheddar: A Swift Fully Homomorphic Encryption Library for CUDA GPUs [2.613335121517245]
Fully homomorphic encryption (FHE) is a cryptographic technology capable of resolving security and privacy problems in cloud computing by encrypting data in use.
FHE introduces tremendous computational overhead for processing encrypted data, causing FHE workloads to become 2-6 orders of magnitude slower than their unencrypted counterparts.
We propose Cheddar, an FHE library for GPU, which demonstrates significantly faster performance compared to prior GPU implementations.
arXiv Detail & Related papers (2024-07-17T23:49:18Z)
- NTTSuite: Number Theoretic Transform Benchmarks for Accelerating Encrypted Computation [2.704681057324485]
Homomorphic encryption (HE) is a cryptographic system that enables computation directly on encrypted data.
HE has seen little adoption due to extremely high computational overheads, rendering it impractical.
We develop a benchmark suite, named NTTSuite, to enable researchers to better address these overheads.
We find our implementation outperforms the state-of-the-art by 30%.
arXiv Detail & Related papers (2024-05-18T17:44:17Z)
- GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption [33.87964584665433]
Fully Homomorphic Encryption (FHE) enables the processing of encrypted data without decrypting it.
FHE introduces a slowdown of up to five orders of magnitude as compared to the same computation using plaintext data.
We propose GME, which combines three key microarchitectural extensions along with a compile-time optimization to the current AMD CDNA GPU architecture.
arXiv Detail & Related papers (2023-09-20T01:50:43Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the potential of vast, untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability and heterogeneity of peers and devices.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- ArctyrEX : Accelerated Encrypted Execution of General-Purpose Applications [6.19586646316608]
Fully Homomorphic Encryption (FHE) is a cryptographic method that guarantees the privacy and security of user data during computation.
We develop new techniques for accelerated encrypted execution and demonstrate the significant performance advantages of our approach.
arXiv Detail & Related papers (2023-06-19T15:15:41Z)
- THE-X: Privacy-Preserving Transformer Inference with Homomorphic Encryption [112.02441503951297]
Privacy-preserving inference of transformer models is in demand among cloud service users.
We introduce THE-X, an approximation approach for transformers, which enables privacy-preserving inference of pre-trained models.
arXiv Detail & Related papers (2022-06-01T03:49:18Z)
- Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show, that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often have a large number of parameters and require heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
- Faster Secure Data Mining via Distributed Homomorphic Encryption [108.77460689459247]
Homomorphic Encryption (HE) has recently received increasing attention for its capability to perform computations over encrypted data.
We propose a novel general distributed HE-based data mining framework as one step towards solving the scaling problem.
We verify the efficiency and effectiveness of our new framework by testing over various data mining algorithms and benchmark data-sets.
arXiv Detail & Related papers (2020-06-17T18:14:30Z)
- CryptoSPN: Privacy-preserving Sum-Product Network Inference [84.88362774693914]
We present CryptoSPN, a framework for privacy-preserving inference of sum-product networks (SPNs).
CryptoSPN achieves highly efficient and accurate inference in the order of seconds for medium-sized SPNs.
arXiv Detail & Related papers (2020-02-03T14:49:18Z)