Comet: Accelerating Private Inference for Large Language Model by Predicting Activation Sparsity
- URL: http://arxiv.org/abs/2505.07239v1
- Date: Mon, 12 May 2025 05:29:30 GMT
- Title: Comet: Accelerating Private Inference for Large Language Model by Predicting Activation Sparsity
- Authors: Guang Yan, Yuhui Zhang, Zimu Guo, Lutan Zhao, Xiaojun Chen, Chen Wang, Wenhao Wang, Dan Meng, Rui Hou,
- Abstract summary: Secure multi-party computation (MPC) is a promising solution to protect the privacy in LLM inference.<n>MPC requires frequent inter-server communication, causing high performance overhead.<n>We propose an efficient private inference system, Comet, which employs an accurate and fast predictor to predict the sparsity distribution of activation output.<n>Comet achieves a 1.87x-2.63x speedup and a 1.94x-2.64x communication reduction.
- Score: 21.74620410396962
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the growing use of large language models (LLMs) hosted on cloud platforms to offer inference services, privacy concerns about the potential leakage of sensitive information are escalating. Secure multi-party computation (MPC) is a promising solution to protect the privacy in LLM inference. However, MPC requires frequent inter-server communication, causing high performance overhead. Inspired by the prevalent activation sparsity of LLMs, where most neuron are not activated after non-linear activation functions, we propose an efficient private inference system, Comet. This system employs an accurate and fast predictor to predict the sparsity distribution of activation function output. Additionally, we introduce a new private inference protocol. It efficiently and securely avoids computations involving zero values by exploiting the spatial locality of the predicted sparse distribution. While this computation-avoidance approach impacts the spatiotemporal continuity of KV cache entries, we address this challenge with a low-communication overhead cache refilling strategy that merges miss requests and incorporates a prefetching mechanism. Finally, we evaluate Comet on four common LLMs and compare it with six state-of-the-art private inference systems. Comet achieves a 1.87x-2.63x speedup and a 1.94x-2.64x communication reduction.
Related papers
- Cascade: Token-Sharded Private LLM Inference [31.561665382764076]
We propose a new multi-party inference protocol, Cascade, that avoids punitive costs by leveraging sharding in the sequence dimension to maintain privacy.<n>We demonstrate that Cascade is resistant to a generalization of a recent attack that is highly effective against other statistical privacy schemes.
arXiv Detail & Related papers (2025-07-07T17:37:16Z) - Federated Learning-Enabled Hybrid Language Models for Communication-Efficient Token Transmission [87.68447072141402]
Hybrid Language Models (HLMs) combine the low-latency efficiency of Small Language Models (SLMs) on edge devices with the high accuracy of Large Language Models (LLMs) on centralized servers.<n>We propose FedHLM, a communication-efficient HLM framework that integrates uncertainty-aware inference with Federated Learning (FL)
arXiv Detail & Related papers (2025-06-30T02:56:11Z) - Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference [49.77734021302196]
We propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework.<n>To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features.<n>Results show that TOFC achieves up to 60% reduction in data transmission overhead and 50% reduction in system latency.
arXiv Detail & Related papers (2025-03-17T08:37:22Z) - PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders [8.483679748399037]
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing but pose privacy risks by memorizing and leaking Personally Identifiable Information (PII)<n>Existing mitigation strategies, such as differential privacy and neuron-level interventions, often degrade model utility or fail to effectively prevent leakage.<n>We introduce PrivacyScalpel, a novel privacy-preserving framework that leverages interpretability techniques to identify and mitigate PII leakage while maintaining performance.
arXiv Detail & Related papers (2025-03-14T09:31:01Z) - PriFFT: Privacy-preserving Federated Fine-tuning of Large Language Models via Function Secret Sharing [20.148411915688175]
Fine-tuning large language models (LLMs) raises privacy concerns due to the risk of exposing sensitive training data.<n>Recent studies show that adversaries can still infer private information from model updates in FL.<n>We propose PriFFT, a privacy-preserving federated fine-tuning mechanism.
arXiv Detail & Related papers (2025-03-05T03:41:57Z) - The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems [26.528288876732617]
A set of new timing side channels can be exploited to infer confidential system prompts and those issued by other users.<n>These vulnerabilities echo security challenges observed in traditional computing systems.<n>We propose a token-by-token search algorithm to efficiently recover shared prompt prefixes in the caches.
arXiv Detail & Related papers (2024-09-30T06:55:00Z) - Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models [73.48675708831328]
We propose a novel parameter and computation efficient tuning method for Multi-modal Large Language Models (MLLMs)
The Efficient Attention Skipping (EAS) method evaluates the attention redundancy and skips the less important MHAs to speed up inference.
The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference speed.
arXiv Detail & Related papers (2024-03-22T14:20:34Z) - DPZero: Private Fine-Tuning of Language Models without Backpropagation [49.365749361283704]
We introduce DPZero, a novel private zeroth-order algorithm with nearly dimension-independent rates.
The memory efficiency of DPZero is demonstrated in privately fine-tuning RoBERTa and OPT on several downstream tasks.
arXiv Detail & Related papers (2023-10-14T18:42:56Z) - Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM
Inference with Transferable Prompt [96.24800696597707]
We introduce a new perspective to optimize this trade-off by prompting compressed models.
We propose a soft prompt learning method where we expose the compressed model to the prompt learning process.
Our experimental analysis suggests our soft prompt strategy greatly improves the performance of the 8x compressed LLaMA-7B model.
arXiv Detail & Related papers (2023-05-17T20:45:13Z) - Breaking the Communication-Privacy-Accuracy Tradeoff with
$f$-Differential Privacy [51.11280118806893]
We consider a federated data analytics problem in which a server coordinates the collaborative data analysis of multiple users with privacy concerns and limited communication capability.
We study the local differential privacy guarantees of discrete-valued mechanisms with finite output space through the lens of $f$-differential privacy (DP)
More specifically, we advance the existing literature by deriving tight $f$-DP guarantees for a variety of discrete-valued mechanisms.
arXiv Detail & Related papers (2023-02-19T16:58:53Z) - Over-the-Air Federated Learning with Privacy Protection via Correlated
Additive Perturbations [57.20885629270732]
We consider privacy aspects of wireless federated learning with Over-the-Air (OtA) transmission of gradient updates from multiple users/agents to an edge server.
Traditional perturbation-based methods provide privacy protection while sacrificing the training accuracy.
In this work, we aim at minimizing privacy leakage to the adversary and the degradation of model accuracy at the edge server.
arXiv Detail & Related papers (2022-10-05T13:13:35Z) - On Differential Privacy for Federated Learning in Wireless Systems with
Multiple Base Stations [90.53293906751747]
We consider a federated learning model in a wireless system with multiple base stations and inter-cell interference.
We show the convergence behavior of the learning process by deriving an upper bound on its optimality gap.
Our proposed scheduler improves the average accuracy of the predictions compared with a random scheduler.
arXiv Detail & Related papers (2022-08-25T03:37:11Z) - Wireless Federated Learning with Limited Communication and Differential
Privacy [21.328507360172203]
This paper investigates the role of dimensionality reduction in efficient communication and differential privacy (DP) of the local datasets at the remote users for over-the-air computation (AirComp)-based federated learning (FL) model.
arXiv Detail & Related papers (2021-06-01T15:23:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.