Differentially Private and Communication Efficient Large Language Model Split Inference via Stochastic Quantization and Soft Prompt
- URL: http://arxiv.org/abs/2602.11513v1
- Date: Thu, 12 Feb 2026 03:13:16 GMT
- Title: Differentially Private and Communication Efficient Large Language Model Split Inference via Stochastic Quantization and Soft Prompt
- Authors: Yujie Gu, Richeng Jin, Xiaoyu Ji, Yier Jin, Wenyuan Xu
- Abstract summary: Large Language Models (LLMs) have achieved remarkable performance and received significant research interest. Existing approaches propose to allow users to obfuscate token embeddings before transmission and to use local models for denoising. We propose DEL, a framework for Differentially private and communication Efficient LLM split inference.
- Score: 33.701746954914135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have achieved remarkable performance and received significant research interest. Their enormous computational demands, however, hinder local deployment on devices with limited resources. The current prevalent LLM inference paradigms require users to send queries to service providers for processing, which raises critical privacy concerns. Existing approaches propose to allow users to obfuscate token embeddings before transmission and to use local models for denoising. Nonetheless, transmitting the token embeddings and deploying local models may result in excessive communication and computation overhead, preventing practical implementation. In this work, we propose DEL, a framework for Differentially private and communication Efficient LLM split inference. More specifically, an embedding projection module and a differentially private stochastic quantization mechanism are proposed to reduce the communication overhead in a privacy-preserving manner. To eliminate the need for local models, we adapt soft prompts at the server side to compensate for the utility degradation caused by privacy. To the best of our knowledge, this is the first work that utilizes soft prompts to improve the trade-off between privacy and utility in LLM inference, and extensive experiments on text generation and natural language understanding benchmarks demonstrate the effectiveness of the proposed method.
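The abstract's pipeline shape (project the embedding to a lower dimension, then transmit a stochastically quantized version) can be sketched as below. This is an illustrative sketch only: the projection here is a fixed random matrix rather than DEL's learned projection module, and the plain unbiased stochastic rounding shown is not calibrated to a formal differential-privacy budget as the paper's mechanism is.

```python
import numpy as np

def project_and_quantize(emb, proj, clip=1.0, levels=16, rng=None):
    """Sketch: project a token embedding to a lower dimension, clip it to
    bound sensitivity, then stochastically round each coordinate to a
    finite grid so only small integer indices are transmitted.
    (Illustrative; DEL calibrates the randomness to a DP budget.)"""
    rng = rng or np.random.default_rng()
    z = proj @ emb                        # learned projection in the paper; random here
    z = np.clip(z, -clip, clip)           # bounded range -> bounded sensitivity
    grid = np.linspace(-clip, clip, levels)
    step = grid[1] - grid[0]
    lo = np.clip(np.floor((z + clip) / step).astype(int), 0, levels - 2)
    p_up = (z - grid[lo]) / step          # round up w.p. p_up -> unbiased
    idx = lo + (rng.random(z.shape) < p_up)
    return idx.astype(np.uint8)           # indices, not floats, go over the wire
```

Sending 64 four-bit indices instead of a 768-dimensional float32 embedding is where the communication saving comes from; the stochastic (rather than nearest-neighbor) rounding keeps the quantizer unbiased in expectation.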
Related papers
- Privacy-Preserving Mechanisms Enable Cheap Verifiable Inference of LLMs [33.54139088666698]
Large language models (LLMs) continue to grow in size, leading to increased use of third-party hosting services. Existing tools to verify inference typically rely on cryptographic methods such as zero-knowledge proofs (ZKPs). We develop a new insight: given a method for performing private LLM inference, one can obtain forms of verified inference at marginal extra cost.
arXiv Detail & Related papers (2026-02-19T10:15:51Z) - Implicit Federated In-context Learning For Task-Specific LLM Fine-Tuning [10.042856500868805]
We propose the Implicit Federated In-Context Learning (IFed-ICL) framework. IFed-ICL draws inspiration from federated learning to establish a novel distributed collaborative paradigm. Compared to traditional methods, IFed-ICL avoids the extensive parameter updates required by conventional fine-tuning.
arXiv Detail & Related papers (2025-11-10T06:34:29Z) - Federated Learning-Enabled Hybrid Language Models for Communication-Efficient Token Transmission [87.68447072141402]
Hybrid Language Models (HLMs) combine the low-latency efficiency of Small Language Models (SLMs) on edge devices with the high accuracy of Large Language Models (LLMs) on centralized servers. We propose FedHLM, a communication-efficient HLM framework that integrates uncertainty-aware inference with Federated Learning (FL).
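The core routing idea behind such hybrid SLM/LLM inference can be sketched in a few lines: keep the on-device draft token when the small model is confident, otherwise escalate to the server. The max-softmax confidence measure and fixed threshold here are illustrative assumptions; FedHLM's actual uncertainty handling and its federated coordination are not reproduced.

```python
import numpy as np

def route_token(slm_logits, threshold=0.7):
    """Sketch of uncertainty-aware routing: emit the local SLM token when
    its softmax confidence clears a threshold, else defer to the server
    LLM (costing one uplink transmission)."""
    probs = np.exp(slm_logits - slm_logits.max())   # stable softmax
    probs /= probs.sum()
    if probs.max() >= threshold:
        return int(probs.argmax()), False           # local token, no uplink
    return None, True                               # uncertain: send to server LLM
```

Every token resolved locally is one round trip saved, which is why the communication efficiency of the hybrid scheme hinges on how often the small model is confident.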
arXiv Detail & Related papers (2025-06-30T02:56:11Z) - PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection [68.8373788348678]
Visual instruction tuning adapts pre-trained Multimodal Large Language Models to follow human instructions. PRISM is the first training-free framework for efficient visual instruction selection. It reduces the end-to-end time for data selection and model tuning to just 30% of conventional pipelines.
arXiv Detail & Related papers (2025-02-17T18:43:41Z) - Boosting Private Domain Understanding of Efficient MLLMs: A Tuning-free, Adaptive, Universal Prompt Optimization Framework [60.26747209785186]
Efficient multimodal large language models (EMLLMs) reduce model size and computational costs and are often deployed on resource-constrained devices. Existing open-source EMLLMs rarely have access to private domain-specific data during the pre-training process. We propose a tunIng-free, aDaptivE, universAL Prompt Optimization Framework.
arXiv Detail & Related papers (2024-12-27T15:21:17Z) - FedDTPT: Federated Discrete and Transferable Prompt Tuning for Black-Box Large Language Models [14.719919025265224]
Fine-tuning large language models (LLMs) with data from specific scenarios poses privacy leakage risks.
We propose for the first time a federated discrete and transferable prompt tuning, namely FedDTPT, for black-box large language models.
Our approach achieves higher accuracy, reduced communication overhead, and robustness to non-iid data in a black-box setting.
arXiv Detail & Related papers (2024-11-01T19:19:23Z) - CELA: Cost-Efficient Language Model Alignment for CTR Prediction [70.65910069412944]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems. Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs). We propose Cost-Efficient Language Model Alignment (CELA) for CTR prediction.
arXiv Detail & Related papers (2024-05-17T07:43:25Z) - ConfusionPrompt: Practical Private Inference for Online Large Language Models [3.8134804426693094]
State-of-the-art large language models (LLMs) are typically deployed as online services, requiring users to transmit detailed prompts to cloud servers.
We introduce ConfusionPrompt, a novel framework for private LLM inference that protects user privacy by decomposing the original prompt into smaller sub-prompts.
We show that ConfusionPrompt achieves significantly higher utility than local inference methods using open-source models and perturbation-based techniques.
arXiv Detail & Related papers (2023-12-30T01:26:42Z) - Split-and-Denoise: Protect large language model inference with local differential privacy [2.572566198588905]
Split-N-Denoise (SnD) is a private inference framework that splits the model to execute the token embedding layer on the client side at minimal computational cost.
We show SnD's effectiveness in optimizing the privacy-utility tradeoff across various LLM architectures and diverse downstream tasks.
arXiv Detail & Related papers (2023-10-13T14:17:33Z) - DisPFL: Towards Communication-Efficient Personalized Federated Learning via Decentralized Sparse Training [84.81043932706375]
We propose a novel personalized federated learning framework in a decentralized (peer-to-peer) communication protocol named Dis-PFL.
Dis-PFL employs personalized sparse masks to customize sparse local models on the edge.
We demonstrate that our method can easily adapt to heterogeneous local clients with varying computation complexities.
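The personalized-sparse-mask idea can be sketched minimally: each client keeps a binary mask and only trains and communicates the unmasked coordinates. The magnitude-based top-k mask below is an illustrative stand-in; Dis-PFL derives its masks from decentralized local training rather than a one-shot magnitude cut.

```python
import numpy as np

def make_topk_mask(params, keep=0.2):
    """Illustrative mask: keep the largest `keep` fraction of weights per
    tensor (a common magnitude heuristic, not Dis-PFL's exact procedure)."""
    mask = {}
    for k, v in params.items():
        kth = np.quantile(np.abs(v), 1.0 - keep)
        mask[k] = (np.abs(v) >= kth).astype(v.dtype)
    return mask

def apply_personal_mask(params, mask):
    """Zero out masked coordinates: only the surviving entries are trained
    and exchanged with peers, cutting communication roughly by `1 - keep`."""
    return {k: v * mask[k] for k, v in params.items()}
```

Because each client's mask can differ, the surviving sub-network is personalized while the communication cost scales with the mask density rather than the full model size.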
arXiv Detail & Related papers (2022-06-01T02:20:57Z) - DP-NormFedAvg: Normalizing Client Updates for Privacy-Preserving Federated Learning [48.064786028195506]
We propose to have the clients send a quantized version of only the unit vector of their local updates, discarding the magnitude information.
We also introduce QTDL, a new differentially private quantization mechanism for unit-norm vectors.
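The normalize-then-quantize step can be sketched as below. Plain unbiased stochastic rounding is shown in place of QTDL, whose differentially private calibration is not reproduced here; the grid size is an assumption.

```python
import numpy as np

def normalize_and_quantize(update, levels=16, rng=None):
    """Sketch of the DP-NormFedAvg idea: transmit a quantized unit vector
    of the client update only, discarding its magnitude (the server uses
    a fixed or public scale instead)."""
    rng = rng or np.random.default_rng()
    u = update / max(np.linalg.norm(update), 1e-12)   # unit-norm direction only
    grid = np.linspace(-1.0, 1.0, levels)
    step = grid[1] - grid[0]
    lo = np.clip(np.floor((u + 1.0) / step).astype(int), 0, levels - 2)
    p_up = (u - grid[lo]) / step                      # unbiased stochastic rounding
    return grid[lo + (rng.random(u.shape) < p_up)]
```

Normalizing first bounds every client's contribution identically, which is what makes the per-round sensitivity (and hence the DP noise) independent of individual update magnitudes.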
arXiv Detail & Related papers (2021-06-13T21:23:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.