Scaling up Privacy-Preserving ML: A CKKS Implementation of Llama-2-7B
- URL: http://arxiv.org/abs/2601.18511v1
- Date: Mon, 26 Jan 2026 14:17:23 GMT
- Title: Scaling up Privacy-Preserving ML: A CKKS Implementation of Llama-2-7B
- Authors: Jaiyoung Park, Sejin Park, Jai Hyun Park, Jung Ho Ahn, Jung Hee Cheon, Guillaume Hanrot, Jung Woo Kim, Minje Park, Damien Stehlé,
- Abstract summary: homomorphic encryption (FHE) has emerged as a primary solution to provide non-interactive confidential language models (LLMs)<n>We propose an FHE-based private LLM inference solution that allows thousands of input tokens with only a part of them being encrypted.<n>We describe a CKKS-based end-to-end implementation of Llama-2-7B private inference for up to 4096 input tokens, of which the last 128 are encrypted.
- Score: 20.74505614207065
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As large language models (LLMs) become ubiquitous, privacy concerns pertaining to inference inputs keep growing. In this context, fully homomorphic encryption (FHE) has emerged as a primary cryptographic solution to provide non-interactive confidential LLM inference. Existing solutions scale poorly with the input token length, and hence focus either on small models or larger models with a small number of input tokens. They also suffer from the existence of large outlier values. These values have a strong impact on the evaluation of non-linear layers, leading to large-degree polynomial approximation and thus heavy evaluation costs. We propose an FHE-based private LLM inference solution that allows thousands of input tokens with only a part of them being encrypted: this fits with a scenario where the context is benign and only part of the input is sensitive. To do so, we suggest an unbalanced chunked prefill framework that processes the private and public parts of the input tokens differently. Our framework contains plaintext-plaintext, plaintext-ciphertext and ciphertext-ciphertext computational components. We adopt different strategies and ingredients for each component. We also devise new homomorphic algorithms for specific matrix multiplication and polynomial evaluation tasks encountered during LLM inference. Furthermore, without retraining, we tailor the LLM inference algorithm to reduce the ranges of outlier values: we leverage machine learning strategies (token prepending and rotations) to mitigate the impact of the outliers on non-linear layers. Based on these ingredients, we describe a CKKS-based end-to-end implementation of Llama-2-7B private inference for up to 4096 input tokens, of which the last 128 are encrypted. On a cluster of 8~NVIDIA RTX-4090 GPUs, inference takes 85s for summarization and 33s for generation per output token.
Related papers
- Efficient Decoding Methods for Language Models on Encrypted Data [32.58944595512403]
Homomorphic encryption (HE) enables computation on encrypted data for secure inference.<n>Neural text generation requires decoding methods like argmax and sampling, which are non-polynomial and thus computationally expensive under encryption.<n>We introduce cutmax, an HE-friendly argmax algorithm that reduces cipher operations compared to prior methods, enabling practical greedy decoding under encryption.
arXiv Detail & Related papers (2025-09-10T08:23:14Z) - DP-Fusion: Token-Level Differentially Private Inference for Large Language Models [51.71591819896191]
Large language models (LLMs) do not preserve privacy at inference-time.<n> DP-Fusion provably bounds the influence a set of tokens in the context can have on the LLM's output.<n>We show that our method creates token-level provably privatized documents with substantially improved theoretical and empirical privacy.
arXiv Detail & Related papers (2025-07-06T20:49:39Z) - Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
evaluating constraint on every token can be prohibitively expensive.<n> LCD can distort the global distribution over strings, sampling tokens based only on local information.<n>We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z) - END: Early Noise Dropping for Efficient and Effective Context Denoising [60.24648712022382]
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks.<n>They are often distracted by irrelevant or noisy context in input sequences that degrades output quality.<n>We introduce Early Noise Dropping (textscEND), a novel approach to mitigate this issue without requiring fine-tuning the LLMs.
arXiv Detail & Related papers (2025-02-26T08:07:17Z) - CipherPrune: Efficient and Scalable Private Transformer Inference [12.853162687405465]
Private Transformer inference using cryptographic protocols offers promising solutions for privacy-preserving machine learning.<n>However, it still faces significant runtime overhead (efficiency issues) and challenges in handling long-token inputs.<n>We propose cipheritCipherPrune, an efficient and scalable private inference framework.
arXiv Detail & Related papers (2025-02-24T02:27:54Z) - Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection [49.15148871877941]
Next-token distribution outputs offer a theoretically appealing approach for detection of large language models (LLMs)<n>We propose the Perplexity Attention Weighted Network (PAWN), which uses the last hidden states of the LLM and positions to weight the sum of a series of features based on metrics from the next-token distribution across the sequence length.<n>PAWN shows competitive and even better performance in-distribution than the strongest baselines with a fraction of their trainable parameters.
arXiv Detail & Related papers (2025-01-07T17:00:49Z) - Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.<n>To reduce inference costs, one can either downsize the Large Language Models (LLMs) or reduce the number of input tokens needed to represent the image.<n>We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z) - Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models [79.70436109672599]
We derive non-vacuous generalization bounds for large language models as large as LLaMA2-70B.
Our work achieves the first non-vacuous bounds for models that are deployed in practice and generate high-quality text.
arXiv Detail & Related papers (2024-07-25T16:13:58Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests.<n>First, existing methods often use coarse-grained taxonomy of unsafe topics, and are over-representing some fine-grained topics.<n>Second, linguistic characteristics and formatting of prompts are often overlooked, like different languages, dialects, and more -- which are only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - Memorization for Good: Encryption with Autoregressive Language Models [8.645826579841692]
We propose the first symmetric encryption algorithm with autoregressive language models (SELM)
We show that autoregressive LMs can encode arbitrary data into a compact real-valued vector (i.e., encryption) and then losslessly decode the vector to the original message (i.e. decryption) via random subspace optimization and greedy decoding.
arXiv Detail & Related papers (2023-05-15T05:42:34Z) - Post-Quantum Cryptography(PQC): Generalized ElGamal Cipher over GL(8,F251) [0.0]
Post-quantum cryptography (PQC) attempts to find cryptographic protocols resistant to attacks.<n>This paper focuses on an asymmetric cipher based on a generalized ElGamal non-arbitrated protocol.
arXiv Detail & Related papers (2017-02-12T22:50:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.