Efficient Decoding Methods for Language Models on Encrypted Data
- URL: http://arxiv.org/abs/2509.08383v1
- Date: Wed, 10 Sep 2025 08:23:14 GMT
- Title: Efficient Decoding Methods for Language Models on Encrypted Data
- Authors: Matan Avitan, Moran Baruch, Nir Drucker, Itamar Zimerman, Yoav Goldberg,
- Abstract summary: Homomorphic encryption (HE) enables computation on encrypted data for secure inference. Neural text generation requires decoding methods like argmax and sampling, which are non-polynomial and thus computationally expensive under encryption. We introduce cutmax, an HE-friendly argmax algorithm that reduces ciphertext operations compared to prior methods, enabling practical greedy decoding under encryption.
- Score: 32.58944595512403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) power modern AI applications, but processing sensitive data on untrusted servers raises privacy concerns. Homomorphic encryption (HE) enables computation on encrypted data for secure inference. However, neural text generation requires decoding methods like argmax and sampling, which are non-polynomial and thus computationally expensive under encryption, creating a significant performance bottleneck. We introduce cutmax, an HE-friendly argmax algorithm that reduces ciphertext operations compared to prior methods, enabling practical greedy decoding under encryption. We also propose the first HE-compatible nucleus (top-p) sampling method, leveraging cutmax for efficient stochastic decoding with provable privacy guarantees. Both techniques are polynomial, supporting efficient inference in privacy-preserving settings. Moreover, their differentiability facilitates gradient-based sequence-level optimization as a polynomial alternative to straight-through estimators. We further provide strong theoretical guarantees for cutmax, proving it converges globally to a unique two-level fixed point, independent of the input values beyond the identity of the maximizer, which explains its rapid convergence in just a few iterations. Evaluations on realistic LLM outputs show latency reductions of 24x-35x over baselines, advancing secure text generation.
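The abstract describes cutmax as a polynomial iteration that converges to a two-level fixed point, i.e. a (near) one-hot indicator of the maximizer. A minimal plaintext sketch can illustrate the general idea; note that `poly_argmax` below is an assumed stand-in using repeated squaring and renormalization, not the paper's actual update rule, and `top_p_sample` is the standard cleartext formulation of nucleus sampling, not the HE-compatible, sort-free variant the paper proposes:

```python
import numpy as np

def poly_argmax(logits: np.ndarray, iters: int = 8) -> np.ndarray:
    """Polynomial-only argmax sketch: repeated squaring widens the gap
    between the maximum and all other entries until the vector is nearly
    one-hot. Squaring and summation are HE-friendly; under encryption the
    division would be replaced by a polynomial inverse approximation."""
    x = logits - logits.min() + 1e-9   # shift to strictly positive values
    x = x / x.sum()                    # initial normalization
    for _ in range(iters):
        x = x * x                      # polynomial amplification step
        x = x / x.sum()                # renormalize to keep values bounded
    return x

def top_p_sample(probs: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Plaintext reference for nucleus (top-p) sampling: keep the smallest
    set of tokens whose cumulative probability reaches p, renormalize over
    that set, and sample from it."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]          # token indices, most probable first
    cdf = np.cumsum(probs[order])
    cutoff = np.searchsorted(cdf, p) + 1     # size of the nucleus
    nucleus = order[:cutoff]
    w = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=w))
```

For example, `poly_argmax(np.array([0.1, 2.0, -1.0, 1.5]))` returns a vector that is nearly one-hot at index 1 after a handful of iterations, consistent with the paper's claim of rapid convergence independent of the input values beyond the identity of the maximizer.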
Related papers
- Volley Revolver: A Novel Matrix-Encoding Method for Privacy-Preserving Deep Learning (Inference++) [0.0]
Homomorphic encryption has emerged as a promising approach for enabling secure machine learning in untrusted environments. In this paper, we propose an improved encoding and computation framework that removes the requirement that a single encrypted ciphertext must fully contain one input image. Our method reformulates the data layout and homomorphic operations to partition high-resolution inputs across multiple ciphertexts.
arXiv Detail & Related papers (2025-12-21T08:40:31Z) - FastFHE: Packing-Scalable and Depthwise-Separable CNN Inference Over FHE [8.949311128871928]
We propose FastFHE to accelerate model inference while simultaneously maintaining high inference accuracy over fully homomorphic encryption. First, we propose a new scalable ciphertext data-packing scheme to save time and storage consumption. We also derive a BN dot-product fusion matrix that merges the ciphertext convolutional layer with the batch-normalization layer without incurring extra multiplicative depth.
arXiv Detail & Related papers (2025-11-27T13:14:42Z) - From Bits to Rounds: Parallel Decoding with Exploration for Diffusion Language Models [19.97248408121574]
Diffusion Language Models (DLMs) offer comparable accuracy with faster inference speed via parallel decoding. High-confidence tokens carry negligible information, and strictly relying on them limits the effective progress made in each decoding round. We propose Explore-Then-Exploit (ETE), a training-free decoding strategy that maximizes information throughput and decoding efficiency.
arXiv Detail & Related papers (2025-11-26T06:38:37Z) - BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms [101.9736063064503]
Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs). This paper proposes a training-free online learning framework to adaptively choose the hyperparameter configuration for speculative decoding as text is being generated.
arXiv Detail & Related papers (2025-05-21T05:56:31Z) - Improving Efficiency in Federated Learning with Optimized Homomorphic Encryption [9.759156649755235]
Federated learning (FL) is a method used in machine learning to allow multiple devices to work together on a model without sharing their private data. A key enabler of privacy in FL is homomorphic encryption (HE), which allows computations to be performed directly on encrypted data. My research introduces a novel algorithm to address these inefficiencies while maintaining robust privacy guarantees.
arXiv Detail & Related papers (2025-04-03T19:50:07Z) - FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely "hidden transfer", which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - When approximate design for fast homomorphic computation provides differential privacy guarantees [0.08399688944263842]
Differential privacy (DP) and cryptographic primitives are popular countermeasures against privacy attacks.
In this paper, we design SHIELD, a probabilistic approximation algorithm for the argmax operator.
Even if SHIELD could have other applications, we here focus on one setting and seamlessly integrate it in the SPEED collaborative training framework.
arXiv Detail & Related papers (2023-04-06T09:38:01Z) - THE-X: Privacy-Preserving Transformer Inference with Homomorphic Encryption [112.02441503951297]
Privacy-preserving inference of transformer models is on the demand of cloud service users.
We introduce THE-X, an approximation approach for transformers, which enables privacy-preserving inference of pre-trained models.
arXiv Detail & Related papers (2022-06-01T03:49:18Z) - Efficient Batch Homomorphic Encryption for Vertically Federated XGBoost [9.442606239058806]
In this paper, we study the efficiency problem of adapting widely used XGBoost model in real-world applications to vertical federated learning setting.
We propose a novel batch homomorphic encryption method that cuts the encryption-related computation and transmission costs nearly in half.
arXiv Detail & Related papers (2021-12-08T12:41:01Z) - FFConv: Fast Factorized Neural Network Inference on Encrypted Data [9.868787266501036]
We propose a low-rank factorization method called FFConv to unify convolution and ciphertext packing.
Compared to prior art LoLa and Falcon, our method reduces the inference latency by up to 87% and 12%, respectively.
arXiv Detail & Related papers (2021-02-06T03:10:13Z) - Faster Secure Data Mining via Distributed Homomorphic Encryption [108.77460689459247]
Homomorphic Encryption (HE) has been receiving increasing attention for its capability to perform computations over encrypted data.
We propose a novel general distributed HE-based data mining framework towards one step of solving the scaling problem.
We verify the efficiency and effectiveness of our new framework by testing over various data mining algorithms and benchmark data-sets.
arXiv Detail & Related papers (2020-06-17T18:14:30Z) - Cryptotree: fast and accurate predictions on encrypted structured data [0.0]
Homomorphic Encryption (HE) is acknowledged for its ability to allow computation on encrypted data, where both the input and output are encrypted.
We propose Cryptotree, a framework that enables the use of Random Forests (RF), a considerably more powerful learning procedure than linear regression.
arXiv Detail & Related papers (2020-06-15T11:48:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.