FastQuery: Communication-efficient Embedding Table Query for Private LLM Inference
- URL: http://arxiv.org/abs/2405.16241v1
- Date: Sat, 25 May 2024 13:58:45 GMT
- Title: FastQuery: Communication-efficient Embedding Table Query for Private LLM Inference
- Authors: Chenqi Lin, Tianshi Xu, Zebin Yang, Runsheng Wang, Ru Huang, Meng Li
- Abstract summary: We propose a private embedding table query optimization framework, dubbed FastQuery.
FastQuery features a communication-aware embedding table quantization algorithm and a one-hot-aware dense packing algorithm.
Compared to prior-art HE-based frameworks, i.e., Cheetah, Iron, and Bumblebee, FastQuery achieves more than $4.3\times$, $2.7\times$, and $1.3\times$ latency reduction, respectively.
- Score: 3.9340847245305732
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the fast evolution of large language models (LLMs), privacy concerns with user queries arise as they may contain sensitive information. Private inference based on homomorphic encryption (HE) has been proposed to protect user query privacy. However, a private embedding table query has to be formulated as a HE-based matrix-vector multiplication problem and suffers from enormous computation and communication overhead. We observe the overhead mainly comes from the neglect of 1) the one-hot nature of user queries and 2) the robustness of the embedding table to low bit-width quantization noise. Hence, in this paper, we propose a private embedding table query optimization framework, dubbed FastQuery. FastQuery features a communication-aware embedding table quantization algorithm and a one-hot-aware dense packing algorithm to simultaneously reduce both the computation and communication costs. Compared to prior-art HE-based frameworks, e.g., Cheetah, Iron, and Bumblebee, FastQuery achieves more than $4.3\times$, $2.7\times$, $1.3\times$ latency reduction, respectively and more than $75.7\times$, $60.2\times$, $20.2\times$ communication reduction, respectively, on both LLAMA-7B and LLAMA-30B.
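As a plaintext illustration (not FastQuery's protocol itself), the sketch below shows why an embedding lookup is naturally a one-hot matrix-vector product, and how a crude uniform quantizer shrinks the table's bit-width; the toy shapes and the 4-bit setting are illustrative assumptions.
```python
import numpy as np

vocab_size, hidden_dim = 1000, 64  # toy sizes; LLAMA-7B uses 32000 x 4096
table = np.random.randn(vocab_size, hidden_dim).astype(np.float32)

token_id = 123
one_hot = np.zeros(vocab_size, dtype=np.float32)
one_hot[token_id] = 1.0

# The lookup is exactly one_hot @ table; under HE the one-hot vector is
# encrypted, so the server never learns token_id.
embedding = one_hot @ table
assert np.allclose(embedding, table[token_id])

# FastQuery's second observation: the table tolerates low-bit quantization,
# which shrinks ciphertext payloads. A crude uniform quantizer as a sketch:
def quantize(w, bits=4):
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale).astype(np.int8), scale

q_table, scale = quantize(table)
```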
Related papers
- Communication-Efficient Publication of Sparse Vectors under Differential Privacy [2.9123921488295768]
We propose a differentially private algorithm for publishing matrices aggregated from sparse vectors. Our algorithm significantly reduces this cost to $O(\varepsilon m)$, where $\varepsilon$ is the privacy budget. We theoretically prove that our method yields results identical to those of randomized response.
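For context, a minimal sketch of classical randomized response, the mechanism the paper proves its method matches; the $\varepsilon$ value and the per-bit perturbation below are illustrative assumptions, not the paper's protocol.
```python
import math
import random

def randomized_response(bit: int, epsilon: float) -> int:
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_true = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if random.random() < p_true else 1 - bit

# Each nonzero coordinate of a sparse binary vector could be perturbed
# this way before aggregation into the published matrix.
noisy = [randomized_response(b, epsilon=1.0) for b in [0, 1, 1, 0]]
```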
arXiv Detail & Related papers (2025-06-25T08:25:46Z) - Boosting Private Domain Understanding of Efficient MLLMs: A Tuning-free, Adaptive, Universal Prompt Optimization Framework [60.26747209785186]
Efficient multimodal large language models (EMLLMs) reduce model size and computational costs and are often deployed on resource-constrained devices.
Existing open-source EMLLMs rarely have access to private domain-specific data during the pre-training process.
We propose a tuning-free, adaptive, universal prompt optimization framework.
arXiv Detail & Related papers (2024-12-27T15:21:17Z) - Reasoning Robustness of LLMs to Adversarial Typographical Errors [49.99118660264703]
Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning using Chain-of-Thought (CoT) prompting.
We study the reasoning robustness of LLMs to typographical errors, which can naturally occur in users' queries.
We design an Adversarial Typo Attack ($\texttt{ATA}$) algorithm that iteratively samples typos for words that are important to the query and selects the edit that is most likely to succeed in attacking.
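A greedy sketch of that iterative loop; `importance` and `attack_score` are hypothetical stand-ins for the paper's word-importance estimate and attack-success score, and adjacent-character swaps stand in for its typo space.
```python
import random

def make_typo(word: str) -> str:
    """One simple typo class: swap two adjacent characters."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def typo_attack(query: str, importance, attack_score, budget: int = 3) -> str:
    words = query.split()
    for _ in range(budget):
        # Pick the word the importance estimate ranks highest ...
        target = max(range(len(words)), key=lambda i: importance(words[i]))
        # ... sample candidate typos and keep the most promising edit.
        candidates = [make_typo(words[target]) for _ in range(8)]
        words[target] = max(candidates, key=attack_score)
    return " ".join(words)
```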
arXiv Detail & Related papers (2024-11-08T05:54:05Z) - GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching [0.0]
GPT Semantic Cache is a method that leverages semantic caching of query embeddings in in-memory storage (Redis).
Our approach efficiently identifies semantically similar questions, allowing for the retrieval of pre-generated responses without redundant API calls to the Large Language Models.
This technique reduces operational costs and improves response times, enhancing the efficiency of LLM-powered applications.
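A minimal sketch of the caching loop; the paper stores embeddings in Redis, but a plain in-memory list stands in here, and `embed`, `call_llm`, and the 0.92 threshold are hypothetical placeholders.
```python
import numpy as np

class SemanticCache:
    def __init__(self, embed, call_llm, threshold: float = 0.92):
        self.embed, self.call_llm, self.threshold = embed, call_llm, threshold
        self.entries = []  # list of (unit-norm embedding, cached response)

    def query(self, question: str) -> str:
        q = self.embed(question)
        q = q / np.linalg.norm(q)
        for e, response in self.entries:
            if float(q @ e) >= self.threshold:  # cosine-similarity hit
                return response                 # skip the redundant API call
        response = self.call_llm(question)      # miss: pay for one LLM call
        self.entries.append((q, response))
        return response
```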
arXiv Detail & Related papers (2024-11-08T02:21:19Z) - Differential Privacy of Cross-Attention with Provable Guarantee [18.331374727331077]
We design a novel differential privacy (DP) data structure to address the privacy security of cross-attention with a theoretical guarantee.
Our result is robust to adaptive queries in which users can intentionally attack the cross-attention system.
arXiv Detail & Related papers (2024-07-20T01:02:27Z) - Efficient Contextual LLM Cascades through Budget-Constrained Policy Learning [31.972053219549757]
TREACLE is a reinforcement learning policy that jointly selects the model and prompting scheme while respecting the user's monetary cost and latency constraints.
Our evaluations show that TREACLE enables cost savings of up to 85% compared to baselines, while maintaining high accuracy.
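TREACLE learns this choice with RL; the sketch below is only a greedy budget-respecting cascade to make the selection problem concrete. Model names, costs, latencies, and the `ask`/`confident` hooks are illustrative assumptions.
```python
MODELS = [  # cheapest first; numbers are made up for illustration
    {"name": "small-llm",  "cost": 0.001, "latency": 0.2},
    {"name": "medium-llm", "cost": 0.010, "latency": 0.8},
    {"name": "large-llm",  "cost": 0.060, "latency": 2.5},
]

def answer(query, ask, confident, budget: float, deadline: float):
    spent = elapsed = 0.0
    reply = None
    for m in MODELS:  # escalate only while cost/latency constraints allow
        if spent + m["cost"] > budget or elapsed + m["latency"] > deadline:
            break
        reply = ask(m["name"], query)
        spent += m["cost"]
        elapsed += m["latency"]
        if confident(reply):
            return reply
    return reply  # best effort under the constraints
```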
arXiv Detail & Related papers (2024-04-17T05:56:49Z) - Optimizing LLM Queries in Relational Workloads [58.254894049950366]
We show how to optimize Large Language Models (LLMs) inference for analytical workloads that invoke LLMs within relational queries.
We implement these optimizations in Apache Spark, with vLLM as the model serving backend.
We achieve up to 4.4x improvement in end-to-end latency on a benchmark of diverse LLM-based queries on real datasets.
arXiv Detail & Related papers (2024-03-09T07:01:44Z) - HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one billion parameter model, HiRE, applied to both the softmax and the feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
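A sketch of component (i): cheap approximate scores select a candidate set with headroom for recall, then exact computation is restricted to that subset. Here U @ V stands in for a real low-rank sketch of W (e.g. from a truncated SVD); random matrices are used only to keep the snippet self-contained.
```python
import numpy as np

d, n, rank, k = 256, 10000, 16, 8
W = np.random.randn(n, d).astype(np.float32)     # full weight rows
U = np.random.randn(n, rank).astype(np.float32)  # in practice U @ V ~ W,
V = np.random.randn(rank, d).astype(np.float32)  # e.g. via truncated SVD

def approx_topk(x, overshoot: int = 4):
    cheap = U @ (V @ x)                 # O(n * rank) approximate scores
    cand = np.argpartition(-cheap, overshoot * k)[: overshoot * k]
    exact = W[cand] @ x                 # full computation on the subset only
    return cand[np.argsort(-exact)[:k]]

top_ids = approx_topk(np.random.randn(d).astype(np.float32))
```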
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - Robust Approximation Algorithms for Non-monotone $k$-Submodular Maximization under a Knapsack Constraint [0.0]
Two deterministic approximation algorithms are presented for the problem of non-monotone $k$-submodular maximization under a knapsack constraint.
Our algorithms provide constant approximation ratios within only $O(nk)$ query complexity for the non-monotone objective.
arXiv Detail & Related papers (2023-09-21T12:42:52Z) - JoinGym: An Efficient Query Optimization Environment for Reinforcement Learning [58.71541261221863]
Join order selection (JOS) is the problem of ordering join operations to minimize total query execution cost.
We present JoinGym, a query optimization environment for reinforcement learning (RL) over bushy join plans.
Under the hood, JoinGym simulates a query plan's cost by looking up intermediate result cardinalities from a pre-computed dataset.
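A toy sketch of that cost model: each join is charged the size of its intermediate result, looked up from a precomputed cardinality table. The dict below is a made-up stand-in for JoinGym's dataset, and only left-deep orders are handled.
```python
CARDINALITIES = {  # frozenset of joined base tables -> row count
    frozenset({"A"}): 1_000,
    frozenset({"B"}): 5_000,
    frozenset({"C"}): 200,
    frozenset({"A", "B"}): 8_000,
    frozenset({"A", "B", "C"}): 3_000,
}

def plan_cost(join_order):
    """Sum intermediate-result sizes for a left-deep join order."""
    joined, cost = frozenset({join_order[0]}), 0
    for table in join_order[1:]:
        joined = joined | {table}
        cost += CARDINALITIES[joined]  # no execution, just a lookup
    return cost

assert plan_cost(["A", "B", "C"]) == 8_000 + 3_000
```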
arXiv Detail & Related papers (2023-07-21T17:00:06Z) - On Optimal Caching and Model Multiplexing for Large Model Inference [66.50550915522551]
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges.
We study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
arXiv Detail & Related papers (2023-06-03T05:01:51Z) - Private Query Release via the Johnson-Lindenstrauss Transform [93.20051580730234]
We introduce a new method for releasing answers to statistical queries with differential privacy.
The key idea is to randomly project the query answers to a lower dimensional space.
We answer the projected queries using a simple noise-adding mechanism, and lift the answers up to the original dimension.
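The three steps read directly as code; a minimal sketch follows, with a placeholder noise scale rather than the paper's calibrated value.
```python
import numpy as np

def jl_private_release(answers, proj_dim: int, noise_scale: float,
                       rng=np.random):
    d = answers.shape[0]
    # Random projection of the query answers down to proj_dim dimensions.
    P = rng.normal(0.0, 1.0 / np.sqrt(proj_dim), size=(proj_dim, d))
    # Simple noise-adding mechanism in the projected space.
    noisy = P @ answers + rng.normal(0.0, noise_scale, size=proj_dim)
    # Lift the answers back up to the original dimension.
    return P.T @ noisy

release = jl_private_release(np.ones(1000), proj_dim=64, noise_scale=0.1)
```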
arXiv Detail & Related papers (2022-08-15T19:19:16Z) - Session-Aware Query Auto-completion using Extreme Multi-label Ranking [61.753713147852125]
We take the novel approach of modeling session-aware query auto-completion as an eXtreme Multi-label Ranking (XMR) problem.
We adapt a popular XMR algorithm for this purpose by proposing several modifications to the key steps in the algorithm.
Our approach meets the stringent latency requirements for auto-complete systems while leveraging session information in making suggestions.
arXiv Detail & Related papers (2020-12-09T17:56:22Z) - Binary Classification with XOR Queries: Fundamental Limits and An Efficient Algorithm [15.127728811011245]
We consider a query-based data acquisition problem for binary classification of unknown labels.
We consider an effective query type that asks for the "group attribute" of a chosen subset of objects.
We propose an efficient inference algorithm that achieves this limit even when the noise parameters are unknown.
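A toy model of the query type (not the paper's inference algorithm): the oracle returns the parity of a chosen subset's binary labels, flipped with some noise probability, which is an illustrative assumption here.
```python
import random

def xor_query(labels, subset, flip_prob: float = 0.1) -> int:
    """Noisy 'group attribute': parity of the subset's labels."""
    parity = sum(labels[i] for i in subset) % 2
    return parity ^ (random.random() < flip_prob)  # flip with prob flip_prob

labels = [0, 1, 1, 0, 1]
print(xor_query(labels, subset=[0, 2, 4]))  # noisy parity of labels 0, 2, 4
```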
arXiv Detail & Related papers (2020-01-31T11:23:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.