EchoLM: Accelerating LLM Serving with Real-time Knowledge Distillation
- URL: http://arxiv.org/abs/2501.12689v2
- Date: Fri, 24 Jan 2025 19:13:12 GMT
- Title: EchoLM: Accelerating LLM Serving with Real-time Knowledge Distillation
- Authors: Yifan Yu, Yu Gan, Lillian Tsai, Nikhil Sarda, Jiaming Shen, Yanqi Zhou, Arvind Krishnamurthy, Fan Lai, Henry M. Levy, David Culler
- Abstract summary: We introduce EchoLM, an in-context caching system that leverages historical requests as examples to guide response generation.
We show that EchoLM improves serving throughput by 1.4-5.9x while reducing latency by 28-71% without hurting response quality.
- Score: 19.399404969760017
- Abstract: Large language models (LLMs) have excelled in various applications, yet serving them at scale is challenging due to their substantial resource demands and high latency. Our real-world studies reveal that over 60% of user requests to LLMs have semantically similar counterparts, suggesting the potential for knowledge sharing among requests. However, naively caching and reusing past responses leads to large quality degradation. In this paper, we introduce EchoLM, an in-context caching system that leverages historical requests as examples to guide response generation, enabling selective offloading of requests to more efficient LLMs. However, enabling this real-time knowledge transfer leads to intricate tradeoffs between response quality, latency, and system throughput at scale. For a new request, EchoLM identifies similar, high-utility examples and efficiently prepends them to the input for a better response. At scale, EchoLM adaptively routes requests to LLMs of varying capabilities, accounting for response quality and serving loads. EchoLM employs a cost-aware cache replay mechanism to improve example quality and coverage offline, maximizing cache utility and runtime efficiency. Evaluations on millions of open-source requests demonstrate that EchoLM improves serving throughput by 1.4-5.9x and reduces latency by 28-71% without hurting response quality on average.
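To make the mechanism concrete, the sketch below illustrates the two ideas in the abstract: retrieving semantically similar past request/response pairs as in-context examples, and routing requests that are well covered by the cache to a cheaper model. It is a minimal illustration under stated assumptions, not EchoLM's implementation: the bag-of-words embedding, the 0.5 similarity threshold, and the "small-llm"/"large-llm" names are placeholders.

```python
# Minimal sketch of in-context caching with quality-aware routing (illustrative only).
from collections import Counter
from dataclasses import dataclass, field
from math import sqrt


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a neural encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class ExampleCache:
    entries: list = field(default_factory=list)  # (embedding, request, response)

    def add(self, request: str, response: str) -> None:
        self.entries.append((embed(request), request, response))

    def top_k(self, request: str, k: int = 2) -> list:
        """Return the k most similar cached (similarity, request, response) triples."""
        q = embed(request)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [(cosine(q, e[0]), e[1], e[2]) for e in ranked[:k]]


def route_and_prompt(cache: ExampleCache, request: str, sim_threshold: float = 0.5):
    """Prepend similar cached examples; send well-covered requests to a cheaper model."""
    examples = cache.top_k(request)
    covered = bool(examples) and examples[0][0] >= sim_threshold
    model = "small-llm" if covered else "large-llm"  # placeholder model names
    shots = "".join(f"Q: {q}\nA: {a}\n\n" for _, q, a in examples)
    return model, f"{shots}Q: {request}\nA:"


if __name__ == "__main__":
    cache = ExampleCache()
    cache.add("How do I reset my password?",
              "Use the 'Forgot password' link on the login page.")
    model, prompt = route_and_prompt(cache, "How can I reset my account password?")
    print(model)   # near-duplicate request, so it is offloaded to the cheaper model
    print(prompt)
```

A production system would replace the toy embedding with a neural encoder, score cached examples by utility as well as similarity, and fold current serving load into the routing decision, as the abstract describes.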
Related papers
- Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization [61.02719787737867]
Large language models (LLMs) are increasingly deployed and democratized on edge devices.
One promising solution is uncertainty-based SLM routing, which offloads high-stakes queries to stronger LLMs when the SLM produces low-confidence responses.
We conduct a comprehensive investigation into benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs across 1500+ settings.
arXiv Detail & Related papers (2025-02-06T18:59:11Z) - Exploring Knowledge Boundaries in Large Language Models for Retrieval Judgment [56.87031484108484]
Large Language Models (LLMs) are increasingly recognized for their practical applications.
Retrieval-Augmented Generation (RAG) tackles the limits of LLMs' internal knowledge and has shown a significant impact on LLMs.
By minimizing retrieval requests that yield neutral or harmful results, we can effectively reduce both time and computational costs.
arXiv Detail & Related papers (2024-11-09T15:12:28Z) - GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching [0.0]
GPT Semantic Cache is a method that leverages semantic caching of query embeddings in in-memory storage (Redis).
By storing user queries, our approach efficiently identifies semantically similar questions, allowing for the retrieval of pre-generated responses without redundant API calls to the Large Language Models.
Our experiments demonstrate that GPT Semantic Cache reduces API calls by up to 68.8% across various query categories, with cache hit rates ranging from 61.6% to 68.8%.
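As a concrete illustration of the pattern this entry describes, the sketch below embeds each query, looks for a previously answered query whose embedding is close enough, and returns the stored response instead of issuing a new LLM API call. The paper stores embeddings in Redis; here a plain in-memory list stands in, and the hash-based toy embedding and the 0.8 similarity threshold are illustrative assumptions rather than the paper's components.

```python
# Minimal in-memory sketch of semantic response caching (illustrative only).
import hashlib
import math


def toy_embedding(text: str, dim: int = 64) -> list:
    """Deterministic stand-in for a real sentence-embedding model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class SemanticCache:
    """In-memory stand-in for the Redis-backed embedding cache described above."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached response) pairs

    def lookup(self, query: str):
        """Return a cached response for a semantically similar query, else None."""
        q = toy_embedding(query)
        best_sim, best_response = 0.0, None
        for emb, response in self.entries:
            sim = sum(a * b for a, b in zip(q, emb))  # cosine; vectors are unit-norm
            if sim > best_sim:
                best_sim, best_response = sim, response
        return best_response if best_sim >= self.threshold else None

    def store(self, query: str, response: str) -> None:
        self.entries.append((toy_embedding(query), response))


if __name__ == "__main__":
    cache = SemanticCache()
    cache.store("What is the capital of France?", "Paris.")
    # A case-insensitive near-duplicate hits the cache and skips the LLM API call.
    print(cache.lookup("what is the capital of France?"))
```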
arXiv Detail & Related papers (2024-11-08T02:21:19Z) - ALISE: Accelerating Large Language Model Serving with Speculative Scheduling [7.367068885621016]
Large Language Models (LLMs) represent a revolutionary advancement in the contemporary landscape of artificial general intelligence (AGI).
In this paper, we propose a new efficient LLM inference serving framework, named ALISE.
We show that ALISE improves the throughput of inference serving by up to 1.8x and 2.1x under the same latency constraint on the Alpaca and ShareGPT datasets, respectively.
arXiv Detail & Related papers (2024-10-31T00:58:11Z) - AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs [53.6200736559742]
AGENT-CQ consists of two stages: a generation stage and an evaluation stage.
CrowdLLM simulates human crowdsourcing judgments to assess generated questions and answers.
Experiments on the ClariQ dataset demonstrate CrowdLLM's effectiveness in evaluating question and answer quality.
arXiv Detail & Related papers (2024-10-25T17:06:27Z) - A Decoding Acceleration Framework for Industrial Deployable LLM-based Recommender Systems [49.588316022381385]
We propose a Decoding Acceleration Framework for LLM-based Recommendation (dubbed DARE), with Customized Retrieval Pool to improve retrieval efficiency and Relaxed Verification to increase the acceptance rate of draft tokens.
DARE has been deployed to online advertising scenarios within a large-scale commercial environment, achieving a 3.45x speedup while maintaining the downstream performance.
arXiv Detail & Related papers (2024-08-11T02:31:13Z) - VELO: A Vector Database-Assisted Cloud-Edge Collaborative LLM QoS Optimization Framework [10.716259527813522]
Large language models (LLMs) have gained significant popularity and are extensively utilized across various domains.
Most LLM deployments occur within cloud data centers, where they encounter substantial response delays and incur high costs.
Leveraging vector database caching to store LLM request results at the edge can substantially mitigate the response delays and costs associated with similar requests.
arXiv Detail & Related papers (2024-06-19T09:41:37Z) - Llumnix: Dynamic Scheduling for Large Language Model Serving [17.919408899409113]
Inference serving for large language models (LLMs) is the key to unleashing their potential.
We introduce Llumnix, an LLM serving system that reacts to such heterogeneous and unpredictable requests by runtime rescheduling.
We show that Llumnix improves tail latencies by an order of magnitude, accelerates high-priority requests by up to 1.5x, and delivers up to 36% cost savings.
arXiv Detail & Related papers (2024-06-05T13:20:18Z) - Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [57.974103113675795]
We present a method named 'Rephrase and Respond' (RaR), which allows Large Language Models to rephrase and expand questions posed by humans.
RaR serves as a simple yet effective prompting method for improving performance.
We show that RaR is complementary to the popular Chain-of-Thought (CoT) methods, both theoretically and empirically.
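The prompting pattern is simple enough to sketch: wrap the raw question in an instruction asking the model to rephrase and expand it before answering, all in one call. The instruction wording below is a paraphrase of the idea, not necessarily the paper's exact prompt, and `generate` is a placeholder for a real LLM client.

```python
# Hedged sketch of a one-step Rephrase-and-Respond style prompt (illustrative only).
def rar_prompt(question: str) -> str:
    """Wrap a raw user question in a rephrase-then-answer instruction."""
    return (
        f'"{question}"\n'
        "Rephrase and expand the question, and respond."
    )


def generate(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. an API client); returns a dummy string."""
    return f"[model output for prompt: {prompt[:40]}...]"


if __name__ == "__main__":
    print(generate(rar_prompt("Was Lincoln born in an even month?")))
```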
arXiv Detail & Related papers (2023-11-07T18:43:34Z) - Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes LLM-Augmenter, a system that augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)