Related papers: VELO: A Vector Database-Assisted Cloud-Edge Collaborative LLM QoS Optimization Framework

VELO: A Vector Database-Assisted Cloud-Edge Collaborative LLM QoS Optimization Framework

URL: http://arxiv.org/abs/2406.13399v1
Date: Wed, 19 Jun 2024 09:41:37 GMT
Title: VELO: A Vector Database-Assisted Cloud-Edge Collaborative LLM QoS Optimization Framework
Authors: Zhi Yao, Zhiqing Tang, Jiong Lou, Ping Shen, Weijia Jia,
Abstract summary: Large Language Model (LLM) has gained significant popularity and is extensively utilized across various domains. Most LLM deployments occur within cloud data centers, where they encounter substantial response delays and incur high costs. Leveraging vector database caching to store LLM request results at the edge can substantially mitigate response delays and cost associated with similar requests.
Score: 10.716259527813522
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Large Language Model (LLM) has gained significant popularity and is extensively utilized across various domains. Most LLM deployments occur within cloud data centers, where they encounter substantial response delays and incur high costs, thereby impacting the Quality of Services (QoS) at the network edge. Leveraging vector database caching to store LLM request results at the edge can substantially mitigate response delays and cost associated with similar requests, which has been overlooked by previous research. Addressing these gaps, this paper introduces a novel Vector database-assisted cloud-Edge collaborative LLM QoS Optimization (VELO) framework. Firstly, we propose the VELO framework, which ingeniously employs vector database to cache the results of some LLM requests at the edge to reduce the response time of subsequent similar requests. Diverging from direct optimization of the LLM, our VELO framework does not necessitate altering the internal structure of LLM and is broadly applicable to diverse LLMs. Subsequently, building upon the VELO framework, we formulate the QoS optimization problem as a Markov Decision Process (MDP) and devise an algorithm grounded in Multi-Agent Reinforcement Learning (MARL) to decide whether to request the LLM in the cloud or directly return the results from the vector database at the edge. Moreover, to enhance request feature extraction and expedite training, we refine the policy network of MARL and integrate expert demonstrations. Finally, we implement the proposed algorithm within a real edge system. Experimental findings confirm that our VELO framework substantially enhances user satisfaction by concurrently diminishing delay and resource consumption for edge users utilizing LLMs.

Related papers

LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression. LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
arXiv Detail & Related papers (2025-02-15T02:55:22Z)
Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization [61.02719787737867]
Large language models (LLMs) are increasingly deployed and democratized on edge devices. One promising solution is uncertainty-based SLM routing, offloading high-stakes queries to stronger LLMs when resulting in low-confidence responses on SLM. We conduct a comprehensive investigation into benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs over 1500+ settings.
arXiv Detail & Related papers (2025-02-06T18:59:11Z)
AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning [36.37717583840935]
We propose a novel LLM utilization paradigm that facilitates the collaborative operation of large cloud-based LLMs and smaller local-deployed LLMs. Our framework comprises two primary modules: the local agent instantiated with a relatively smaller LLM, and the cloud agent equipped with a larger LLM. This collaborative processing is enabled through an adaptive mechanism where the local agent introspectively identifies errors and proactively seeks assistance from the cloud agent.
arXiv Detail & Related papers (2024-10-17T03:07:37Z)
A Decoding Acceleration Framework for Industrial Deployable LLM-based Recommender Systems [49.588316022381385]
We propose a Decoding Acceleration Framework for LLM-based Recommendation (dubbed DARE), with Customized Retrieval Pool to improve retrieval efficiency and Relaxed Verification to increase the acceptance rate of draft tokens. DARE has been deployed to online advertising scenarios within a large-scale commercial environment, achieving a 3.45x speedup while maintaining the downstream performance.
arXiv Detail & Related papers (2024-08-11T02:31:13Z)
ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency [20.33467627548677]
Large language models (LLMs) have surged in popularity and are extensively used in commercial applications. We conduct a detailed analysis to identify major bottlenecks that impact end-to-end latency in LLM serving systems. We then propose ScaleLLM, an optimized system for resource-efficient LLM serving.
arXiv Detail & Related papers (2024-07-23T23:37:29Z)
Mobile Edge Intelligence for Large Language Models: A Contemporary Survey [32.22789677882933]
On-device large language models (LLMs) are more cost-effective, latency-efficient, and privacy-preserving compared with the cloud paradigm. Mobile edge intelligence (MEI) presents a viable solution by provisioning AI capabilities at the edge of mobile networks. This article provides a contemporary survey on harnessing MEI for LLMs.
arXiv Detail & Related papers (2024-07-09T13:47:05Z)
Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capacities on various tasks, and integrating the capacities of LLMs into the Internet of Things (IoT) applications has drawn much research attention recently. Due to security concerns, many institutions avoid accessing state-of-the-art commercial LLM services, requiring the deployment and utilization of open-source LLMs in a local network setting. We propose a LLM-based Generative IoT (GIoT) system deployed in the local network setting in this study.
arXiv Detail & Related papers (2024-06-14T19:24:00Z)
One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving [2.9164564021428845]
We propose a multi-model queue management framework for large language models (LLMs) serving. QLM orchestrates the actions of multiple LLM Serving Operations (LSOs) to reduce HOL blocking and maximize attainment. Evaluation on heterogeneous GPU devices and models with real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400%.
arXiv Detail & Related papers (2024-06-05T21:17:34Z)
Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [70.09561665520043]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans. We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems. Experiments on Over-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents.
arXiv Detail & Related papers (2024-05-23T08:33:19Z)
Optimizing LLM Queries in Relational Workloads [58.254894049950366]
We show how to optimize Large Language Models (LLMs) inference for analytical workloads that invoke LLMs within relational queries. We implement these optimizations in Apache Spark, with vLLM as the model serving backend. We achieve up to 4.4x improvement in end-to-end latency on a benchmark of diverse LLM-based queries on real datasets.
arXiv Detail & Related papers (2024-03-09T07:01:44Z)
FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning [70.38817963253034]
This paper first discusses these challenges of federated fine-tuning LLMs, and introduces our package FS-LLM as a main contribution. We provide comprehensive federated parameter-efficient fine-tuning algorithm implementations and versatile programming interfaces for future extension in FL scenarios. We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings.
arXiv Detail & Related papers (2023-09-01T09:40:36Z)
LaGR-SEQ: Language-Guided Reinforcement Learning with Sample-Efficient Querying [71.86163159193327]
Large language models (LLMs) have recently demonstrated their impressive ability to provide context-aware responses via text. This ability could potentially be used to predict plausible solutions in sequential decision making tasks pertaining to pattern completion. We introduce LaGR, which uses this predictive ability of LLMs to propose solutions to tasks that have been partially completed by a primary reinforcement learning (RL) agent.
arXiv Detail & Related papers (2023-08-21T02:07:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.