DeServe: Towards Affordable Offline LLM Inference via Decentralization
- URL: http://arxiv.org/abs/2501.14784v1
- Date: Sat, 04 Jan 2025 02:10:50 GMT
- Title: DeServe: Towards Affordable Offline LLM Inference via Decentralization
- Authors: Linyu Wu, Xiaoyuan Liu, Tianneng Shi, Zhe Ye, Dawn Song
- Abstract summary: This paper presents the design of a decentralized offline serving system for large language model (LLM) inference. Utilizing idle GPU resources, our proposed system, DeServe, decentralizes access to LLMs at a lower cost. Experiments demonstrate that DeServe achieves a 6.7x-12.6x improvement in throughput over existing serving system baselines in such conditions.
- Score: 42.8973830120059
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid growth of generative AI and its integration into everyday workflows have significantly increased the demand for large language model (LLM) inference services. While proprietary models remain popular, recent advancements in open-source LLMs have positioned them as strong contenders. However, deploying these models is often constrained by the high costs and limited availability of GPU resources. In response, this paper presents the design of a decentralized offline serving system for LLM inference. Utilizing idle GPU resources, our proposed system, DeServe, decentralizes access to LLMs at a lower cost. DeServe specifically addresses key challenges in optimizing serving throughput in high-latency network environments. Experiments demonstrate that DeServe achieves a 6.7x-12.6x improvement in throughput over existing serving system baselines in such conditions.
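To give a rough intuition for why serving throughput suffers in high-latency network environments (the setting the abstract highlights), the sketch below models a pipeline of model stages hosted on decentralized peers and shows how batching amortizes link latency. This is not DeServe's actual algorithm; the stage count, per-token compute time, link latency, and the constant-compute-per-batch simplification are all illustrative assumptions.

```python
# Back-of-envelope sketch (not DeServe's algorithm): decode throughput for a
# model split into pipeline stages across decentralized peers, where each
# stage boundary adds one network hop. All numbers are assumed for illustration.

def tokens_per_second(batch_size: int,
                      num_stages: int = 4,              # assumed pipeline stages across peers
                      compute_ms_per_token: float = 5.0,  # assumed per-stage compute time
                      link_latency_ms: float = 50.0) -> float:  # assumed one-way WAN latency
    """Estimate decode throughput for one pipelined forward pass per token."""
    # Each generated token traverses every stage once; every stage boundary
    # adds a network hop on top of the compute time.
    per_token_ms = num_stages * compute_ms_per_token + (num_stages - 1) * link_latency_ms
    # With batching, the same round trip yields batch_size tokens, so the
    # network latency is amortized across the batch (compute per batch is
    # assumed constant here for simplicity).
    return batch_size * 1000.0 / per_token_ms

if __name__ == "__main__":
    for bs in (1, 8, 32, 128):
        print(f"batch={bs:4d}  ~{tokens_per_second(bs):8.1f} tokens/s")
```

Under these assumed numbers, a single-sequence batch achieves only a few tokens per second because most of the per-token time is spent on network hops, while larger batches recover most of the lost throughput; this is one plausible reason a high-latency-aware scheduler matters, not a description of the paper's specific technique.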
Related papers
- GenTorrent: Scaling Large Language Model Serving with An Overlay Network [35.05892538683356]
We propose GenTorrent, an LLM serving overlay that harnesses computing resources from decentralized contributors.
We identify four key research problems inherent to enabling such a decentralized infrastructure.
We believe this work pioneers a new direction for democratizing and scaling future AI serving capabilities.
arXiv Detail & Related papers (2025-04-27T01:08:25Z) - Autellix: An Efficient Serving Engine for LLM Agents as General Programs [59.673243129044465]
Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs.
Existing LLM serving systems ignore dependencies between programs and calls, missing significant opportunities for optimization.
We introduce Autellix, an LLM serving system that treats programs as first-class citizens to minimize their end-to-end latencies.
arXiv Detail & Related papers (2025-02-19T18:59:30Z) - ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving [15.01982917560918]
This paper proposes to harvest stranded GPU resources for offline LLM inference tasks.
We built ConServe, an LLM serving system whose execution engine preempts running offline tasks.
Our evaluation demonstrates that ConServe achieves strong performance isolation when co-serving online and offline tasks.
arXiv Detail & Related papers (2024-10-02T04:12:13Z) - ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency [20.33467627548677]
Large language models (LLMs) have surged in popularity and are extensively used in commercial applications.
We conduct a detailed analysis to identify major bottlenecks that impact end-to-end latency in LLM serving systems.
We then propose ScaleLLM, an optimized system for resource-efficient LLM serving.
arXiv Detail & Related papers (2024-07-23T23:37:29Z) - MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development.
This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z) - SpotServe: Serving Generative Large Language Models on Preemptible Instances [64.18638174004151]
SpotServe is the first distributed large language models serving system on preemptible instances.
We show that SpotServe can reduce the P99 tail latency by 2.4x-9.1x compared with the best existing LLM serving systems.
We also show that SpotServe can leverage the price advantage of preemptible instances, saving 54% in monetary cost compared with using only on-demand instances.
arXiv Detail & Related papers (2023-11-27T06:31:17Z) - Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - POLCA: Power Oversubscription in LLM Cloud Providers [0.8299593158757622]
Large language models (LLMs) are becoming increasingly power intensive.
We show that there is a significant opportunity to oversubscribe power in LLM clusters.
We propose POLCA, our framework for power oversubscription that is robust, reliable, and readily deployable for GPU clusters.
arXiv Detail & Related papers (2023-08-24T16:32:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.