GenTorrent: Scaling Large Language Model Serving with An Overlay Network
- URL: http://arxiv.org/abs/2504.20101v2
- Date: Wed, 30 Apr 2025 21:24:19 GMT
- Title: GenTorrent: Scaling Large Language Model Serving with An Overlay Network
- Authors: Fei Fang, Yifan Hua, Shengze Wang, Ruilin Zhou, Yi Liu, Chen Qian, Xiaoxue Zhang
- Abstract summary: We propose GenTorrent, an LLM serving overlay that harnesses computing resources from decentralized contributors. We identify four key research problems inherent to enabling such a decentralized infrastructure. We believe this work pioneers a new direction for democratizing and scaling future AI serving capabilities.
- Score: 35.05892538683356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While significant progress has been made in research and development on open-source and cost-efficient large-language models (LLMs), serving scalability remains a critical challenge, particularly for small organizations and individuals seeking to deploy and test their LLM innovations. Inspired by peer-to-peer networks that leverage decentralized overlay nodes to increase throughput and availability, we propose GenTorrent, an LLM serving overlay that harnesses computing resources from decentralized contributors. We identify four key research problems inherent to enabling such a decentralized infrastructure: 1) overlay network organization; 2) LLM communication privacy; 3) overlay forwarding for resource efficiency; and 4) verification of serving quality. This work presents the first systematic study of these fundamental problems in the context of decentralized LLM serving. Evaluation results from a prototype implemented on a set of decentralized nodes demonstrate that GenTorrent achieves a latency reduction of over 50% compared to the baseline design without overlay forwarding. Furthermore, the security features introduce minimal overhead to serving latency and throughput. We believe this work pioneers a new direction for democratizing and scaling future AI serving capabilities.
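The abstract attributes its latency gains to overlay forwarding, i.e. routing each request through a well-chosen contributor node rather than a fixed endpoint. GenTorrent's actual forwarding algorithm is not given here; as a rough illustration of the latency-aware idea, the toy sketch below probes a set of hypothetical overlay nodes and forwards to the one with the lowest measured latency. The node names, the `measure_latency` stub, and the selection rule are all assumptions for illustration, not the paper's design.

```python
import random


def measure_latency(node: str, rng: random.Random) -> float:
    """Stand-in for a real RTT probe; returns a simulated latency in ms."""
    return rng.uniform(20.0, 300.0)


def pick_forwarding_node(nodes: list[str], seed: int = 0) -> str:
    """Forward to the overlay node with the lowest probed latency.

    A real system would also weigh node load, privacy constraints, and
    serving-quality verification, which the paper lists as open problems.
    """
    rng = random.Random(seed)
    latencies = {n: measure_latency(n, rng) for n in nodes}
    return min(latencies, key=latencies.get)
```

With a fixed seed the probe is deterministic, so repeated calls route a request to the same contributor; a production design would refresh latency estimates continuously.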
Related papers
- LLM-Net: Democratizing LLMs-as-a-Service through Blockchain-based Expert Networks [1.3846014191157405]
This paper introduces LLM Networks (LLM-Net), a blockchain-based framework that democratizes Large Language Models (LLMs) as a service. By leveraging collective computational resources and distributed domain expertise, LLM-Net incorporates fine-tuned expert models for various specific domains. Our simulation, built on top of state-of-the-art LLMs such as Claude 3.5 Sonnet, Llama 3.1, Grok-2, and GPT-4o, validates the effectiveness of the reputation-based mechanism in maintaining service quality.
arXiv Detail & Related papers (2025-01-13T12:56:05Z)
- DeServe: Towards Affordable Offline LLM Inference via Decentralization [42.8973830120059]
This paper presents the design of a decentralized offline serving system for large language model (LLM) inference. Utilizing idle GPU resources, our proposed system, DeServe, decentralizes access to LLMs at a lower cost. Experiments demonstrate that DeServe achieves a 6.7x-12.6x improvement in throughput over existing serving system baselines in such conditions.
arXiv Detail & Related papers (2025-01-04T02:10:50Z)
- AI Flow at the Network Edge [58.31090055138711]
AI Flow is a framework that streamlines the inference process by jointly leveraging the heterogeneous resources available across devices, edge nodes, and cloud servers. This article serves as a position paper for identifying the motivation, challenges, and principles of AI Flow.
arXiv Detail & Related papers (2024-11-19T12:51:17Z)
- Decentralized AI: Permissionless LLM Inference on POKT Network [8.68822221491139]
POKT Network's decentralized Remote Procedure Call infrastructure has surpassed 740 billion requests since launching on MainNet in 2020.
This litepaper illustrates how the network's open-source and permissionless design aligns incentives among model researchers, hardware operators, API providers and users.
arXiv Detail & Related papers (2024-05-30T19:50:07Z)
- MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development.
This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource-constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z)
- Evaluation of a blockchain-enabled resource management mechanism for NGNs [0.0]
This paper examines the use of blockchain technology for resource management and negotiation among Network Providers (NPs).
The implementation of the resource management mechanism is described in a Smart Contract (SC), and the testbeds use the Raft and IBFT consensus mechanisms, respectively.
arXiv Detail & Related papers (2022-11-01T13:40:26Z)
- Decentralized Control with Graph Neural Networks [147.84766857793247]
We propose a novel framework using graph neural networks (GNNs) to learn decentralized controllers.
GNNs are well-suited for the task since they are naturally distributed architectures and exhibit good scalability and transferability properties.
The problems of flocking and multi-agent path planning are explored to illustrate the potential of GNNs in learning decentralized controllers.
arXiv Detail & Related papers (2020-12-29T18:59:14Z)
- When Deep Reinforcement Learning Meets Federated Learning: Intelligent Multi-Timescale Resource Management for Multi-access Edge Computing in 5G Ultra Dense Network [31.274279003934268]
We first propose an intelligent ultra-dense edge computing (I-UDEC) framework, which integrates blockchain and AI into 5G edge computing networks.
In order to achieve real-time and low-overhead computation offloading decisions and resource allocation strategies, we design a novel two-timescale deep reinforcement learning (2Ts-DRL) approach.
Our proposed algorithm can reduce task execution time by up to 31.87%.
arXiv Detail & Related papers (2020-09-22T15:08:00Z)
- Graph Neural Networks for Decentralized Controllers [171.6642679604005]
Dynamical systems comprised of autonomous agents arise in many relevant problems such as robotics, smart grids, or smart cities.
Optimal centralized controllers are readily available but face limitations in terms of scalability and practical implementation.
We propose a framework using graph neural networks (GNNs) to learn decentralized controllers from data.
arXiv Detail & Related papers (2020-03-23T13:51:18Z)
- Deep Learning for Ultra-Reliable and Low-Latency Communications in 6G Networks [84.2155885234293]
We first summarize how to apply data-driven supervised deep learning and deep reinforcement learning in URLLC.
To address these open problems, we develop a multi-level architecture that enables device intelligence, edge intelligence, and cloud intelligence for URLLC.
arXiv Detail & Related papers (2020-02-22T14:38:11Z)
This list is automatically generated from the titles and abstracts of the papers indexed on this site.
This site does not guarantee the quality or accuracy of the information presented and is not responsible for any consequences arising from its use.