Expert Router: Orchestrating Efficient Language Model Inference through Prompt Classification
- URL: http://arxiv.org/abs/2404.15153v1
- Date: Mon, 22 Apr 2024 16:33:42 GMT
- Title: Expert Router: Orchestrating Efficient Language Model Inference through Prompt Classification
- Authors: Josef Pichlmeier, Philipp Ross, Andre Luckow
- Abstract summary: Large Language Models (LLMs) have experienced widespread adoption across scientific and industrial domains.
We introduce Expert Router, a system designed to orchestrate multiple expert models efficiently.
Expert Router is a parallel inference system with a central routing gateway that distributes incoming requests.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have experienced widespread adoption across scientific and industrial domains due to their versatility and utility for diverse tasks. Nevertheless, deploying and serving these models at scale with optimal throughput and latency remains a significant challenge, primarily because of the high computational and memory demands associated with LLMs. To tackle this limitation, we introduce Expert Router, a system designed to orchestrate multiple expert models efficiently, thereby enhancing scalability. Expert Router is a parallel inference system with a central routing gateway that distributes incoming requests using a clustering method. This approach effectively partitions incoming requests among available LLMs, maximizing overall throughput. Our extensive evaluations encompassed up to 1,000 concurrent users, providing comprehensive insights into the system's behavior from user and infrastructure perspectives. The results demonstrate Expert Router's effectiveness in handling high-load scenarios and achieving higher throughput rates, particularly under many concurrent users.
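The paper describes, but does not publish, the routing gateway's implementation. Below is a minimal sketch of the stated mechanism, assuming TF-IDF features and k-means as the clustering method (both illustrative choices, not necessarily the paper's) and hypothetical expert endpoints:

```python
# Hypothetical routing gateway: cluster prompts, map each cluster to an
# expert endpoint. TF-IDF + k-means and the URLs are assumptions.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

class ExpertRouterGateway:
    def __init__(self, sample_prompts, expert_endpoints):
        # One expert per cluster, so k equals the number of endpoints.
        self.vectorizer = TfidfVectorizer()
        features = self.vectorizer.fit_transform(sample_prompts)
        self.clusterer = KMeans(n_clusters=len(expert_endpoints), n_init=10)
        self.clusterer.fit(features)
        self.endpoints = expert_endpoints

    def route(self, prompt: str) -> str:
        # Assign the incoming prompt to a cluster; return that expert's URL.
        features = self.vectorizer.transform([prompt])
        cluster = int(self.clusterer.predict(features)[0])
        return self.endpoints[cluster]

gateway = ExpertRouterGateway(
    ["prove this theorem", "write a short poem", "optimize this SQL query"],
    ["http://expert-0:8000", "http://expert-1:8000", "http://expert-2:8000"],
)
print(gateway.route("show that the series converges"))
```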
Related papers
- Large Language Models for Power Scheduling: A User-Centric Approach [6.335540414370735]
We introduce a novel architecture for resource scheduling problems by converting an arbitrary user's voice request (VRQ) into a resource allocation vector.
Specifically, we design three agents: an LLM intent-recognition agent that translates the request into an optimization problem (OP), an LLM OP parameter-identification agent, and an OP-solving agent.
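A minimal sketch of such a multi-agent pipeline; the `llm` callable and the prompt wording are stand-ins, not the paper's actual agents:

```python
# Hypothetical three-agent pipeline; `llm` stands in for any completion
# call, and the prompts are assumptions rather than the paper's prompts.
from typing import Callable

def schedule_from_voice_request(vrq_text: str, llm: Callable[[str], str]) -> str:
    # Agent 1: recognize intent, restate it as an optimization problem (OP).
    op = llm(f"Restate this power-scheduling request as an optimization problem:\n{vrq_text}")
    # Agent 2: identify the OP's concrete parameters (loads, limits, horizon).
    params = llm(f"Extract the numeric parameters of this problem:\n{op}")
    # Agent 3: solve the OP and emit the resource allocation vector.
    return llm(f"Solve and output the allocation vector.\nProblem: {op}\nParameters: {params}")
```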
arXiv Detail & Related papers (2024-06-29T15:47:28Z)
- Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified multimodal LLM (MLLM) with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z)
- WDMoE: Wireless Distributed Large Language Models with Mixture of Experts [65.57581050707738]
We propose a wireless distributed Large Language Model (LLM) paradigm based on Mixture of Experts (MoE).
We decompose the MoE layer in LLMs by deploying the gating network and the preceding neural network layer at the base station (BS) and the expert networks across mobile devices.
We design an expert selection policy that takes into account both model performance and end-to-end latency.
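A hedged sketch of such a selection policy, with an assumed linear trade-off between gating score and measured device latency (the paper's exact policy may differ):

```python
# Assumed linear utility: gating score (performance proxy) minus a
# latency penalty per expert-hosting device; names are illustrative.
def select_experts(gate_scores, latencies_ms, k=2, latency_weight=0.01):
    utilities = [
        (score - latency_weight * latency, idx)
        for idx, (score, latency) in enumerate(zip(gate_scores, latencies_ms))
    ]
    utilities.sort(reverse=True)          # best trade-off first
    return [idx for _, idx in utilities[:k]]

# Expert 1 outranks expert 0 despite a lower gate score, thanks to latency.
print(select_experts([0.60, 0.55, 0.10], [80.0, 5.0, 20.0], k=2))  # [1, 2]
```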
arXiv Detail & Related papers (2024-05-06T02:55:50Z)
- Multi Agent DeepRL based Joint Power and Subchannel Allocation in IAB networks [0.0]
Integrated Access and Backhauling (IAB) is a viable approach for meeting the unprecedented need for higher data rates in future network generations.
In this paper, we show how a Deep Q-Learning Network can handle the huge action spaces associated with fractional nodes.
arXiv Detail & Related papers (2023-08-31T21:30:25Z)
- Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z)
- PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel [19.24542340170026]
We introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training.
FSDP provides support for significantly larger models with near-linear scalability in terms of TFLOPS.
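A minimal FSDP usage sketch with the standard PyTorch API (launch with `torchrun`, one process per GPU; the Transformer model and hyperparameters are placeholders):

```python
# Standard PyTorch FSDP usage; run with `torchrun --nproc_per_node=N`.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = FSDP(torch.nn.Transformer().cuda())  # shards parameters across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

src = torch.rand(10, 32, 512, device="cuda")  # (seq, batch, d_model)
tgt = torch.rand(20, 32, 512, device="cuda")
loss = model(src, tgt).sum()
loss.backward()   # gradients are reduce-scattered across ranks
optimizer.step()  # each rank updates only its own parameter shard
```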
arXiv Detail & Related papers (2023-04-21T23:52:27Z)
- MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts and uses a gated routing network to activate experts conditionally.
However, as the number of experts grows, an MoE with an outrageously large parameter count suffers from overfitting and sparse data allocation.
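For context, here is a minimal sparsely gated MoE layer with top-1 routing; it illustrates the generic mechanism the abstract describes, not MoEC's cluster-based variant:

```python
# Generic top-1 sparsely gated MoE layer: the router activates one expert
# per token, so only a fraction of parameters is used per forward pass.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                            # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)     # routing probabilities
        weight, expert_idx = gate.max(dim=-1)        # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():                           # dispatch only assigned tokens
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

print(Top1MoE()(torch.randn(16, 64)).shape)  # torch.Size([16, 64])
```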
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
- User Clustering for Rate Splitting using Machine Learning [37.734460275850076]
A scalable and much lighter clustering mechanism based on a Neural Network (NN) is proposed.
The accuracy and performance metrics show that the NN can learn to cluster users from their noisy channel responses.
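An illustrative sketch of such a lightweight clusterer; the architecture, dimensions, noise model, and labels are assumptions for demonstration:

```python
# Illustrative only: a small MLP assigns users to clusters from noisy
# channel responses. All dimensions and targets here are assumptions.
import torch
import torch.nn as nn

n_users, n_clusters, channel_dim = 256, 4, 32
net = nn.Sequential(nn.Linear(channel_dim, 64), nn.ReLU(), nn.Linear(64, n_clusters))

h = torch.randn(n_users, channel_dim)               # clean channel responses
h_noisy = h + 0.1 * torch.randn_like(h)             # noisy observations
labels = torch.randint(0, n_clusters, (n_users,))   # placeholder cluster targets

loss = nn.CrossEntropyLoss()(net(h_noisy), labels)
loss.backward()  # one supervised step; training loop and optimizer omitted
```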
arXiv Detail & Related papers (2022-05-23T15:05:16Z)
- Distributed Deep Learning in Open Collaborations [49.240611132653456]
We propose a novel algorithmic framework designed specifically for collaborative training.
We demonstrate the effectiveness of our approach for SwAV and ALBERT pretraining in realistic conditions and achieve performance comparable to traditional setups at a fraction of the cost.
arXiv Detail & Related papers (2021-06-18T16:23:13Z)
- PinnerSage: Multi-Modal User Embedding Framework for Recommendations at Pinterest [54.56236567783225]
PinnerSage is an end-to-end recommender system that represents each user via multi-modal embeddings.
We conduct several offline and online A/B experiments to show that our method significantly outperforms single embedding methods.
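A hedged sketch of the multi-embedding idea: cluster the items a user interacted with and keep one medoid per cluster, so diverse interests map to several embeddings (the cluster count and linkage here are assumptions):

```python
# Hedged sketch: hierarchical clustering of a user's item embeddings,
# one medoid kept per cluster. Cluster count and linkage are assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def user_embeddings(item_embs: np.ndarray, n_clusters: int = 3) -> np.ndarray:
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(item_embs)
    medoids = []
    for c in range(n_clusters):
        members = item_embs[labels == c]
        centroid = members.mean(axis=0)
        # Medoid: the real item embedding closest to the cluster centroid.
        medoids.append(members[np.argmin(np.linalg.norm(members - centroid, axis=1))])
    return np.stack(medoids)

item_embs = np.random.rand(50, 16)        # 50 interacted items, 16-dim embeddings
print(user_embeddings(item_embs).shape)   # (3, 16): one embedding per interest
```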
arXiv Detail & Related papers (2020-07-07T17:13:20Z)
- Parallelizing Machine Learning as a Service for the End-User [14.389966909395058]
We present a distributed architecture that could be exploited to parallelize a typical ML system pipeline.
We present a case study consisting of a text mining service and discuss how the method generalizes to many similar applications.
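An illustrative sketch, not the paper's architecture: the per-document stage of a text-mining pipeline is embarrassingly parallel and can be fanned out with a worker pool:

```python
# Illustrative sketch: fan the per-document stage out across processes.
from collections import Counter
from multiprocessing import Pool

def mine(document: str) -> Counter:
    # Each document is processed independently (here: token counting),
    # so the map stage parallelizes across processes or machines.
    return Counter(document.lower().split())

if __name__ == "__main__":
    docs = ["the cat sat", "the dog ran", "a cat and a dog"]
    with Pool(processes=3) as pool:
        partial_counts = pool.map(mine, docs)
    totals = sum(partial_counts, Counter())  # reduce step
    print(totals.most_common(3))
```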
arXiv Detail & Related papers (2020-05-28T15:22:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.