High-Throughput LLM Inference on Heterogeneous Clusters
- URL: http://arxiv.org/abs/2504.15303v1
- Date: Fri, 18 Apr 2025 08:59:11 GMT
- Title: High-Throughput LLM Inference on Heterogeneous Clusters
- Authors: Yi Xiong, Jinqi Huang, Wenjie Huang, Xuebing Yu, Entong Li, Zhixiong Ning, Jinhua Zhou, Li Zeng, Xin Chen
- Abstract summary: Large language model (LLM) inference on heterogeneous clusters presents two main challenges. A novel mechanism is proposed to schedule requests among instances, which fully considers the different processing capabilities of various instances. Extensive experiments show that the proposed scheduler improves throughput by 122.5% and 33.6% on two heterogeneous clusters.
- Score: 6.11367906161332
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nowadays, many companies possess various types of AI accelerators, forming heterogeneous clusters. Efficiently leveraging these clusters for high-throughput large language model (LLM) inference services can significantly reduce costs and expedite task processing. However, LLM inference on heterogeneous clusters presents two main challenges. First, different deployment configurations can result in vastly different performance. The number of possible configurations is large, and evaluating the effectiveness of a specific setup is complex, so finding an optimal configuration is not an easy task. Second, LLM inference instances within a heterogeneous cluster possess varying processing capacities, leading to different processing speeds for handling inference requests. Evaluating these capacities and designing a request scheduling algorithm that fully exploits the potential of each instance are challenging. In this paper, we propose a high-throughput inference service system for heterogeneous clusters. First, the deployment configuration is optimized by modeling the resource amount and expected throughput and using exhaustive search. Second, a novel mechanism is proposed to schedule requests among instances, one that fully considers the different processing capabilities of the various instances. Extensive experiments show that the proposed scheduler improves throughput by 122.5% and 33.6% on two heterogeneous clusters, respectively.
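As a rough illustration of the system's two components, here is a minimal Python sketch: exhaustive search over candidate deployment configurations scored by a throughput model, followed by capacity-aware request dispatch. The Config fields, the throughput formula, and all numbers are illustrative assumptions; the abstract does not specify the paper's actual models.

```python
from dataclasses import dataclass
from itertools import product

# Step 1: pick a deployment configuration by exhaustive search (sketch).
# The scoring function is a toy stand-in for the paper's model of
# resource amount and expected throughput.

@dataclass(frozen=True)
class Config:
    accelerator: str      # accelerator type in the heterogeneous cluster
    tensor_parallel: int  # model-parallel degree per instance
    replicas: int         # number of inference instances

BASE_TPS = {"accel_a": 100.0, "accel_b": 60.0}  # assumed tokens/s at TP=1
GPU_BUDGET = {"accel_a": 8, "accel_b": 8}       # assumed devices per type

def feasible(cfg: Config, mem_per_dev_gb=80, model_mem_gb=140) -> bool:
    """Reject configurations that exceed the device or memory budget."""
    return (cfg.tensor_parallel * cfg.replicas <= GPU_BUDGET[cfg.accelerator]
            and model_mem_gb / cfg.tensor_parallel <= mem_per_dev_gb)

def expected_throughput(cfg: Config) -> float:
    """Toy model: per-instance speed scales sublinearly with TP; replicas add up."""
    return BASE_TPS[cfg.accelerator] * cfg.tensor_parallel ** 0.8 * cfg.replicas

candidates = [Config(a, tp, r)
              for a, tp, r in product(BASE_TPS, (1, 2, 4, 8), (1, 2, 4, 8))]
best = max(filter(feasible, candidates), key=expected_throughput)

# Step 2: capacity-aware request dispatch (sketch). Each instance reports
# its measured speed and outstanding work; a request goes to the instance
# that will drain its queue soonest.

def pick_instance(instances):
    """instances: iterable of dicts with 'queued_tokens' and 'tokens_per_s'."""
    return min(instances, key=lambda i: i["queued_tokens"] / i["tokens_per_s"])

print(best, expected_throughput(best))
```

In the real system the configuration search would also decide how instances are placed across accelerator types, and the scheduler would measure per-instance capacities online rather than assume them; the sketch only conveys the shape of the two decisions.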
Related papers
- Multi Activity Sequence Alignment via Implicit Clustering [50.3168866743067]
We propose a novel framework that overcomes the limitations of prior approaches by performing sequence alignment via implicit clustering.
Specifically, our key idea is to perform implicit clip-level clustering while aligning frames in sequences.
Our experiments show that our proposed method outperforms the state of the art.
arXiv Detail & Related papers (2025-03-16T14:28:46Z)
- Make Optimization Once and for All with Fine-grained Guidance [78.14885351827232]
Learning to Optimize (L2O) enhances optimization efficiency with integrated neural networks. L2O paradigms achieve great outcomes, e.g., refitting or generating unseen solutions iteratively or directly. Our analyses explore a general framework for learning to optimize, called Diff-L2O, focusing on augmenting solutions from a wider view.
arXiv Detail & Related papers (2025-03-14T14:48:12Z)
- A3S: A General Active Clustering Method with Pairwise Constraints [66.74627463101837]
A3S features strategic active clustering adjustment on the initial cluster result, which is obtained by an adaptive clustering algorithm.
In extensive experiments across diverse real-world datasets, A3S achieves desired results with significantly fewer human queries.
arXiv Detail & Related papers (2024-07-14T13:37:03Z)
- Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity [59.57065228857247]
Retrieval-augmented Large Language Models (LLMs) have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA).
We propose a novel adaptive QA framework that can dynamically select the most suitable strategy for (retrieval-augmented) LLMs based on the query complexity.
We validate our model on a set of open-domain QA datasets covering multiple query complexities, and show that our approach enhances the overall efficiency and accuracy of QA systems.
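The routing idea is easy to sketch. In the code below, the keyword heuristic is a stub standing in for the trained complexity classifier, and the `llm`/`retriever` callables and their signatures are hypothetical assumptions.

```python
def classify_complexity(query: str) -> str:
    """Stub: map a query to 'simple', 'single_hop', or 'multi_hop'."""
    if any(w in query.lower() for w in (" and ", "compare", "both")):
        return "multi_hop"
    if len(query.split()) > 8:
        return "single_hop"
    return "simple"

def answer(query: str, llm, retriever) -> str:
    """Assumes llm(prompt, context=None) -> str and retriever(query, k) -> list[str]."""
    level = classify_complexity(query)
    if level == "simple":              # answer from parametric knowledge alone
        return llm(query)
    if level == "single_hop":          # one retrieval round, then read
        return llm(query, context=retriever(query, k=5))
    context, q = [], query             # multi-hop: iterative retrieve-then-read
    for _ in range(3):
        context += retriever(q, k=3)
        q = llm(f"Write the next sub-question for: {query}", context=context)
    return llm(query, context=context)
```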
arXiv Detail & Related papers (2024-03-21T13:52:30Z)
- Sample-Efficient "Clustering and Conquer" Procedures for Parallel Large-Scale Ranking and Selection [0.0]
We modify the commonly used "divide and conquer" framework in parallel computing by adding a correlation-based clustering step. This seemingly simple modification achieves the optimal sample complexity reduction for a widely used class of efficient large-scale R&S procedures. In large-scale AI applications such as neural architecture search, our methods demonstrate superior performance.
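As a loose sketch of the idea, the code below clusters alternatives by the correlation of their sample histories and then concentrates the remaining sampling budget within each cluster. The distance threshold, the even budget split, and the "sample the cluster leader" rule are all illustrative assumptions, not the paper's procedure.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_then_select(samples, n_extra, rng=None):
    """samples: (k, n0) array of initial observations per alternative.
    Groups correlated alternatives, spends extra budget on each cluster's
    leader, and returns the index of the best alternative (sketch)."""
    rng = rng or np.random.default_rng()
    corr = np.corrcoef(samples)                  # pairwise sample correlation
    dist = np.clip(1.0 - corr, 0.0, None)        # correlation -> distance
    condensed = dist[np.triu_indices_from(dist, k=1)]
    labels = fcluster(linkage(condensed, method="average"),
                      t=0.5, criterion="distance")
    means = samples.mean(axis=1)
    n0 = samples.shape[1]
    clusters = np.unique(labels)
    per_cluster = n_extra // len(clusters)       # even split (assumed)
    for c in clusters:
        idx = np.where(labels == c)[0]
        leader = idx[np.argmax(means[idx])]      # best-looking alternative
        extra = rng.normal(means[leader], 1.0, size=per_cluster)  # stand-in runs
        means[leader] = (samples[leader].sum() + extra.sum()) / (n0 + per_cluster)
    return int(np.argmax(means))
```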
arXiv Detail & Related papers (2024-02-03T15:56:03Z)
- End-to-end Learnable Clustering for Intent Learning in Recommendation [54.157784572994316]
We propose a novel intent learning method termed ELCRec.
It unifies behavior representation learning into an End-to-end Learnable Clustering framework.
We deploy this method on an industrial recommendation system with 130 million page views and achieve promising results.
arXiv Detail & Related papers (2024-01-11T15:22:55Z)
- Efficient and Effective Deep Multi-view Subspace Clustering [9.6753782215283]
We propose a novel deep framework, termed Efficient and Effective deep Multi-View Subspace Clustering (E$^2$MVSC).
Instead of a parameterized FC layer, we design a Relation-Metric Net that decouples network parameter scale from sample numbers for greater computational efficiency.
E$^2$MVSC yields comparable results to existing methods and achieves state-of-the-art performance on various types of multi-view datasets.
arXiv Detail & Related papers (2023-10-15T03:08:25Z)
- One-step Multi-view Clustering with Diverse Representation [47.41455937479201]
We propose a one-step multi-view clustering method with diverse representation, which incorporates multi-view learning and $k$-means into a unified framework.
We develop an efficient optimization algorithm with proven convergence to solve the resultant problem.
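The blurb's "unified framework" suggests jointly learning view weights and cluster assignments; the hedged sketch below only mimics that shape with a naive alternating loop (weighted-concatenation k-means plus view re-weighting). Everything here, the weighting rule included, is an assumption rather than the paper's algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def one_step_mvc(views, n_clusters, n_iter=5):
    """views: list of (n_samples, d_v) arrays, one per view.
    Alternates k-means on a weighted concatenation of views with
    view re-weighting by cluster tightness (a naive stand-in for
    a single unified objective)."""
    w = np.ones(len(views)) / len(views)
    labels = None
    for _ in range(n_iter):
        fused = np.hstack([np.sqrt(wv) * v for wv, v in zip(w, views)])
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(fused)
        scatter = np.array([  # within-cluster scatter per view
            sum(((v[labels == c] - v[labels == c].mean(axis=0)) ** 2).sum()
                for c in range(n_clusters) if (labels == c).any())
            for v in views])
        w = 1.0 / (scatter + 1e-9)   # tighter views get more weight
        w /= w.sum()
    return labels, w
```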
arXiv Detail & Related papers (2023-06-08T02:52:24Z)
- Self-Learning Symmetric Multi-view Probabilistic Clustering [35.96327818838784]
Multi-view Clustering (MVC) has achieved significant progress, with many efforts dedicated to learning knowledge from multiple views.
Most existing methods are either not applicable or require additional steps for incomplete MVC.
We propose a novel unified framework for incomplete and complete MVC named self-learning symmetric multi-view probabilistic clustering.
arXiv Detail & Related papers (2023-05-12T08:27:03Z)
- COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training [42.514897110537596]
Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train.
Designing such clusters to maximize both performance and utilization, and thus amortize their steep cost, is a challenging task.
We introduce COMET, a holistic cluster design methodology and workflow to jointly study the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training.
arXiv Detail & Related papers (2022-11-30T00:32:37Z)
- Semisoft Task Clustering for Multi-Task Learning [2.806911268410107]
Multi-task learning (MTL) aims to improve the performance of multiple related prediction tasks by leveraging useful information from them.
We propose a semisoft task clustering approach that can simultaneously reveal the task clustering structure for both pure and mixed tasks and select the relevant features.
The experimental results based on synthetic and real-world datasets validate the effectiveness and efficiency of the proposed approach.
arXiv Detail & Related papers (2022-11-28T07:23:56Z)
- Late Fusion Multi-view Clustering via Global and Local Alignment Maximization [61.89218392703043]
Multi-view clustering (MVC) optimally integrates complementary information from different views to improve clustering performance.
Most existing approaches directly fuse multiple pre-specified similarities to learn an optimal similarity matrix for clustering.
We propose late fusion MVC via alignment to address these issues.
arXiv Detail & Related papers (2022-08-02T01:49:31Z)
- Conjugate Mixture Models for Clustering Multimodal Data [24.640116037967985]
The problem of multimodal clustering arises whenever the data are gathered with several physically different sensors.
We show that multimodal clustering can be addressed within a novel framework, namely conjugate mixture models.
arXiv Detail & Related papers (2020-12-09T10:13:22Z)
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
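For orientation only, here is the standard quadratic-time agglomerative primitive that such methods scale up. This is plain SciPy, not the paper's billion-point algorithm.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 8))                 # toy data
tree = linkage(points, method="average")           # full merge hierarchy, O(n^2)
flat = fcluster(tree, t=10, criterion="maxclust")  # cut the tree into 10 clusters
```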
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.