Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems
- URL: http://arxiv.org/abs/2511.22880v1
- Date: Fri, 28 Nov 2025 05:04:02 GMT
- Title: Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems
- Authors: Shashwat Jaiswal, Shrikara Arun, Anjaly Parayil, Ankur Mallick, Spyros Mastorakis, Alind Khare, Chloi Alverti, Renee St Amant, Chetan Bansal, Victor Rühle, Josep Torrellas
- Abstract summary: Low-Rank Adaptation (LoRA) has become the de facto method for parameter-efficient fine-tuning of large language models (LLMs). In production, LoRA-based models are served at scale, creating multi-tenant environments with hundreds of adapters sharing a base model. We present LoRAServe, a workload-aware dynamic adapter placement and routing framework designed to tame rank diversity in LoRA serving.
- Score: 11.584593298674688
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Low-Rank Adaptation (LoRA) has become the de facto method for parameter-efficient fine-tuning of large language models (LLMs), enabling rapid adaptation to diverse domains. In production, LoRA-based models are served at scale, creating multi-tenant environments with hundreds of adapters sharing a base model. However, state-of-the-art serving systems co-batch heterogeneous adapters without accounting for rank (size) variability, leading to severe performance skew, which ultimately requires adding more GPUs to satisfy service-level objectives (SLOs). Existing optimizations, focused on loading, caching, and kernel execution, ignore this heterogeneity, leaving GPU resources underutilized. We present LoRAServe, a workload-aware dynamic adapter placement and routing framework designed to tame rank diversity in LoRA serving. By dynamically rebalancing adapters across GPUs and leveraging GPU Direct RDMA for remote access, LoRAServe maximizes throughput and minimizes tail latency under real-world workload drift. Evaluations on production traces from Company X show that LoRAServe delivers up to 2$\times$ higher throughput and up to 9$\times$ lower time to first token (TTFT), while using up to 50% fewer GPUs under SLO constraints compared to state-of-the-art systems.
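The abstract describes two mechanisms at a high level: rank- and workload-aware placement of adapters across GPUs, and request routing that can fall back to remote adapter access over GPU Direct RDMA. The sketch below illustrates the general idea only; the cost model (rank times request rate), the greedy least-loaded placement, the overload threshold, and all names are assumptions for illustration, not LoRAServe's actual algorithm.

```python
# Illustrative sketch of rank- and workload-aware adapter placement/routing.
# Assumptions (not from the paper): cost = rank * request rate, greedy
# least-loaded placement, and a fixed overload factor for remote routing.
import heapq
from dataclasses import dataclass, field


@dataclass
class Adapter:
    name: str
    rank: int            # LoRA rank (adapter "size")
    req_rate: float      # observed requests/sec from the workload trace


@dataclass
class Gpu:
    gpu_id: int
    load: float = 0.0                              # estimated compute load
    resident: list = field(default_factory=list)   # adapters placed here


def place_adapters(adapters, num_gpus):
    """Greedy placement: sort adapters by estimated cost (rank * req_rate)
    and always assign the next adapter to the least-loaded GPU."""
    gpus = [Gpu(i) for i in range(num_gpus)]
    heap = [(g.load, g.gpu_id) for g in gpus]
    heapq.heapify(heap)
    for a in sorted(adapters, key=lambda a: a.rank * a.req_rate, reverse=True):
        load, gid = heapq.heappop(heap)
        gpus[gid].resident.append(a.name)
        gpus[gid].load = load + a.rank * a.req_rate
        heapq.heappush(heap, (gpus[gid].load, gid))
    return gpus


def route(adapter_name, gpus, placement):
    """Send a request to the GPU holding its adapter; if that GPU is far more
    loaded than the least-loaded one, route there instead (in the paper's
    setting the adapter would then be accessed remotely, e.g. via RDMA)."""
    home = placement[adapter_name]
    least = min(gpus, key=lambda g: g.load)
    return home if gpus[home].load <= 1.5 * least.load else least.gpu_id


if __name__ == "__main__":
    adapters = [Adapter("a0", 8, 40.0), Adapter("a1", 64, 5.0),
                Adapter("a2", 16, 20.0), Adapter("a3", 128, 2.0)]
    gpus = place_adapters(adapters, num_gpus=2)
    placement = {name: g.gpu_id for g in gpus for name in g.resident}
    print(placement, route("a0", gpus, placement))
```

A real system would re-run placement as the workload drifts and would serve remote adapters over GPU Direct RDMA rather than this simple load comparison, as the abstract indicates.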
Related papers
- tLoRA: Efficient Multi-LoRA Training with Elastic Shared Super-Models [8.42285475305854]
tLoRA is a framework that enables efficient batch training of multiple LoRA jobs. Evaluations using real-world cluster traces demonstrate that tLoRA improves training by 1.2--1.8x, job completion time by 2.3--5.4x, and GPU utilization by 37%.
arXiv Detail & Related papers (2026-02-06T23:26:02Z) - LoRAFusion: Efficient LoRA Fine-Tuning for LLMs [7.13923757932177]
Low-Rank Adaptation (LoRA) has become the leading Parameter-Efficient Fine-Tuning (PEFT) method for Large Language Models (LLMs). We introduce LoRAFusion, an efficient LoRA fine-tuning system for LLMs. LoRAFusion achieves up to $1.96\times$ ($1.47\times$ on average) end-to-end speedup compared to Megatron-LM, and up to $1.46\times$ ($1.29\times$ on average) improvement over mLoRA.
arXiv Detail & Related papers (2025-09-30T19:26:22Z) - LoRA Fine-Tuning Without GPUs: A CPU-Efficient Meta-Generation Framework for LLMs [8.397730500554047]
Low-Rank Adapters (LoRAs) have transformed the fine-tuning of Large Language Models (LLMs) by enabling parameter-efficient updates. We propose a theoretically grounded approach to LoRA fine-tuning designed specifically for users with limited computational resources.
arXiv Detail & Related papers (2025-07-02T15:24:47Z) - Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights [75.83625828306839]
Drag-and-Drop LLMs (DnD) eliminates per-task training by mapping a handful of unlabeled task prompts directly to LoRA weight updates. A lightweight text encoder distills each prompt batch into condition embeddings, which are then transformed by a cascaded hyper-convolutional decoder into the full set of LoRA matrices.
arXiv Detail & Related papers (2025-06-19T15:38:21Z) - HSplitLoRA: A Heterogeneous Split Parameter-Efficient Fine-Tuning Framework for Large Language Models [30.345920952847752]
Large language models (LLMs) have achieved remarkable breakthroughs, revolutionizing the natural language processing domain and beyond. Due to their immense parameter sizes, fine-tuning these models with private data for diverse downstream tasks has become mainstream. We propose HSplitLoRA, a framework built on split learning (SL) and low-rank adaptation (LoRA) fine-tuning, for efficiently fine-tuning LLMs on heterogeneous client devices.
arXiv Detail & Related papers (2025-05-05T17:09:19Z) - Dynamic Low-Rank Sparse Adaptation for Large Language Models [54.1231638555233]
Low-rank Sparse Adaptation (LoSA) is a novel method that seamlessly integrates low-rank adaptation into LLM sparsity. LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning. LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inference overhead.
arXiv Detail & Related papers (2025-02-20T18:37:32Z) - Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs [75.11449420928139]
Fine-tuning Large Language Models (LLMs) has become a crucial technique for adapting pre-trained models to downstream tasks.
Low-Rank Adaptation (LoRA) has emerged as a promising solution, but a gap remains between the practical performance of low-rank adaptation and its theoretical optimum.
We propose eXtreme Gradient Boosting LoRA, a novel framework that bridges this gap by leveraging the power of ensemble learning.
arXiv Detail & Related papers (2024-10-25T17:07:13Z) - Retrieval-Augmented Mixture of LoRA Experts for Uploadable Machine Learning [57.36978335727009]
Low-Rank Adaptation (LoRA) offers an efficient way to fine-tune large language models (LLMs).
In this paper, we propose a framework that adaptively retrieves and composes multiple LoRAs based on input prompts.
arXiv Detail & Related papers (2024-06-24T05:24:41Z) - mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs [5.735411578779657]
Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is commonly used to adapt a base LLM to multiple downstream tasks.
LoRA platforms enable developers to fine-tune multiple models and develop various domain-specific applications simultaneously.
Existing model parallelism schemes suffer from high communication overhead and inefficient GPU utilization when training multiple LoRA tasks.
arXiv Detail & Related papers (2023-12-05T05:38:38Z) - S-LoRA: Serving Thousands of Concurrent LoRA Adapters [59.490751234925206]
Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks.
We present S-LoRA, a system designed for the scalable serving of many LoRA adapters.
arXiv Detail & Related papers (2023-11-06T17:26:17Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.