A Data-driven ML Approach for Maximizing Performance in LLM-Adapter Serving
- URL: http://arxiv.org/abs/2508.08343v2
- Date: Mon, 27 Oct 2025 14:59:46 GMT
- Title: A Data-driven ML Approach for Maximizing Performance in LLM-Adapter Serving
- Authors: Ferran Agullo, Joan Oliveras, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral
- Abstract summary: This study focuses on determining the joint configuration of concurrent and parallel adapters that maximizes GPU throughput without inducing starvation. We propose a data-driven ML approach leveraging interpretable models to tackle this caching problem. Experiments with the vLLM framework and LoRA adapters show that the Digital Twin reproduces throughput within 5.1% of real results.
- Score: 2.6336040306318274
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the rapid adoption of Large Language Models (LLMs), LLM-adapters have become increasingly common, providing lightweight specialization of large-scale models. Serving hundreds or thousands of these adapters on a single GPU allows request aggregation, increasing throughput, but may also cause request starvation if GPU memory limits are exceeded. To address this issue, this study focuses on determining the joint configuration of concurrent and parallel adapters that maximizes GPU throughput without inducing starvation, given heterogeneous adapter and traffic properties. We propose a data-driven ML approach leveraging interpretable models to tackle this caching problem and introduce the first Digital Twin capable of reproducing an LLM-adapter serving system, enabling efficient training data generation. Experiments with the vLLM framework and LoRA adapters show that the Digital Twin reproduces throughput within 5.1% of real results, while the ML approach predicts optimal numbers of concurrent and parallel adapters with an error of at most 7.2% under heterogeneous, real-world workloads.
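The optimization the abstract describes, choosing the pair (number of concurrent adapters, number of parallel adapters) that maximizes throughput, can be sketched with a small interpretable model. This is a hypothetical illustration: the feature set, the decision-tree regressor, and the synthetic training data below are assumptions, standing in for the paper's interpretable models trained on Digital Twin output.

```python
# Hypothetical sketch: fit an interpretable model on (config -> throughput)
# samples, then search for the config that maximizes predicted throughput.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for Digital-Twin-generated training data:
# columns = [num_concurrent_adapters, num_parallel_adapters]
X = rng.integers(1, 64, size=(500, 2)).astype(float)
# Toy throughput surface: rises with concurrency, collapses past a memory knee.
y = X[:, 0] * 10 + X[:, 1] * 5 - 0.4 * np.maximum(X.sum(axis=1) - 70, 0) ** 2

model = DecisionTreeRegressor(max_depth=6).fit(X, y)

# Enumerate candidate configurations and keep the best predicted one.
candidates = np.array([(c, p) for c in range(1, 65) for p in range(1, 65)],
                      dtype=float)
best = candidates[np.argmax(model.predict(candidates))]
print(f"predicted-best config: concurrent={int(best[0])}, parallel={int(best[1])}")
```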
Related papers
- Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving [2.6336040306318274]
Large Language Model (LLM) adapters enable low-cost model specialization.
LLM adapters introduce complex caching and scheduling challenges in distributed serving systems where hundreds of adapters must be hosted concurrently.
This paper presents a data-driven pipeline that computes an adapter placement that serves the workload with the minimum number of GPUs.
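As a rough illustration of the placement objective (the abstract does not describe the paper's actual pipeline), a first-fit-decreasing heuristic packs adapters onto as few GPUs as memory allows; the adapter sizes and GPU capacity below are invented.

```python
# Hypothetical first-fit-decreasing placement of adapters onto GPUs.
def place_adapters(adapter_sizes_mb, gpu_capacity_mb):
    gpus = []  # each GPU tracked by its remaining free memory
    placement = {}
    # Placing the largest adapters first reduces fragmentation.
    for name, size in sorted(adapter_sizes_mb.items(), key=lambda kv: -kv[1]):
        for i, free in enumerate(gpus):
            if size <= free:
                gpus[i] -= size
                placement[name] = i
                break
        else:  # no existing GPU fits: provision a new one
            gpus.append(gpu_capacity_mb - size)
            placement[name] = len(gpus) - 1
    return placement, len(gpus)

placement, n = place_adapters({"a": 300, "b": 900, "c": 600, "d": 500}, 1024)
print(placement, "GPUs used:", n)  # 3 GPUs for this toy workload
```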
arXiv Detail & Related papers (2026-02-27T14:22:51Z)
- Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems [11.584593298674688]
Low-Rank Adaptation (LoRA) has become the de facto method for parameter-efficient fine-tuning of large language models (LLMs).
In production, LoRA-based models are served at scale, creating multi-tenant environments with hundreds of adapters sharing a base model.
We present LoRAServe, a workload-aware dynamic adapter placement and routing framework designed to tame rank diversity in LoRA serving.
arXiv Detail & Related papers (2025-11-28T05:04:02Z)
- PHLoRA: data-free Post-hoc Low-Rank Adapter extraction from full-rank checkpoint [3.5840378192062956]
We introduce PHLoRA, a simple yet powerful method to extract low-rank adapters from full-rank fine-tuned models.
Unlike prior work that trains each adapter explicitly, our approach decouples fine-tuning from adapter generation.
Experiments on text, image, and video benchmarks using the Amazon Nova model family demonstrate that extracted adapters preserve high energy from the full weight delta, can be pruned safely, and yield negligible degradation in downstream task performance when re-merged.
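One plausible reading of data-free adapter extraction is a truncated SVD of the weight delta between the fine-tuned and base checkpoints. The sketch below illustrates that idea under invented shapes; it is not claimed to be PHLoRA's exact algorithm.

```python
# Hypothetical sketch: recover a rank-r adapter (B @ A) from a weight delta
# via truncated SVD, so that W_base + B @ A approximates W_finetuned.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 256, 128, 8

W_base = rng.standard_normal((d_out, d_in))
# Pretend fine-tuning added a genuinely low-rank update plus small noise.
true_update = rng.standard_normal((d_out, r)) @ rng.standard_normal((r, d_in))
W_ft = W_base + true_update + 0.01 * rng.standard_normal((d_out, d_in))

delta = W_ft - W_base
U, S, Vt = np.linalg.svd(delta, full_matrices=False)
B = U[:, :r] * S[:r]          # (d_out, r)
A = Vt[:r]                    # (r, d_in)

err = np.linalg.norm(delta - B @ A) / np.linalg.norm(delta)
print(f"relative reconstruction error of rank-{r} adapter: {err:.4f}")
```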
arXiv Detail & Related papers (2025-09-13T20:13:58Z)
- Improving Robustness of Foundation Models in Domain Adaptation with Soup-Adapters [0.0]
We show that by training multiple independent adapters and averaging their outputs, the resulting model achieves higher performance and is more robust to distribution shifts than any individual adapter.
This is also the first study to explore CLIP adapter-style techniques for DINOv2 and to directly compare them with CLIP in this setting.
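A minimal sketch of the output-averaging idea, assuming simple linear adapter heads over frozen features (the abstract does not specify the actual adapter architecture):

```python
# Hypothetical sketch: average the outputs of several independently
# initialized adapter heads applied to the same frozen backbone features.
import numpy as np

rng = np.random.default_rng(0)
feat_dim, num_classes, num_adapters = 512, 10, 4

features = rng.standard_normal((32, feat_dim))  # a batch of frozen features
adapters = [rng.standard_normal((feat_dim, num_classes)) * 0.01
            for _ in range(num_adapters)]

# "Soup": mean of per-adapter logits rather than a single adapter's output.
logits = np.mean([features @ W for W in adapters], axis=0)
print(logits.shape)  # (32, 10)
```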
arXiv Detail & Related papers (2025-07-08T09:26:10Z)
- EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices [7.596547050649462]
This paper introduces EdgeLoRA, an efficient system for serving Large Language Models (LLMs) on edge devices in multi-tenant environments.
EdgeLoRA incorporates three key innovations: (1) an adaptive adapter selection mechanism to streamline the adapter configuration process; (2) heterogeneous memory management, leveraging intelligent adapter caching and pooling to mitigate memory operation overhead; and (3) batch LoRA inference, enabling efficient batch processing to significantly reduce computational latency.
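The caching ingredient can be illustrated with a generic LRU cache keyed by adapter ID. This is not EdgeLoRA's actual memory manager; the load_fn callback is a hypothetical stand-in for fetching adapter weights from host memory or disk.

```python
# Hypothetical LRU cache for adapter weights, evicting the least recently
# used adapter when GPU-resident capacity is exhausted.
from collections import OrderedDict

class AdapterCache:
    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn      # e.g. reads weights from host memory/disk
        self.cache = OrderedDict()

    def get(self, adapter_id):
        if adapter_id in self.cache:
            self.cache.move_to_end(adapter_id)   # mark as most recently used
            return self.cache[adapter_id]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)       # evict least recently used
        weights = self.load_fn(adapter_id)
        self.cache[adapter_id] = weights
        return weights

cache = AdapterCache(capacity=2, load_fn=lambda i: f"weights-of-{i}")
for aid in ["a", "b", "a", "c", "b"]:
    cache.get(aid)
print(list(cache.cache))  # ['c', 'b']: "a" was evicted on the final miss
```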
arXiv Detail & Related papers (2025-07-02T07:47:28Z)
- Small Aid, Big Leap: Efficient Test-Time Adaptation for Vision-Language Models with AdaptNet [5.977269026037707]
Test-time adaptation (TTA) has emerged as a critical technique for enhancing the generalization capability of vision-language models (VLMs) during inference.
We introduce SAIL, a novel adapter-based TTA framework that leverages a lightweight, learnable AdaptNet to enable efficient and scalable model adaptation.
arXiv Detail & Related papers (2025-06-03T09:16:51Z)
- Accelerating Diffusion LLMs via Adaptive Parallel Decoding [60.407727995313074]
We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel.
APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.
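One way to make the parallel-token count adaptive is a feedback rule that probes upward while speculative tokens are accepted and backs off on rejection. The additive-increase/multiplicative-decrease rule and the acceptance simulation below are assumptions; the abstract does not give APD's actual control law.

```python
# Hypothetical additive-increase / multiplicative-decrease controller for the
# number of tokens proposed in parallel per decoding step.
import random

random.seed(0)
parallel = 1
for step in range(10):
    proposed = parallel
    # Stand-in for verification: each proposed token independently accepted.
    accepted = sum(random.random() < 0.8 for _ in range(proposed))
    if accepted == proposed:
        parallel = min(parallel + 1, 16)   # all accepted: probe higher
    else:
        parallel = max(parallel // 2, 1)   # rejection: back off
    print(f"step {step}: proposed={proposed} accepted={accepted} next={parallel}")
```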
arXiv Detail & Related papers (2025-05-31T06:10:10Z)
- On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI systems (which autonomously plan and act) are becoming widespread, yet their success rate on complex tasks remains low.
Inference-time alignment relies on three components: sampling, evaluation, and feedback.
We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques.
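The three components named in the abstract suggest a generic decode-critique-refine loop, sketched below. The generate, evaluate, and critique callables are placeholders, not IAD's actual components.

```python
# Hypothetical sketch of an iterative decode-critique-refine loop built from
# the three components the abstract names: sampling, evaluation, feedback.
def iterative_agent_decoding(task, generate, evaluate, critique, rounds=3):
    feedback = None
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        candidate = generate(task, feedback)      # sampling
        score = evaluate(task, candidate)         # evaluation
        if score > best_score:
            best, best_score = candidate, score
        feedback = critique(task, candidate)      # feedback for the next round
    return best, best_score

# Toy plumbing so the loop runs end to end.
ans, score = iterative_agent_decoding(
    task="2+2",
    generate=lambda t, fb: "4" if fb else "5",
    evaluate=lambda t, c: 1.0 if c == "4" else 0.0,
    critique=lambda t, c: None if c == "4" else "arithmetic error",
)
print(ans, score)  # 4 1.0
```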
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
- Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs [75.11449420928139]
Fine-tuning Large Language Models (LLMs) has become a crucial technique for adapting pre-trained models to downstream tasks.
Low-Rank Adaptation (LoRA) has emerged as a promising solution, but a gap remains between the practical performance of low-rank adaptation and its theoretical optimum.
We propose eXtreme Gradient Boosting LoRA, a novel framework that bridges this gap by leveraging the power of ensemble learning.
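The boosting idea can be illustrated by greedily fitting successive rank-1 terms to the residual of a target weight update, which mirrors gradient boosting in spirit; this sketch is not the paper's training procedure.

```python
# Hypothetical sketch: approximate a weight update as a boosted sum of
# rank-1 terms, each fitted to the residual left by the previous ones.
import numpy as np

rng = np.random.default_rng(0)
target = rng.standard_normal((64, 5)) @ rng.standard_normal((5, 32))  # rank-5

approx = np.zeros_like(target)
for t in range(5):
    residual = target - approx
    U, S, Vt = np.linalg.svd(residual, full_matrices=False)
    approx += S[0] * np.outer(U[:, 0], Vt[0])   # best rank-1 fit to residual
    err = np.linalg.norm(target - approx) / np.linalg.norm(target)
    print(f"after {t + 1} rank-1 terms: relative error {err:.4f}")
```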
arXiv Detail & Related papers (2024-10-25T17:07:13Z)
- Towards Optimal Adapter Placement for Efficient Transfer Learning [73.1149084352343]
Parameter-efficient transfer learning (PETL) aims to adapt pre-trained models to new downstream tasks while minimizing the number of fine-tuned parameters.
Adapters, a popular approach in PETL, inject additional capacity into existing networks through low-rank projections.
This paper investigates the relationship between the placement of an adapter and its performance.
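For context, the kind of low-rank bottleneck adapter the abstract refers to can be written in a few lines; dimensions are illustrative, and the zero-initialized up-projection makes the module an identity at the start of training.

```python
# Minimal bottleneck adapter: down-project, nonlinearity, up-project,
# plus a residual connection, in plain numpy for illustration.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_bottleneck = 768, 16

W_down = rng.standard_normal((d_model, d_bottleneck)) * 0.02
W_up = np.zeros((d_bottleneck, d_model))  # zero-init: adapter starts as identity

def adapter(h):
    return h + np.maximum(h @ W_down, 0.0) @ W_up  # residual + ReLU bottleneck

h = rng.standard_normal((4, d_model))
print(np.allclose(adapter(h), h))  # True at init, thanks to zero-initialized W_up
```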
arXiv Detail & Related papers (2024-10-21T10:37:17Z)
- Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System [75.25394449773052]
Large Language Model (LLM) based multi-agent systems (MAS) show remarkable potential in collaborative problem-solving.
Yet they still face critical challenges: low communication efficiency, poor scalability, and a lack of effective parameter-updating optimization methods.
We present Optima, a novel framework that addresses these issues by significantly enhancing both communication efficiency and task effectiveness.
arXiv Detail & Related papers (2024-10-10T17:00:06Z)
- LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design [23.874726096958135]
We analyze the fine-grained costs of dynamic adapters and find that fragmented kernel calls are the root cause.
Unlike most existing dynamic structures that adopt layer-wise or block-wise dynamic routing, LoRA-Switch introduces a token-wise routing mechanism.
For efficiency, this switching is implemented with an optimized kernel, which fuses the operations for all LoRA adapters at once.
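Token-wise routing can be sketched as selecting one LoRA adapter per token and applying it with a gather. The naive version below uses a single einsum over the batch; LoRA-Switch instead fuses the per-adapter operations into one optimized kernel, which this sketch does not attempt.

```python
# Hypothetical token-wise LoRA routing: each token is processed by the
# adapter its router score selects (naive gather-based version).
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, rank, num_adapters = 6, 32, 4, 3

x = rng.standard_normal((num_tokens, d_model))
A = rng.standard_normal((num_adapters, d_model, rank)) * 0.05
B = rng.standard_normal((num_adapters, rank, d_model)) * 0.05
router_logits = rng.standard_normal((num_tokens, num_adapters))

choice = router_logits.argmax(axis=1)            # one adapter per token
# Gather each token's adapter pair and apply x @ A_i @ B_i per token.
delta = np.einsum("td,tdr,trm->tm", x, A[choice], B[choice])
out = x + delta
print(choice, out.shape)  # per-token adapter ids, (6, 32)
```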
arXiv Detail & Related papers (2024-05-28T01:53:26Z)
- Towards Modular LLMs by Building and Reusing a Library of LoRAs [64.43376695346538]
We study how to best build a library of adapters given multi-task data.
We introduce model-based clustering, MBC, a method that groups tasks based on the similarity of their adapter parameters.
To re-use the library, we present a novel zero-shot routing mechanism, Arrow, which enables dynamic selection of the most relevant adapters.
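The abstract does not spell out Arrow's scoring rule. As a hedged sketch, zero-shot routing can be implemented by scoring a per-adapter prototype vector (assumed here to be derivable from the adapter parameters) against the input representation and activating the top-k adapters.

```python
# Hypothetical zero-shot adapter routing: score each adapter's prototype
# against the input's hidden state and activate the top-k adapters.
import numpy as np

rng = np.random.default_rng(0)
d_model, num_adapters, k = 64, 10, 2

# Assumed: one prototype direction per adapter (e.g. derived from its weights).
prototypes = rng.standard_normal((num_adapters, d_model))
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

h = rng.standard_normal(d_model)                  # input representation
scores = prototypes @ h / np.linalg.norm(h)       # cosine similarity
top_k = np.argsort(scores)[-k:][::-1]
print("routed to adapters:", top_k, "scores:", np.round(scores[top_k], 3))
```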
arXiv Detail & Related papers (2024-05-18T03:02:23Z)
- DAM: Dynamic Adapter Merging for Continual Video QA Learning [66.43360542692355]
We present a parameter-efficient method for continual video question-answering (VidQA) learning.
Our method uses the proposed Dynamic Adapter Merging to (i) mitigate catastrophic forgetting, (ii) enable efficient adaptation to continually arriving datasets, and (iii) enable knowledge sharing across similar dataset domains.
Our DAM model outperforms prior state-of-the-art continual learning approaches by 9.1% while exhibiting 1.9% less forgetting on 6 VidQA datasets spanning various domains.
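Dynamic merging can be sketched as a similarity-weighted average of dataset-specific adapter parameters computed per query; the softmax router below is a stand-in, not DAM's published mechanism.

```python
# Hypothetical dynamic adapter merging: weight each dataset-specific adapter
# by the input's similarity to that dataset, then merge the parameters.
import numpy as np

rng = np.random.default_rng(0)
d, num_datasets = 16, 3

adapter_weights = rng.standard_normal((num_datasets, d, d)) * 0.02
dataset_centroids = rng.standard_normal((num_datasets, 64))

def merged_adapter(query_embedding):
    sims = dataset_centroids @ query_embedding
    w = np.exp(sims) / np.exp(sims).sum()            # softmax mixing weights
    return np.tensordot(w, adapter_weights, axes=1)  # (d, d) merged adapter

merged = merged_adapter(rng.standard_normal(64))
print(merged.shape)  # (16, 16)
```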
arXiv Detail & Related papers (2024-03-13T17:53:47Z) - Distributed Inference and Fine-tuning of Large Language Models Over The
Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
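A rough flavor of throughput-oriented device assignment (not the paper's actual protocol) is a greedy balancer that always hands the next, largest remaining shard to the least-loaded device:

```python
# Hypothetical greedy load balancer: assign each model shard to the device
# whose current load is lowest, approximating balanced total throughput.
import heapq

def assign_shards(shard_costs, num_devices):
    heap = [(0.0, d) for d in range(num_devices)]  # (load, device)
    heapq.heapify(heap)
    assignment = {}
    for shard, cost in sorted(shard_costs.items(), key=lambda kv: -kv[1]):
        load, dev = heapq.heappop(heap)
        assignment[shard] = dev
        heapq.heappush(heap, (load + cost, dev))
    return assignment

print(assign_shards({"layer0": 4, "layer1": 3, "layer2": 2, "layer3": 3}, 2))
```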
arXiv Detail & Related papers (2023-12-13T18:52:49Z) - LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of
Large Language Models [75.25782573728677]
This paper presents a framework for adapter-based parameter-efficient fine-tuning (PEFT) of large language models (LLMs).
The framework includes state-of-the-art open-access LLMs such as LLaMA, BLOOM, and GPT-J, as well as widely used adapters such as Series adapters, Parallel adapters, Prompt-based learning, and Reparametrization-based methods.
We evaluate the effectiveness of the adapters on fourteen datasets from two different reasoning tasks, Arithmetic Reasoning and Commonsense Reasoning.
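The series-versus-parallel distinction among the adapters the framework covers can be shown directly: a series adapter transforms the frozen sublayer's output, while a parallel adapter runs alongside it on the sublayer's input. The toy sublayer and dimensions below are illustrative.

```python
# Illustrative contrast between series and parallel adapter composition
# around a frozen sublayer f (here a fixed linear map).
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2
W_frozen = rng.standard_normal((d, d))
W_down = rng.standard_normal((d, r)) * 0.1
W_up = rng.standard_normal((r, d)) * 0.1

def f(h):                 # frozen sublayer
    return h @ W_frozen

def adapter(h):           # small trainable bottleneck module
    return np.maximum(h @ W_down, 0.0) @ W_up

h = rng.standard_normal((4, d))
series_out = f(h) + adapter(f(h))   # series: adapter consumes the sublayer output
parallel_out = f(h) + adapter(h)    # parallel: adapter runs on the sublayer input
print(series_out.shape, parallel_out.shape)
```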
arXiv Detail & Related papers (2023-04-04T16:31:37Z) - AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large
Language Models [119.7093605087114]
Fine-tuning large-scale pre-trained language models for downstream tasks requires updating hundreds of millions of parameters.
This not only increases the serving cost to store a large copy of the model weights for every task, but also exhibits instability during few-shot task adaptation.
We introduce a new mechanism that improves adapter capacity, without increasing parameters or computational cost, via two key techniques.
arXiv Detail & Related papers (2022-05-24T23:41:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.