Related papers: Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead

Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead

URL: http://arxiv.org/abs/2407.00066v3
Date: Sat, 01 Feb 2025 21:56:34 GMT
Title: Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead
Authors: Rickard Brüel-Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen, Kristjan Greenewald, Mikhail Yurochkin, Justin Solomon,
Abstract summary: Fine-tuning large language models with low-rank adaptations (LoRAs) has become common practice, often yielding numerous copies of the same LLM differing only in their LoRA updates.<n>This paradigm presents challenges for systems that serve real-time responses to queries that each involve a different LoRA.<n>We propose a method for the joint compression of LoRAs into a shared basis paired with LoRA-specific scaling matrices.
Score: 41.31302904190149
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Fine-tuning large language models (LLMs) with low-rank adaptations (LoRAs) has become common practice, often yielding numerous copies of the same LLM differing only in their LoRA updates. This paradigm presents challenges for systems that serve real-time responses to queries that each involve a different LoRA. Prior works optimize the design of such systems but still require continuous loading and offloading of LoRAs, as it is infeasible to store thousands of LoRAs in GPU memory. To mitigate this issue, we investigate the efficacy of compression when serving LoRAs. We propose a method for the joint compression of LoRAs into a shared basis paired with LoRA-specific scaling matrices. We extend our algorithm to learn clusters of LoRAs that are amenable to joint compression, allowing it to scale gracefully to large LoRA collections. Our experiments with up to 1000 LoRAs demonstrate that compressed LoRAs preserve performance while offering major throughput gains in realistic serving scenarios with over a thousand LoRAs, maintaining 80% of the throughput of serving a single LoRA.

Related papers

Each Rank Could be an Expert: Single-Ranked Mixture of Experts LoRA for Multi-Task Learning [53.98941571078398]
Low-Rank Adaptation (LoRA) is widely used for adapting large language models (LLMs) to specific domains due to its efficiency and modularity. Recent works adopt Mixture of Experts (MoE) by treating each LoRA module as an expert, thereby mitigating task interference through multiple specialized LoRA modules. While effective, these methods often isolate knowledge within individual tasks, failing to fully exploit the shared knowledge across related tasks. We propose Single-ranked Mixture of Experts LoRA (textbfSMoRA), which embeds MoE into LoRA by textittreating each rank as an
arXiv Detail & Related papers (2025-01-25T06:56:39Z)
LoRA Learns Less and Forgets Less [25.09261710396838]
Low-Rank Adaptation (LoRA) is a widely-used parameter-efficient finetuning method for large language models. We compare the performance of LoRA and full finetuning on two target domains, programming and mathematics.
arXiv Detail & Related papers (2024-05-15T19:27:45Z)
LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario [61.99243609126672]
We study how to inject backdoor into the LoRA module and dive deeper into LoRA's infection mechanisms. Our aim is to raise awareness of the potential risks under the emerging share-and-play scenario, so as to proactively prevent potential consequences caused by LoRA-as-an-Attack.
arXiv Detail & Related papers (2024-02-29T20:25:16Z)
LoRA-Flow: Dynamic LoRA Fusion for Large Language Models in Generative Tasks [72.88244322513039]
LoRA employs lightweight modules to customize large language models (LLMs) for each downstream task or domain. We propose LoRA-Flow, which utilizes dynamic weights to adjust the impact of different LoRAs. Experiments across six generative tasks demonstrate that our method consistently outperforms baselines with task-level fusion weights.
arXiv Detail & Related papers (2024-02-18T04:41:25Z)
LoraRetriever: Input-Aware LoRA Retrieval and Composition for Mixed Tasks in the Wild [76.67343971195267]
Low-Rank Adaptation (LoRA) provides an efficient solution for fine-tuning large language models (LLM) LoraRetriever is a retrieve-then-compose framework that adaptively retrieves and composes multiple LoRAs according to the input prompts. Experimental results indicate that LoraRetriever consistently outperforms the baselines.
arXiv Detail & Related papers (2024-02-15T15:02:46Z)
LoRA-drop: Efficient LoRA Parameter Pruning based on Output Evaluation [27.123271324468657]
Low-Rank Adaptation (LoRA) is currently the most commonly used. efficient fine-tuning (PEFT) method. It introduces auxiliary parameters for each layer to fine-tune the pre-trained model under limited computing resources. However, it still faces resource consumption challenges when scaling up to larger models.
arXiv Detail & Related papers (2024-02-12T15:34:56Z)
S-LoRA: Serving Thousands of Concurrent LoRA Adapters [59.490751234925206]
Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks. We present S-LoRA, a system designed for the scalable serving of many LoRA adapters.
arXiv Detail & Related papers (2023-11-06T17:26:17Z)
CA-LoRA: Adapting Existing LoRA for Compressed LLMs to Enable Efficient Multi-Tasking on Personal Devices [78.16679232748196]
We introduce a Compression-Aware LoRA (CA-LoRA) framework to transfer Large Language Models (LLMs) to other tasks. Experiment results demonstrate that CA-LoRA outperforms the vanilla LoRA methods applied to a compressed LLM. The source code of CA-LoRA is available at https://github.com/thunlp/CA-LoRA.
arXiv Detail & Related papers (2023-07-15T04:37:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.