Related papers: Efficient Deployment of Large Language Models on Resource-constrained Devices

Efficient Deployment of Large Language Models on Resource-constrained Devices

URL: http://arxiv.org/abs/2501.02438v1
Date: Sun, 05 Jan 2025 04:38:11 GMT
Title: Efficient Deployment of Large Language Models on Resource-constrained Devices
Authors: Zhiwei Yao, Yang Xu, Hongli Xu, Yunming Liao, Zuan Xie,
Abstract summary: It is necessary to fine-tune Large Language Models (LLMs) on resource-constrained devices for various downstream tasks.<n>FedSpine is a framework that combines Efficient Fine-Tuning (PEFT) with structured pruning for efficient deployment of LLMs on resource-constrained devices.<n>We show that FedSpine can speed up fine-tuning by 1.4times$$$times and improve final accuracy by 0.4%-4.5% under the same sparsity level compared to other baselines.
Score: 12.644230479753476
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deploying Large Language Models (LLMs) on resource-constrained (or weak) devices presents significant challenges due to limited resources and heterogeneous data distribution. To address the data concern, it is necessary to fine-tune LLMs using on-device private data for various downstream tasks. While Federated Learning (FL) offers a promising privacy-preserving solution, existing fine-tuning methods retain the original LLM size, leaving issues of high inference latency and excessive memory demands unresolved. Hence, we design FedSpine, an FL framework that combines Parameter- Efficient Fine-Tuning (PEFT) with structured pruning for efficient deployment of LLMs on resource-constrained devices. Specifically, FedSpine introduces an iterative process to prune and tune the parameters of LLMs. To mitigate the impact of device heterogeneity, an online Multi-Armed Bandit (MAB) algorithm is employed to adaptively determine different pruning ratios and LoRA ranks for heterogeneous devices without any prior knowledge of their computing and communication capabilities. As a result, FedSpine maintains higher inference accuracy while improving fine-tuning efficiency. Experimental results conducted on a physical platform with 80 devices demonstrate that FedSpine can speed up fine-tuning by 1.4$\times$-6.9$\times$ and improve final accuracy by 0.4%-4.5% under the same sparsity level compared to other baselines.

Related papers

Resource-Efficient Federated Fine-Tuning Large Language Models for Heterogeneous Data [16.844142562389443]
Fine-tuning large language models (LLMs) via federated learning, i.e., FedLLM, has been proposed to adapt LLMs for various downstream applications in a privacy-preserving way. To reduce the fine-tuning costs on resource-constrained devices, FedLoRA is proposed to fine-tune only a small subset of model parameters by integrating low-rank adaptation (LoRA) into FedLLM. Here, we propose a hierarchical FedLoRA framework, termed HierFedLoRA, to address these challenges.
arXiv Detail & Related papers (2025-03-27T07:05:22Z)
Efficient Federated Fine-Tuning of Large Language Models with Layer Dropout [15.009864792277236]
Fine-tuning plays a crucial role in enabling pre-trained LLMs to evolve from general language comprehension to task-specific expertise. This work proposes DropPEFT, an innovative federated PEFT framework that employs a novel transformer dropout method. We show that DropPEFT can achieve a 1.3-6.3times speedup in model convergence and a 40%-67% reduction in memory footprint.
arXiv Detail & Related papers (2025-03-13T09:59:16Z)
Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization [61.02719787737867]
Large language models (LLMs) are increasingly deployed and democratized on edge devices. One promising solution is uncertainty-based SLM routing, offloading high-stakes queries to stronger LLMs when resulting in low-confidence responses on SLM. We conduct a comprehensive investigation into benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs over 1500+ settings.
arXiv Detail & Related papers (2025-02-06T18:59:11Z)
Adaptive Rank Allocation for Federated Parameter-Efficient Fine-Tuning of Language Models [40.69348434971122]
We propose FedARA, a novel Adaptive Rank Allocation framework for federated parameter-efficient fine-tuning of language models. FedARA consistently outperforms baselines by an average of 6.95% to 8.49% across various datasets and models under heterogeneous data. Experiments on various edge devices demonstrate substantial decreases in total training time and energy consumption by up to 48.90% and 46.95%, respectively.
arXiv Detail & Related papers (2025-01-24T11:19:07Z)
R-SFLLM: Jamming Resilient Framework for Split Federated Learning with Large Language Models [83.77114091471822]
Split federated learning (SFL) is a compute-efficient paradigm in distributed machine learning (ML) A challenge in SFL, particularly when deployed over wireless channels, is the susceptibility of transmitted model parameters to adversarial jamming. This is particularly pronounced for word embedding parameters in large language models (LLMs), which are crucial for language understanding. A physical layer framework is developed for resilient SFL with LLMs (R-SFLLM) over wireless networks.
arXiv Detail & Related papers (2024-07-16T12:21:29Z)
FedLPS: Heterogeneous Federated Learning for Multiple Tasks with Local Parameter Sharing [14.938531944702193]
We propose Federated Learning with Local Heterogeneous Sharing (FedLPS) FedLPS uses transfer learning to facilitate the deployment of multiple tasks on a single device by dividing the local model into a shareable encoder and task-specific encoders. FedLPS significantly outperforms the state-of-the-art (SOTA) FL frameworks by up to 4.88% and reduces the computational resource consumption by 21.3%.
arXiv Detail & Related papers (2024-02-13T16:30:30Z)
Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes [53.4856038354195]
Pre-trained large language models (LLMs) need fine-tuning to improve their responsiveness to natural language instructions. FedKSeed employs zeroth-order optimization with a finite set of random seeds. It significantly reduces transmission requirements between the server and clients to just a few random seeds.
arXiv Detail & Related papers (2023-12-11T13:03:21Z)
Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization [71.87335804334616]
Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data. The training process of Large Language Models (LLMs) generally incurs the update of significant parameters. This paper proposes an efficient partial prompt tuning approach to improve performance and efficiency simultaneously.
arXiv Detail & Related papers (2023-10-23T16:37:59Z)
FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning [70.38817963253034]
This paper first discusses these challenges of federated fine-tuning LLMs, and introduces our package FS-LLM as a main contribution. We provide comprehensive federated parameter-efficient fine-tuning algorithm implementations and versatile programming interfaces for future extension in FL scenarios. We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings.
arXiv Detail & Related papers (2023-09-01T09:40:36Z)
Adaptive Federated Pruning in Hierarchical Wireless Networks [69.6417645730093]
Federated Learning (FL) is a privacy-preserving distributed learning framework where a server aggregates models updated by multiple devices without accessing their private datasets. In this paper, we introduce model pruning for HFL in wireless networks to reduce the neural network scale. We show that our proposed HFL with model pruning achieves similar learning accuracy compared with the HFL without model pruning and reduces about 50 percent communication cost.
arXiv Detail & Related papers (2023-05-15T22:04:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.