Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches
- URL: http://arxiv.org/abs/2408.10691v2
- Date: Tue, 1 Oct 2024 08:48:34 GMT
- Title: Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches
- Authors: Yanjie Dong, Haijun Zhang, Chengming Li, Song Guo, Victor C. M. Leung, Xiping Hu
- Abstract summary: Large language models (LLMs) have transitioned from specialized models to versatile foundation models.
LLMs exhibit impressive zero-shot ability but require fine-tuning on local datasets and significant resources for deployment.
- Score: 64.42735183056062
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since the release of GPT2-1.5B in 2019, large language models (LLMs) have transitioned from specialized models to versatile foundation models. LLMs exhibit impressive zero-shot ability but require fine-tuning on local datasets and significant resources for deployment. Traditional fine-tuning techniques with first-order optimizers require substantial GPU memory that exceeds mainstream hardware capability, which motivates the investigation of memory-efficient methods. Model compression techniques can reduce energy consumption, operational costs, and environmental impact, thereby supporting sustainable artificial intelligence advancements. Additionally, large-scale foundation models have expanded to create images, audio, videos, and multi-modal content, further emphasizing the need for efficient deployment. Therefore, we present a comprehensive overview of the prevalent memory-efficient fine-tuning methods over the network edge. We also review the state-of-the-art literature on model compression to provide a vision for deploying LLMs over the network edge.
Related papers
- Deploying Large AI Models on Resource-Limited Devices with Split Federated Learning [39.73152182572741]
This paper proposes a novel framework, named Quantized Split Federated Fine-Tuning Large AI Model (SFLAM).
By partitioning the training load between edge devices and servers, SFLAM can facilitate the operation of large models on devices.
SFLAM incorporates quantization management, power control, and bandwidth allocation strategies to enhance training efficiency.
arXiv Detail & Related papers (2025-04-12T07:55:11Z)
- SmolVLM: Redefining small and efficient multimodal models [8.849350918179752]
We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference.
We identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints.
Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance.
arXiv Detail & Related papers (2025-04-07T17:58:57Z)
- PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing [48.30406812516552]
We introduce the PLM, a Peripheral Language Model, developed through a co-design process that jointly optimizes model architecture and edge system constraints.
PLM employs a Multi-head Latent Attention mechanism and the squared ReLU activation function to encourage sparsity, thereby reducing peak memory footprint.
Evaluation results demonstrate that PLM outperforms existing small language models trained on publicly available data.
arXiv Detail & Related papers (2025-03-15T15:11:17Z)
- Sometimes Painful but Certainly Promising: Feasibility and Trade-offs of Language Model Inference at the Edge [3.1471494780647795]
Recent trends show a growing focus on compact models (typically under 10 billion parameters) enabled by techniques such as quantization.
This shift paves the way for LMs on edge devices, offering potential benefits such as enhanced privacy, reduced latency, and improved data sovereignty.
We present a comprehensive evaluation of generative LM inference on representative CPU-based and GPU-accelerated edge devices.
arXiv Detail & Related papers (2025-03-12T07:01:34Z)
- Low-Rank Adapters Meet Neural Architecture Search for LLM Compression [1.8434042562191815]
The rapid expansion of Large Language Models (LLMs) has posed significant challenges regarding the computational resources required for fine-tuning and deployment.
Recent advancements in low-rank adapters have demonstrated their efficacy in parameter-efficient fine-tuning (PEFT) of these models.
This paper comprehensively discusses innovative approaches that synergize low-rank representations with Neural Architecture Search (NAS) techniques.
arXiv Detail & Related papers (2025-01-23T02:14:08Z)
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
- Dual-Model Distillation for Efficient Action Classification with Hybrid Edge-Cloud Solution [1.8029479474051309]
We design a hybrid edge-cloud solution that leverages the efficiency of smaller models for local processing while deferring to larger, more accurate cloud-based models when necessary.
Specifically, we propose a novel unsupervised data generation method, Dual-Model Distillation (DMD), to train a lightweight switcher model that can predict when the edge model's output is uncertain.
Experimental results on the action classification task show that our framework not only requires less computational overhead, but also improves accuracy compared to using a large model alone.
arXiv Detail & Related papers (2024-10-16T02:06:27Z)
- POINTS: Improving Your Vision-language Model with Affordable Strategies [28.611705477757454]
We train a robust baseline model using the latest advancements in vision-language models.
We filter pre-training data using perplexity, selecting the lowest perplexity data for training.
During visual instruction tuning, we use model soup on different datasets when adding more datasets yields only marginal improvements.
arXiv Detail & Related papers (2024-09-07T13:41:37Z)
- Contemporary Model Compression on Large Language Models Inference [7.307436175842646]
Large Language Models (LLMs) have revolutionized natural language processing by achieving state-of-the-art results across a variety of tasks.
The computational demands of LLM inference, including high memory consumption and slow processing speeds, pose significant challenges for real-world applications.
This survey explores techniques in model compression that address these challenges by reducing the size and computational requirements of LLMs.
arXiv Detail & Related papers (2024-09-03T15:35:01Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
- Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z)
- ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models [70.45441031021291]
Large Vision-Language Models (LVLMs) can understand the world comprehensively by integrating rich information from different modalities.
LVLMs are often problematic due to their massive computational/energy costs and carbon consumption.
We propose Efficient Coarse-to-Fine LayerWise Pruning (ECoFLaP), a two-stage coarse-to-fine weight pruning approach for LVLMs.
arXiv Detail & Related papers (2023-10-04T17:34:00Z)
- Legal-Tech Open Diaries: Lesson learned on how to develop and deploy light-weight models in the era of humongous Language Models [10.086015702323971]
We follow the steps of the R&D group of a modern legal-tech start-up and present important insights on model development and deployment.
We start from ground zero by pre-training multiple domain-specific multi-lingual LMs which are a better fit to contractual and regulatory text.
We present benchmark results of such models in a half-public half-private legal benchmark comprising 5 downstream tasks showing the impact of larger model size.
arXiv Detail & Related papers (2022-10-24T10:08:59Z)
- M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models that require a high memory footprint.
arXiv Detail & Related papers (2021-10-08T04:24:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.