Confidant: Customizing Transformer-based LLMs via Collaborative Edge
Training
- URL: http://arxiv.org/abs/2311.13381v1
- Date: Wed, 22 Nov 2023 13:20:59 GMT
- Title: Confidant: Customizing Transformer-based LLMs via Collaborative Edge
Training
- Authors: Yuhao Chen, Yuxuan Yan, Qianqian Yang, Yuanchao Shu, Shibo He, Jiming
Chen
- Abstract summary: Transformer-based large language models (LLMs) have demonstrated impressive capabilities in a variety of natural language processing (NLP) tasks.
It is challenging to deploy and fine-tune LLMs on mobile edge devices with limited computing, memory, and energy budgets.
We propose Confidant, a multi-backend collaborative training framework for customizing state-of-the-art LLMs on commodity mobile devices.
- Score: 18.526329975259483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based large language models (LLMs) have demonstrated impressive
capabilities in a variety of natural language processing (NLP) tasks.
Nonetheless, it is challenging to deploy and fine-tune LLMs on mobile edge
devices with limited computing, memory, and energy budgets. In this paper, we
propose Confidant, a multi-backend collaborative training framework for
customizing state-of-the-art LLMs on commodity mobile devices like smartphones.
Confidant partitions an LLM into several sub-models so that each fits into a
mobile device's memory. A pipeline parallel training mechanism is further
developed to ensure fast and efficient distributed training. In addition, we
propose a novel backend scheduler to allocate different attention heads to
heterogeneous compute hardware, including mobile CPU and GPUs, to maximize the
compute resource utilization on each edge device. Our preliminary experimental
results show that Confidant achieves at most 45.3% memory reduction and 8.03x
inference speedup in practical settings.
Related papers
- PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms [11.87161637895978]
We introduce our lightweight, all-in-one automated benchmarking framework that allows users to evaluate large language models on mobile devices.
We provide a benchmark of various popular LLMs with different quantization configurations (both weights and activations) across multiple mobile platforms with varying hardware capabilities.
arXiv Detail & Related papers (2024-10-05T03:37:07Z) - Resource Allocation for Stable LLM Training in Mobile Edge Computing [11.366306689957353]
This paper explores a collaborative training framework that integrates mobile users with edge servers to optimize resource allocation.
We formulate a multi-objective optimization problem to minimize the total energy consumption and delay during training.
We also address the common issue of instability in model performance by incorporating stability enhancements into our objective function.
arXiv Detail & Related papers (2024-09-30T12:36:27Z) - NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z) - Mobile Edge Intelligence for Large Language Models: A Contemporary Survey [32.22789677882933]
Mobile edge intelligence (MEI) provides AI capabilities within the edge of mobile networks with improved privacy and latency relative to cloud computing.
MEI sits between on-device AI and cloud-based AI, featuring wireless communications and more powerful computing resources than end devices.
This article provides a contemporary survey on harnessing MEI for LLMs.
arXiv Detail & Related papers (2024-07-09T13:47:05Z) - Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms [4.959530958049395]
We develop a pipeline to Characterize and predict the training performance of modern machine learning (ML) workloads on compute systems.
Our pipeline generalizes to other types of ML workloads, such as Transformer-based NLP models.
It is capable of generating insights such as quickly selecting the fastest embedding table sharding configuration.
arXiv Detail & Related papers (2024-04-19T07:20:33Z) - MELTing point: Mobile Evaluation of Language Transformers [8.238355633015068]
We explore the current state of mobile execution of Large Language Models (LLMs)
We have created our own automation infrastructure, MELT, which supports the headless execution and benchmarking of LLMs on device.
We evaluate popular instruction fine-tuned LLMs and leverage different frameworks to measure their end-to-end and granular performance.
arXiv Detail & Related papers (2024-03-19T15:51:21Z) - MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development.
This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z) - Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive
Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential vast untapped consumer-level GPU.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.