SPA: Towards A Computational Friendly Cloud-Base and On-Devices Collaboration Seq2seq Personalized Generation with Casual Inference
- URL: http://arxiv.org/abs/2403.07088v6
- Date: Thu, 5 Sep 2024 18:26:56 GMT
- Title: SPA: Towards A Computational Friendly Cloud-Base and On-Devices Collaboration Seq2seq Personalized Generation with Casual Inference
- Authors: Yanming Liu, Xinyue Peng, Shi Bo, Ningjing Sang, Yafeng Yan, Xiaolan Ke, Zhiting Zheng, Shaobo Liu, Songhang Deng, Jiannan Cao, Le Dai, Xingzu Liu, Ruilin Nong, Weihao Liu,
- Abstract summary: Large language models require substantial memory storage on low-resource devices.
We propose SPA(Side on Adaption), a lightweight architecture for fast on-devices inference.
- Score: 2.305850376905315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models(LLMs) have shown its outperforming ability on various tasks and question answering. However, LLMs require substantial memory storage on low-resource devices. More critically, the computational speed on these devices is also severely limited. In this paper, we propose SPA(Side Plugin Adaption), a lightweight architecture for fast on-devices inference on the constraints of strict on-devices computation and memory constraints. Compared with other on-devices seq2seq generation, SPA could make a fast and stable inference on low-resource constraints, allowing it to obtain cost effiency. Our method establish an interaction between a pretrained LLMs on-cloud and additive parameters on-devices, which could provide the knowledge on both pretrained LLMs and featured personal feature. Further more, SPA provides a framework to keep feature-base parameters on low computational devices while leave the parameters containing general information on the high computational devices.
Related papers
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z) - Ripple: Accelerating LLM Inference on Smartphones with Correlation-Aware Neuron Management [22.908079935647073]
Large Language Models (LLMs) have achieved remarkable success across various domains, yet deploying them on mobile devices remains an arduous challenge.
We propose Ripple, a novel approach that accelerates LLM inference on smartphones by optimizing neuron placement in flash memory.
We demonstrate that Ripple achieves up to 5.93x improvements in I/O latency compared to the state-of-the-art.
arXiv Detail & Related papers (2024-10-25T03:01:19Z) - Resource Allocation for Stable LLM Training in Mobile Edge Computing [11.366306689957353]
This paper explores a collaborative training framework that integrates mobile users with edge servers to optimize resource allocation.
We formulate a multi-objective optimization problem to minimize the total energy consumption and delay during training.
We also address the common issue of instability in model performance by incorporating stability enhancements into our objective function.
arXiv Detail & Related papers (2024-09-30T12:36:27Z) - Resource Management for Low-latency Cooperative Fine-tuning of Foundation Models at the Network Edge [35.40849522296486]
Large-scale foundation models (FoMos) can perform human-like intelligence.
FoMos need to be adapted to specialized downstream tasks through fine-tuning techniques.
We advocate multi-device cooperation within the device-edge cooperative fine-tuning paradigm.
arXiv Detail & Related papers (2024-07-13T12:47:14Z) - MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases [81.70591346986582]
We introduce MobileAIBench, a benchmarking framework for evaluating Large Language Models (LLMs) and Large Multimodal Models (LMMs) on mobile devices.
MobileAIBench assesses models across different sizes, quantization levels, and tasks, measuring latency and resource consumption on real devices.
arXiv Detail & Related papers (2024-06-12T22:58:12Z) - SpaFL: Communication-Efficient Federated Learning with Sparse Models and Low computational Overhead [75.87007729801304]
SpaFL: a communication-efficient FL framework is proposed to optimize sparse model structures with low computational overhead.
Experiments show that SpaFL improves accuracy while requiring much less communication and computing resources compared to sparse baselines.
arXiv Detail & Related papers (2024-06-01T13:10:35Z) - On the Compressibility of Quantized Large Language Models [13.443384050034922]
Large Language Models (LLMs) are deployed on edge or mobile devices to offer enhanced data privacy and real-time processing capabilities.
LLMs may still be too big to fit entirely into the limited memory of edge or mobile devices and have to be partially loaded from the storage to complete the inference.
We study applying data compression techniques to reduce data movement and thus speed up the inference of quantized LLM on memory-constrained devices.
arXiv Detail & Related papers (2024-03-03T03:27:07Z) - Confidant: Customizing Transformer-based LLMs via Collaborative Edge
Training [18.526329975259483]
Transformer-based large language models (LLMs) have demonstrated impressive capabilities in a variety of natural language processing (NLP) tasks.
It is challenging to deploy and fine-tune LLMs on mobile edge devices with limited computing, memory, and energy budgets.
We propose Confidant, a multi-backend collaborative training framework for customizing state-of-the-art LLMs on commodity mobile devices.
arXiv Detail & Related papers (2023-11-22T13:20:59Z) - Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive
Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential vast untapped consumer-level GPU.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.