PockEngine: Sparse and Efficient Fine-tuning in a Pocket
- URL: http://arxiv.org/abs/2310.17752v1
- Date: Thu, 26 Oct 2023 19:46:11 GMT
- Title: PockEngine: Sparse and Efficient Fine-tuning in a Pocket
- Authors: Ligeng Zhu, Lanxiang Hu, Ji Lin, Wei-Chen Wang, Wei-Ming Chen, Chuang
Gan, Song Han
- Abstract summary: We introduce PockEngine: a tiny, sparse and efficient engine to enable fine-tuning on various edge devices.
PockEngine supports sparse backpropagation and sparsely updates the model with measured memory saving and latency reduction.
Remarkably, PockEngine enables fine-tuning LLaMav2-7B on NVIDIA Jetson AGX Orin at 550 tokens/s, 7.9$\times$ faster than PyTorch.
- Score: 62.955793932377524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: On-device learning and efficient fine-tuning enable continuous and
privacy-preserving customization (e.g., locally fine-tuning large language
models on personalized data). However, existing training frameworks are
designed for cloud servers with powerful accelerators (e.g., GPUs, TPUs) and
lack the optimizations for learning on the edge, which faces challenges of
resource limitations and edge hardware diversity. We introduce PockEngine: a
tiny, sparse and efficient engine to enable fine-tuning on various edge
devices. PockEngine supports sparse backpropagation: it prunes the backward
graph and sparsely updates the model with measured memory saving and latency
reduction while maintaining model quality. Second, PockEngine is
compilation-first: the entire training graph (including forward, backward, and
optimization steps) is derived at compile time, which reduces runtime
overhead and opens up opportunities for graph transformations. PockEngine also
integrates a rich set of training-graph optimizations, including operator
reordering and backend switching, which further reduce training cost.
PockEngine supports diverse applications, frontends, and hardware
backends: it flexibly compiles and tunes models defined in
PyTorch/TensorFlow/Jax and deploys binaries to mobile CPU/GPU/DSPs. We
evaluated PockEngine on both vision models and large language models.
PockEngine achieves up to a 15$\times$ speedup over off-the-shelf TensorFlow
(Raspberry Pi) and 5.6$\times$ memory saving for backpropagation (Jetson AGX
Orin). Remarkably, PockEngine enables fine-tuning LLaMav2-7B on NVIDIA Jetson
AGX Orin at 550 tokens/s, 7.9$\times$ faster than PyTorch.
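To make the sparse-backpropagation idea concrete, here is a minimal PyTorch sketch (illustrative only, not PockEngine's actual API): freezing most parameters prunes the backward graph, so autograd neither computes nor stores gradients for the frozen layers, and the optimizer touches only the small set of updated parameters.

```python
import torch
import torch.nn as nn

# Toy model; the layer split below is a stand-in for the subset a real
# system would select by profiling memory/latency against accuracy.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Freeze everything, then re-enable only the last layers: autograd then
# prunes the backward graph and never materializes the frozen gradients.
for p in model.parameters():
    p.requires_grad_(False)
for p in model[2:].parameters():
    p.requires_grad_(True)

opt = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.01
)

x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()   # no weight gradients are computed for the frozen layers
opt.step()
opt.zero_grad()
```

In PockEngine, by contrast, the pruned training graph is derived once at compile time rather than decided per step, which avoids this runtime bookkeeping entirely.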
Related papers
- Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading [2.8231000588510757]
Transformers and large language models (LLMs) have seen rapid adoption in all domains.
Training transformers is very expensive and often hits a "memory wall".
We propose a novel technique to split the LLM into subgroups, whose update phase is scheduled on either the CPU or the GPU.
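A hedged sketch of that subgroup scheduling (the device split, group sizes, and update rule here are illustrative assumptions, not the paper's implementation): each subgroup's momentum state lives on one device and its update runs there, so CPU-resident state never occupies GPU memory.

```python
import torch

gpu = "cuda" if torch.cuda.is_available() else "cpu"

# Four parameter tensors standing in for LLM subgroups.
params = [torch.nn.Parameter(torch.randn(512, 512, device=gpu))
          for _ in range(4)]
loss = sum((p ** 2).sum() for p in params)
loss.backward()

# Subgroups 0-1 keep momentum on the GPU; subgroups 2-3 keep it on the
# CPU, so their update phase is scheduled on the CPU.
momentum = [torch.zeros(512, 512, device=gpu if i < 2 else "cpu")
            for i in range(4)]
lr, mu = 0.01, 0.9
with torch.no_grad():
    for i, p in enumerate(params):
        g = p.grad.to(momentum[i].device, non_blocking=True)
        momentum[i].mul_(mu).add_(g)                 # update state in place
        p.add_(momentum[i].to(p.device), alpha=-lr)  # copy result back
```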
arXiv Detail & Related papers (2024-10-26T00:43:59Z)
- DistTGL: Distributed Memory-Based Temporal Graph Neural Network Training [18.52206409432894]
DistTGL is an efficient and scalable solution to train memory-based TGNNs on distributed GPU clusters.
In experiments, DistTGL achieves near-linear convergence speedup, outperforming the state-of-the-art single-machine method by 14.5% in accuracy and 10.17x in training throughput.
arXiv Detail & Related papers (2023-07-14T22:52:27Z)
- InceptionNeXt: When Inception Meets ConvNeXt [167.61042926444105]
We build a series of networks, namely InceptionNeXt, which not only enjoy high throughput but also maintain competitive performance.
InceptionNeXt achieves 1.6x higher training throughput than ConvNeXt-T and a 0.2% top-1 accuracy improvement on ImageNet-1K.
arXiv Detail & Related papers (2023-03-29T17:59:58Z)
- RAF: Holistic Compilation for Deep Learning Model Training [17.956035630476173]
In this paper, we present RAF, a deep learning compiler for training.
Unlike existing DLCs, RAF accepts a forward model and generates the training graph itself.
RAF is able to systematically consolidate graph optimizations for performance, memory and distributed training.
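As a rough analogy for deriving a training graph from a forward-only model (using torch.func here, not RAF's own machinery, which the summary does not detail), the backward computation is obtained as a function transformation that a compiler can then optimize as a whole:

```python
import torch
from torch.func import functional_call, grad

model = torch.nn.Linear(16, 1)
params = dict(model.named_parameters())

# Forward-only definition: parameters in, loss out.
def loss_fn(params, x, y):
    pred = functional_call(model, params, (x,))
    return torch.nn.functional.mse_loss(pred, y)

# The gradient function is generated from the forward model; the full
# forward+backward graph it denotes could then be compiled ahead of time.
grad_fn = grad(loss_fn)

x, y = torch.randn(8, 16), torch.randn(8, 1)
grads = grad_fn(params, x, y)   # dict of gradients, same keys as params
```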
arXiv Detail & Related papers (2023-03-08T17:51:13Z)
- Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training [17.556432199389615]
Slapo is a schedule language that decouples the execution of a tensor-level operator from its arithmetic definition.
We show that Slapo can improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs.
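A toy illustration of that decoupling (hypothetical schedule names, not Slapo's real API): the arithmetic definition of each block is written once, while a separate schedule decides how it executes, e.g., whether its activations are recomputed in the backward pass:

```python
import torch
import torch.utils.checkpoint as ckpt

class Block(torch.nn.Module):
    """Arithmetic definition: fixed, written once."""
    def __init__(self):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(64, 256), torch.nn.GELU(),
            torch.nn.Linear(256, 64))

    def forward(self, x):
        return x + self.ff(x)

# Execution plan, kept apart from the definition and tunable
# progressively without touching the model code.
schedule = {"block.0": "checkpoint", "block.1": "eager"}

def run(blocks, x):
    for name, blk in blocks.items():
        if schedule.get(name) == "checkpoint":
            # Recompute activations in backward to trade compute for memory.
            x = ckpt.checkpoint(blk, x, use_reentrant=False)
        else:
            x = blk(x)
    return x

blocks = {"block.0": Block(), "block.1": Block()}
out = run(blocks, torch.randn(4, 64, requires_grad=True))
out.sum().backward()
```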
arXiv Detail & Related papers (2023-02-16T00:34:53Z)
- Cramming: Training a Language Model on a Single GPU in One Day [64.18297923419627]
Recent trends in language modeling have focused on increasing performance through scaling.
We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU.
We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings.
arXiv Detail & Related papers (2022-12-28T18:59:28Z)
- SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation [100.89770978711464]
We present SegNeXt, a simple convolutional network architecture for semantic segmentation.
We show that convolutional attention is a more efficient and effective way to encode contextual information than the self-attention mechanism in transformers.
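A simplified sketch of a convolutional-attention block in this spirit (the kernel sizes and structure below are illustrative, not SegNeXt's exact design): attention weights come from cheap depthwise convolutions instead of quadratic-cost self-attention and gate the input elementwise:

```python
import torch
import torch.nn as nn

class ConvAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        # Strip convolutions approximate large kernels at low cost.
        self.dw_h = nn.Conv2d(dim, dim, (1, 7), padding=(0, 3), groups=dim)
        self.dw_v = nn.Conv2d(dim, dim, (7, 1), padding=(3, 0), groups=dim)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.dw(x)
        attn = attn + self.dw_h(attn) + self.dw_v(attn)
        return self.proj(attn) * x   # elementwise gating of the input

x = torch.randn(1, 32, 56, 56)
print(ConvAttention(32)(x).shape)    # torch.Size([1, 32, 56, 56])
```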
arXiv Detail & Related papers (2022-09-18T14:33:49Z)
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
- When deep learning models on GPU can be accelerated by taking advantage of unstructured sparsity [0.0]
This paper focuses on improving the efficiency of sparse convolutional neural network (CNN) layers on graphics processing units (GPUs).
Modern CNN models need megabytes of coefficients and millions of MAC operations to perform convolution.
We show when it is worth using a direct sparse operation to speed up the computation of convolution layers.
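A small sketch of that trade-off (the sparsity level and sizes are illustrative): a convolution lowered to a matrix multiply via im2col can apply its weight matrix as a direct sparse operation, which only pays off beyond some sparsity threshold that depends on the hardware:

```python
import torch

out_ch, k = 64, 3 * 3 * 64             # output channels, im2col patch size
cols = 32 * 32                          # spatial positions
w = torch.randn(out_ch, k)
w[torch.rand_like(w) > 0.1] = 0.0       # ~90% unstructured sparsity
x = torch.randn(k, cols)                # unfolded input patches

dense_out = w @ x                                # dense kernel
sparse_out = torch.sparse.mm(w.to_sparse(), x)   # direct sparse operation
print(torch.allclose(dense_out, sparse_out, atol=1e-4))  # True
```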
arXiv Detail & Related papers (2020-11-12T10:13:48Z)
- Real-Time Execution of Large-scale Language Models on Mobile [49.32610509282623]
We find the best BERT model structure for a given computation size to match specific devices.
Our framework can guarantee the identified model to meet both resource and real-time specifications of mobile devices.
Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base.
arXiv Detail & Related papers (2020-09-15T01:59:17Z)