FluidML: Fast and Memory Efficient Inference Optimization
- URL: http://arxiv.org/abs/2411.09242v1
- Date: Thu, 14 Nov 2024 07:16:23 GMT
- Title: FluidML: Fast and Memory Efficient Inference Optimization
- Authors: Jinjie Liu, Hang Qiu,
- Abstract summary: We present FluidML, a generic runtime memory management and optimization framework.
We show that FluidML can consistently reduce the end-to-end inference latency by up to 25.38% for popular language models.
We also show that FluidML can reduce peak memory usage by up to 41.47%, compared to state-of-the-art approaches.
- Score: 3.7676096626244986
- License:
- Abstract: Machine learning models deployed on edge devices have enabled numerous exciting new applications, such as humanoid robots, AR glasses, and autonomous vehicles. However, the computing resources available on these edge devices are not catching up with the ever-growing number of parameters in these models. As the models become bigger and more complicated, the novel yet sophisticated structure challenges the inference runtime optimization. We present FluidML, a generic runtime memory management and optimization framework that can flexibly transform the model execution blueprint to achieve faster and more memory-efficient inference. Evaluations across different platforms show that FluidML can consistently reduce the end-to-end inference latency by up to 25.38% for popular language models and reduce peak memory usage by up to 41.47%, compared to state-of-the-art approaches. FluidML is of ~30K line of codes, built for general-purpose usage, and will be released as an open-source inference runtime optimization framework to the community.
Related papers
- Sparse Gradient Compression for Fine-Tuning Large Language Models [58.44973963468691]
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models.
High memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size.
We propose sparse compression gradient (SGC) to address these limitations.
arXiv Detail & Related papers (2025-02-01T04:18:28Z) - Resource-Efficient Transformer Architecture: Optimizing Memory and Execution Time for Real-Time Applications [0.1874930567916036]
This paper describes a memory-efficient transformer model designed to drive a reduction in memory usage and execution time by substantial orders of magnitude without impairing the model's performance near that of the original model.
Results include a 52% reduction in memory usage and a 33% decrease in execution time, resulting in better efficiency than state-of-the-art models.
arXiv Detail & Related papers (2024-12-25T14:41:23Z) - MOFHEI: Model Optimizing Framework for Fast and Efficient Homomorphically Encrypted Neural Network Inference [0.8388591755871735]
Homomorphic Encryption (HE) enables us to perform machine learning tasks over encrypted data.
We propose MOFHEI, a framework that optimize the model to make HE-based neural network inference, fast and efficient.
Our framework achieves up to 98% pruning ratio on LeNet, eliminating up to 93% of the required HE operations for performing PI.
arXiv Detail & Related papers (2024-12-10T22:44:54Z) - Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and.
Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting.
LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines [6.381783966294295]
Open-source large language models (LLMs) enable developers to create AI-based solutions while maintaining control over aspects such as privacy and compliance.
We analyze the performance, particularly the throughput (tokens generated per unit of time) of 20 LLMs using two inference libraries: vLLM and HuggingFace's pipelines.
arXiv Detail & Related papers (2024-08-02T06:56:59Z) - Online Adaptation of Language Models with a Memory of Amortized Contexts [82.02369596879817]
Memory of Amortized Contexts (MAC) is an efficient and effective online adaptation framework for large language models.
We show how MAC can be combined with and improve the performance of popular alternatives such as retrieval augmented generations.
arXiv Detail & Related papers (2024-03-07T08:34:57Z) - LMUFormer: Low Complexity Yet Powerful Spiking Model With Legendre
Memory Units [5.830814457423021]
Transformer models have demonstrated high accuracy in numerous applications but have high complexity and lack sequential processing capability.
We show how architectural modifications to a recurrent model can help push its performance toward Transformer models.
We present a spiking version of this architecture, which introduces the benefit of states within the patch embedding and channel mixer modules.
arXiv Detail & Related papers (2024-01-20T01:10:18Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - Cheaply Evaluating Inference Efficiency Metrics for Autoregressive
Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Efficiently Scaling Transformer Inference [8.196193683641582]
We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings.
We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices.
We achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens.
arXiv Detail & Related papers (2022-11-09T18:50:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.