Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices
- URL: http://arxiv.org/abs/2409.04249v2
- Date: Mon, 9 Sep 2024 18:25:01 GMT
- Title: Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices
- Authors: Xueyuan Han, Zinuo Cai, Yichu Zhang, Chongxin Fan, Junhan Liu, Ruhui Ma, Rajkumar Buyya,
- Abstract summary: This paper introduces PIPELOAD, a memory-efficient pipeline execution mechanism.
It reduces memory usage by incorporating dynamic memory management and minimizes inference latency.
We present Hermes, a framework optimized for large model inference on edge devices.
- Score: 19.96064012736243
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The application of Transformer-based large models has achieved numerous success in recent years. However, the exponential growth in the parameters of large models introduces formidable memory challenge for edge deployment. Prior works to address this challenge mainly focus on optimizing the model structure and adopting memory swapping methods. However, the former reduces the inference accuracy, and the latter raises the inference latency. This paper introduces PIPELOAD, a novel memory-efficient pipeline execution mechanism. It reduces memory usage by incorporating dynamic memory management and minimizes inference latency by employing parallel model loading. Based on PIPELOAD mechanism, we present Hermes, a framework optimized for large model inference on edge devices. We evaluate Hermes on Transformer-based models of different sizes. Our experiments illustrate that Hermes achieves up to 4.24 X increase in inference speed and 86.7% lower memory consumption than the state-of-the-art pipeline mechanism for BERT and ViT models, 2.58 X increase in inference speed and 90.3% lower memory consumption for GPT-style models.
Related papers
- Memory Caching: RNNs with Growing Memory [56.25483647131372]
We introduce Memory Caching (MC), a technique that enhances recurrent models by caching checkpoints of memory states (a.k.a. hidden states)<n>We propose four variants of MC, including gated aggregation and sparse selective mechanisms, and discuss their implications on both linear and deep memory modules.<n>The results indicate that while Transformers achieve the best accuracy, our MC variants show competitive performance, close the gap with Transformers, and performs better than state-of-the-art recurrent models.
arXiv Detail & Related papers (2026-02-27T18:53:41Z) - ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling [56.88966608455977]
ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters.<n>ZipMoE achieves up to $72.77%$ inference latency reduction and up to $6.76times$ higher throughput than the state-of-the-art systems.
arXiv Detail & Related papers (2026-01-29T02:51:59Z) - Artificial Hippocampus Networks for Efficient Long-Context Modeling [17.23148291364832]
Long-sequence modeling faces a trade-off between the efficiency of compressive fixed-size memory in RNN-like models and the fidelity of growing memory in attention-based Transformers.<n>Inspired by the Multi-Store Model in cognitive science, we introduce a memory framework of artificial neural networks.<n>Experiments on long-context benchmarks LV-Eval and InfiniteBench demonstrate that AHN-augmented models consistently outperform sliding window baselines.
arXiv Detail & Related papers (2025-10-08T17:59:55Z) - UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning [22.029614513198663]
Memory-layer architectures offer an appealing alternative with very few memory access.<n>We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap.<n>We demonstrate that UltraMemV2 performance parity with 8-expert MoE models under same computation and parameters but significantly low memory access.
arXiv Detail & Related papers (2025-08-26T07:33:11Z) - Sparse Gradient Compression for Fine-Tuning Large Language Models [58.44973963468691]
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models.
High memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size.
We propose sparse compression gradient (SGC) to address these limitations.
arXiv Detail & Related papers (2025-02-01T04:18:28Z) - Memory Layers at Scale [67.00854080570979]
This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale.
On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the budget, as well as mixture-of-expert models when matched for both compute and parameters.
We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
arXiv Detail & Related papers (2024-12-12T23:56:57Z) - Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient [52.96232442322824]
Collaborative Decoding (CoDe) is a novel efficient decoding strategy tailored for the Visual Auto-Regressive ( VAR) framework.
CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales.
CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98.
arXiv Detail & Related papers (2024-11-26T15:13:15Z) - Ultra-Sparse Memory Network [8.927205198458994]
This work introduces UltraMem, incorporating large-scale, ultra-sparse memory layer to address these limitations.
We show that our method achieves state-of-the-art inference speed and model performance within a given computational budget.
arXiv Detail & Related papers (2024-11-19T09:24:34Z) - FluidML: Fast and Memory Efficient Inference Optimization [3.7676096626244986]
We present FluidML, a generic runtime memory management and optimization framework.
We show that FluidML can consistently reduce the end-to-end inference latency by up to 25.38% for popular language models.
We also show that FluidML can reduce peak memory usage by up to 41.47%, compared to state-of-the-art approaches.
arXiv Detail & Related papers (2024-11-14T07:16:23Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z) - AI and Memory Wall [81.06494558184049]
We show how memory bandwidth can become the dominant bottleneck for decoder models.
We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
arXiv Detail & Related papers (2024-03-21T04:31:59Z) - E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive.
We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation.
Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Efficiently Scaling Transformer Inference [8.196193683641582]
We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings.
We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices.
We achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens.
arXiv Detail & Related papers (2022-11-09T18:50:38Z) - DeepSpeed Inference: Enabling Efficient Inference of Transformer Models
at Unprecedented Scale [20.558091867632445]
DeepSpeed Inference is a comprehensive system solution for transformer model inference.
It reduces latency by up to 7.3X over the state-of-the-art for latency-oriented scenarios and increases throughput by over 1.5x for throughput-oriented scenarios.
It can inference 25x larger models than with GPU-only solutions, while delivering a high throughput of 84 TFLOPS (over $50%$ of A6000 peak)
arXiv Detail & Related papers (2022-06-30T18:01:08Z) - A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental
Learning [56.450090618578]
Class-Incremental Learning (CIL) aims to train a model with limited memory size to meet this requirement.
We show that when counting the model size into the total budget and comparing methods with aligned memory size, saving models do not consistently work.
We propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel.
arXiv Detail & Related papers (2022-05-26T08:24:01Z) - Memory-efficient Speech Recognition on Smart Devices [15.015948023187809]
Recurrent transducer models have emerged as a promising solution for speech recognition on smart devices.
These models access parameters from off-chip memory for every input time step which adversely effects device battery life and limits their usability on low-power devices.
We address transducer model's memory access concerns by optimizing their model architecture and designing novel recurrent cell designs.
arXiv Detail & Related papers (2021-02-23T07:43:45Z) - Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.