Optimization of Armv9 architecture general large language model inference performance based on Llama.cpp
- URL: http://arxiv.org/abs/2406.10816v1
- Date: Sun, 16 Jun 2024 06:46:25 GMT
- Title: Optimization of Armv9 architecture general large language model inference performance based on Llama.cpp
- Authors: Longhao Chen, Yina Zhao, Qiangjun Xie, Qinghua Sheng
- Abstract summary: This article optimizes the inference performance of the Qwen-1.8B model by performing Int8 quantization, vectorizing some operators in llama.cpp, and modifying the compilation script.
On the Yitian 710 experimental platform, prefill performance improves by a factor of 1.6, decoding performance by a factor of 24, memory usage drops to one fifth of the original, and the accuracy loss is almost negligible.
- Score: 0.3749861135832073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This article optimizes the inference performance of the Qwen-1.8B model by performing Int8 quantization, vectorizing some operators in llama.cpp, and modifying the compilation script to raise the compiler optimization level. On the Yitian 710 experimental platform, prefill performance improves by a factor of 1.6, decoding performance by a factor of 24, memory usage is reduced to one fifth of the original, and the accuracy loss is almost negligible.
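The two code-level levers named in the abstract, block-wise Int8 quantization and vectorized Int8 operators, can be pictured with a short C sketch. This is a minimal, hypothetical illustration assuming a ggml Q8_0-style layout (blocks of 32 values with one scale per block) and the Arm dot-product NEON extension available on Armv9 cores; it is not the paper's actual code, and real ggml blocks store the scale in fp16 rather than float.

```c
#include <arm_neon.h>
#include <math.h>
#include <stddef.h>
#include <stdint.h>

#define QBLOCK 32  /* block size, as in ggml's Q8_0 format */

typedef struct {
    float  scale;       /* per-block scale (real ggml stores this as fp16) */
    int8_t q[QBLOCK];   /* quantized values */
} block_q8;

/* Quantize n_blocks * QBLOCK floats to Int8, with one scale per block. */
void quantize_q8(const float *x, block_q8 *out, size_t n_blocks) {
    for (size_t b = 0; b < n_blocks; ++b) {
        const float *xb = x + b * QBLOCK;
        float amax = 0.0f;                      /* max |x| within the block */
        for (int i = 0; i < QBLOCK; ++i) {
            const float ax = fabsf(xb[i]);
            if (ax > amax) amax = ax;
        }
        const float scale = amax / 127.0f;      /* map [-amax, amax] to [-127, 127] */
        const float inv   = scale != 0.0f ? 1.0f / scale : 0.0f;
        out[b].scale = scale;
        for (int i = 0; i < QBLOCK; ++i)
            out[b].q[i] = (int8_t)roundf(xb[i] * inv);
    }
}

/* Dot product of two quantized blocks using the NEON SDOT instruction
 * (requires the dotprod feature, e.g. -march=armv8.2-a+dotprod or -march=armv9-a). */
float vec_dot_q8(const block_q8 *a, const block_q8 *b) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < QBLOCK; i += 16) {
        const int8x16_t va = vld1q_s8(a->q + i);
        const int8x16_t vb = vld1q_s8(b->q + i);
        acc = vdotq_s32(acc, va, vb);           /* 16 int8 multiply-accumulates per call */
    }
    /* horizontal sum of the accumulator, then undo both block scales */
    return (float)vaddvq_s32(acc) * a->scale * b->scale;
}
```

The third lever, the compilation-script change, amounts to building llama.cpp with a higher optimization level and an architecture-specific target (for example -O3 together with an Armv9-targeted -march/-mcpu setting) so that instructions like SDOT are enabled and the remaining scalar loops can be auto-vectorized; the exact flags used in the paper are not reproduced here.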
Related papers
- Iterative or Innovative? A Problem-Oriented Perspective for Code Optimization [81.88668100203913]
Large language models (LLMs) have demonstrated strong capabilities in solving a wide range of programming tasks.
In this paper, we explore code optimization with a focus on performance enhancement, specifically aiming to optimize code for minimal execution time.
arXiv Detail & Related papers (2024-06-17T16:10:10Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - Large Language Models for Compiler Optimization [22.52765975286403]
We present a transformer model trained from scratch to optimize LLVM assembly for code size.
We ask the model to predict the instruction counts before and after optimization, and the optimized code itself.
Our approach achieves a 3.0% improvement in reducing instruction counts over the compiler.
arXiv Detail & Related papers (2023-09-11T22:11:46Z) - CHERI Performance Enhancement for a Bytecode Interpreter [0.0]
We show that it is possible to eliminate certain kinds of software-induced runtime overhead that occur due to the larger size of CHERI capabilities (128 bits) relative to native pointers (generally 64 bits).
The worst-case slowdowns are greatly improved, from 100x (before optimization) to 2x (after optimization).
arXiv Detail & Related papers (2023-08-09T17:12:23Z) - Slapo: A Schedule Language for Progressive Optimization of Large Deep
Learning Model Training [17.556432199389615]
Slapo is a schedule language that decouples the execution of a tensor-level operator from its arithmetic definition.
We show that Slapo can improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs.
arXiv Detail & Related papers (2023-02-16T00:34:53Z) - Learning Performance-Improving Code Edits [107.21538852090208]
We introduce a framework for adapting large language models (LLMs) to high-level program optimization.
First, we curate a dataset of over 77,000 competitive C++ programming submission pairs containing performance-improving edits made by human programmers.
For prompting, we use retrieval-based few-shot prompting and chain-of-thought; for finetuning, we use performance-conditioned generation and synthetic data augmentation based on self-play.
arXiv Detail & Related papers (2023-02-15T18:59:21Z) - LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
arXiv Detail & Related papers (2022-08-15T17:08:50Z) - 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z) - Learning to Superoptimize Real-world Programs [79.4140991035247]
We propose a framework to learn to superoptimize real-world programs by using neural sequence-to-sequence models.
We introduce the Big Assembly benchmark, a dataset consisting of over 25K real-world functions mined from open-source projects in x86-64 assembly.
arXiv Detail & Related papers (2021-09-28T05:33:21Z)