Related papers: HERA: High-efficiency Matrix Compression via Element Replacement

HERA: High-efficiency Matrix Compression via Element Replacement

URL: http://arxiv.org/abs/2407.03637v1
Date: Thu, 4 Jul 2024 05:13:58 GMT
Title: HERA: High-efficiency Matrix Compression via Element Replacement
Authors: Yanshu Wang, Wang Li, Tong Yang,
Abstract summary: Large Language Models (LLMs) have advanced natural language processing tasks such as machine translation, text generation, and sentiment analysis. Their large size, often consisting of billions of parameters, poses challenges for storage, computation, and deployment. We propose HERA, a novel algorithm that employs Element Replacement for compressing matrix.
Score: 5.858734684979008
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have significantly advanced natural language processing tasks such as machine translation, text generation, and sentiment analysis. However, their large size, often consisting of billions of parameters, poses challenges for storage, computation, and deployment, particularly in resource-constrained environments like mobile devices and edge computing platforms. Additionally, the key-value (k-v) cache used to speed up query processing requires substantial memory and storage, exacerbating these challenges. Vector databases have emerged as a crucial technology to efficiently manage and retrieve the high-dimensional vectors produced by LLMs, facilitating faster data access and reducing computational demands. Effective compression and quantization techniques are essential to address these challenges, as they reduce the memory footprint and computational requirements without significantly compromising performance. Traditional methods that uniformly map parameters to compressed spaces often fail to account for the uneven distribution of parameters, leading to considerable accuracy loss. Therefore, innovative approaches are needed to achieve better compression ratios while preserving model performance. In this work, we propose HERA, a novel algorithm that employs heuristic Element Replacement for compressing matrix. HERA systematically replaces elements within the model using heuristic methods, which simplifies the structure of the model and makes subsequent compression more effective. By hierarchically segmenting, compressing, and reorganizing the matrix dataset, our method can effectively reduce the quantization error to 12.3% of the original at the same compression ratio.

Related papers

SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios. In the early route, intermediate outputs are consolidated via an anti-redundancy operation. In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models [14.310720048047136]
ALPS is an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned gradient conjugate-based post-processing step. Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency. On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.
arXiv Detail & Related papers (2024-06-12T02:57:41Z)
Athena: Efficient Block-Wise Post-Training Quantization for Large Language Models Using Second-Order Matrix Derivative Information [5.756323337411276]
Large Language Models (LLMs) have advanced natural language processing tasks such as machine translation, text generation, and sentiment analysis. Their large size, often consisting of billions of parameters, poses challenges for storage, computation, and deployment. We propose Athena, a novel algorithm for efficient block-wise post-training quantization of LLMs.
arXiv Detail & Related papers (2024-05-24T03:14:29Z)
A Survey on Transformer Compression [84.18094368700379]
Transformer plays a vital role in the realms of natural language processing (NLP) and computer vision (CV) Model compression methods reduce the memory and computational cost of Transformer. This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z)
LoMA: Lossless Compressed Memory Attention [0.0]
Lossless Compressed Memory Attention (LoMA) is a novel approach to reduce memory and computational demands during autoregressive generation. LoMA incorporates a specialized training or fine-tuning precedure alongside an autoregressive generation algorithm optimized for the compressed context. Experimental validation has demonstrated that LoMA significantly reducing computational consumption and memory usage.
arXiv Detail & Related papers (2024-01-16T09:18:46Z)
Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
AQLM is first scheme that is optimal in terms of accuracy-vs-model-size when compressing to less than 3 bits per parameter. We provide fast GPU and CPU implementations of AQLM for token generation.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models [9.91972450276408]
This paper introduces an innovative approach for the parametric and practical compression of Large Language Models (LLMs) based on reduced order modelling. Our method represents a significant advancement in model compression by leveraging matrix decomposition, demonstrating superior efficacy compared to the prevailing state-of-the-art structured pruning method.
arXiv Detail & Related papers (2023-12-12T07:56:57Z)
Blockwise Compression of Transformer-based Models without Retraining [6.118476907408718]
We propose BCT, a framework of blockwise compression for transformers without retraining. Unlike layerwise compression methods, BCT achieves finer compression of the entire transformer by operating blockwise. BCT effectively compresses all components of the model, including but not limited to the embedding, matrix multiplication, GELU, Softmax, layer normalization, and intermediate results.
arXiv Detail & Related papers (2023-04-04T02:55:40Z)
Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks. We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression [56.83861738731913]
Deep Neural Network (DNN) models are essential for practical applications, especially for resource limited devices. Previous unstructured or structured weight pruning methods can hardly truly accelerate inference. We propose a generalized weight unification framework at a hardware compatible micro-structured level to achieve high amount of compression and acceleration.
arXiv Detail & Related papers (2021-06-15T17:22:59Z)
Neural Network Compression for Noisy Storage Devices [71.4102472611862]
Conventionally, model compression and physical storage are decoupled. This approach forces the storage to treat each bit of the compressed model equally, and to dedicate the same amount of resources to each bit. We propose a radically different approach that: (i) employs analog memories to maximize the capacity of each memory cell, and (ii) jointly optimize model compression and physical storage to maximize memory utility.
arXiv Detail & Related papers (2021-02-15T18:19:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.