Related papers: Low-Rank Compression for IMC Arrays

Low-Rank Compression for IMC Arrays

URL: http://arxiv.org/abs/2502.07820v1
Date: Mon, 10 Feb 2025 08:57:39 GMT
Title: Low-Rank Compression for IMC Arrays
Authors: Kang Eun Jeon, Johnny Rhe, Jong Hwan Ko,
Abstract summary: Low-rank compression presents its own set of challenges, namely i.e. suboptimal IMC array utilization and ii. compromised accuracy. Our proposed method achieves up to 2.5x speedup or +20.9% accuracy boost over existing pruning techniques.
Score: 7.315486393304946
License:
Abstract: In this study, we address the challenge of low-rank model compression in the context of in-memory computing (IMC) architectures. Traditional pruning approaches, while effective in model size reduction, necessitate additional peripheral circuitry to manage complex dataflows and mitigate dislocation issues, leading to increased area and energy overheads. To circumvent these drawbacks, we propose leveraging low-rank compression techniques, which, unlike pruning, streamline the dataflow and seamlessly integrate with IMC architectures. However, low-rank compression presents its own set of challenges, namely i) suboptimal IMC array utilization and ii) compromised accuracy. To address these issues, we introduce a novel approach i) employing shift and duplicate kernel (SDK) mapping technique, which exploits idle IMC columns for parallel processing, and ii) group low-rank convolution, which mitigates the information imbalance in the decomposed matrices. Our experimental results demonstrate that our proposed method achieves up to 2.5x speedup or +20.9% accuracy boost over existing pruning techniques.

Related papers

Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "line theoremarity" establishing a direct relationship between the layer-wise $ell$ reconstruction error and the model perplexity increase due to quantization. This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
arXiv Detail & Related papers (2024-11-26T15:35:44Z)
Minimum Entropy Coupling with Bottleneck [22.409716686394525]
This paper investigates a novel lossy compression framework operating under logarithmic loss. It is especially relevant for applications that require joint compression and retrieval, and in scenarios involving distributional shifts due to processing. We show that the proposed formulation extends the classical minimum entropy coupling framework by integrating a bottleneck.
arXiv Detail & Related papers (2024-10-29T02:19:07Z)
Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders. We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs) Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
Communication-Efficient Distributed Learning with Local Immediate Error Compensation [95.6828475028581]
We propose the Local Immediate Error Compensated SGD (LIEC-SGD) optimization algorithm. LIEC-SGD is superior to previous works in either the convergence rate or the communication cost.
arXiv Detail & Related papers (2024-02-19T05:59:09Z)
Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models [9.91972450276408]
This paper introduces an innovative approach for the parametric and practical compression of Large Language Models (LLMs) based on reduced order modelling. Our method represents a significant advancement in model compression by leveraging matrix decomposition, demonstrating superior efficacy compared to the prevailing state-of-the-art structured pruning method.
arXiv Detail & Related papers (2023-12-12T07:56:57Z)
Low-Rank Prune-And-Factorize for Language Model Compression [18.088550230146247]
Matrix factorization fails to retain satisfactory performance under moderate to high compression rate. We propose two techniques: sparsity-aware SVD and mixed-rank fine-tuning.
arXiv Detail & Related papers (2023-06-25T07:38:43Z)
Error Feedback Can Accurately Compress Preconditioners [43.60787513716217]
Leveraging second-order information about the loss at the scale of deep networks is one of the main lines of approach for improving the performance of currents for deep learning. Yet, existing approaches for accurate full-matrix preconditioning, such as Full-Matrix Adagrad (GGT) or Matrix-Free Approximate Curvature (M-FAC) suffer from massive storage costs when applied even to small-scale models. In this paper, we address this issue via a novel and efficient error-feedback technique that can be applied to compress preconditioners by up to two orders of magnitude in practice, without loss of convergence.
arXiv Detail & Related papers (2023-06-09T17:58:47Z)
Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression [56.83861738731913]
Deep Neural Network (DNN) models are essential for practical applications, especially for resource limited devices. Previous unstructured or structured weight pruning methods can hardly truly accelerate inference. We propose a generalized weight unification framework at a hardware compatible micro-structured level to achieve high amount of compression and acceleration.
arXiv Detail & Related papers (2021-06-15T17:22:59Z)
Structured Sparsification with Joint Optimization of Group Convolution and Channel Shuffle [117.95823660228537]
We propose a novel structured sparsification method for efficient network compression. The proposed method automatically induces structured sparsity on the convolutional weights. We also address the problem of inter-group communication with a learnable channel shuffle mechanism.
arXiv Detail & Related papers (2020-02-19T12:03:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.