SoftmAP: Software-Hardware Co-design for Integer-Only Softmax on Associative Processors
- URL: http://arxiv.org/abs/2411.17847v1
- Date: Tue, 26 Nov 2024 20:00:54 GMT
- Title: SoftmAP: Software-Hardware Co-design for Integer-Only Softmax on Associative Processors
- Authors: Mariam Rakka, Jinhao Li, Guohao Dai, Ahmed Eltawil, Mohammed E. Fouda, Fadi Kurdahi
- Abstract summary: Non-linear operators like Softmax and Layernorm remain bottlenecks due to their sensitivity to quantization.
We propose SoftmAP, a software-hardware co-design methodology that implements an integer-only low-precision Softmax using In-Memory Compute (IMC) hardware.
- Score: 1.8999662338457695
- Abstract: Recent research efforts focus on reducing the computational and memory overheads of Large Language Models (LLMs) to make them feasible on resource-constrained devices. Despite advancements in compression techniques, non-linear operators like Softmax and Layernorm remain bottlenecks due to their sensitivity to quantization. We propose SoftmAP, a software-hardware co-design methodology that implements an integer-only low-precision Softmax using In-Memory Compute (IMC) hardware. Our method achieves up to three orders of magnitude improvement in the energy-delay product compared to A100 and RTX3090 GPUs, making LLMs more deployable without compromising performance.
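The abstract does not spell out SoftmAP's kernel, but the general shape of an integer-only Softmax can be illustrated. The sketch below follows the common fixed-point recipe (as in I-BERT-style methods, not necessarily SoftmAP's): subtract the max in integer arithmetic, convert to a base-2 exponent with a precomputed fixed-point constant, and approximate `2^f` with a linear term so everything reduces to integer adds and shifts. The fraction width `F` and the final float normalization are illustrative assumptions.

```python
import numpy as np

F = 16  # fixed-point fraction bits (illustrative choice)

def int_exp2(t_q):
    """Approximate 2**t for fixed-point Q.F inputs t_q <= 0 using only
    integer adds and shifts: 2^t = 2^i * 2^f with i = floor(t), f in [0,1)."""
    i = t_q >> F                 # integer part (arithmetic shift floors negatives)
    f = t_q - (i << F)           # fractional part in [0, 2^F)
    frac = (1 << F) + f          # linear approx: 2^f ~ 1 + f on [0, 1)
    return frac >> (-i)          # multiply by 2^i via right shift (i <= 0)

def int_softmax(x_q, scale):
    """Sketch of an integer-only softmax over int logits x_q with
    quantization step `scale`; not the actual SoftmAP kernel."""
    # s * log2(e) folded into one fixed-point constant at compile time
    c = int(round(scale * np.log2(np.e) * (1 << F)))
    t = (x_q - x_q.max()).astype(np.int64) * c   # base-2 exponents, <= 0, Q.F
    e = int_exp2(t)
    return e / e.sum()           # final normalization (float here for brevity)
```

On hardware the closing division would itself be an integer reciprocal; it is kept in floating point here only to keep the sketch short.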
Related papers
- Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs [0.8217552831952]
Large language models (LLMs) have transformed the way we think about language understanding and generation.
Group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process.
We present a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization of LLMs to better match non-uniform patterns in their weight distributions.
arXiv Detail & Related papers (2024-12-23T03:44:29Z) - DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance.
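The early-exit mechanism can be sketched in a few lines: run the stacked layers one at a time and stop as soon as an intermediate prediction is confident enough, so easy inputs activate fewer layers. The max-probability confidence test below is an illustrative assumption; DeeR's actual exit criteria and model are more involved.

```python
import numpy as np

def early_exit_forward(x, layers, classifier, threshold=0.9):
    """Dynamic early-exit sketch: stop deepening once the intermediate
    classifier's confidence (max softmax probability) clears a threshold."""
    depth = 0
    for depth, layer in enumerate(layers, start=1):
        x = layer(x)
        logits = classifier(x)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        if probs.max() >= threshold:
            break               # confident enough: skip remaining layers
    return x, depth
```

The compute saving is exactly the fraction of layers never executed, which is where the reported 5.2-6.5x cost reduction would come from in the full system.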
arXiv Detail & Related papers (2024-11-04T18:26:08Z) - Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding.
PMPD achieves a 1.4-12.2x speedup in matrix-vector multiplications over fp16 models.
Our approach delivers a throughput gain of 3.8-8.0x over fp16 models and up to 1.54x over uniform quantization approaches.
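The "progressive" part can be pictured as a precision schedule over decode steps: early tokens are generated with wider weights, later tokens with fewer bits. The phase boundaries and bit-widths below are purely illustrative, not PMPD's tuned values.

```python
def precision_schedule(step, total_steps,
                       phases=((0.3, 8), (0.7, 4), (1.0, 3))):
    """Progressive mixed-precision sketch: map decoding progress to a
    weight bit-width, lowering precision as generation proceeds."""
    frac = step / total_steps
    for boundary, bits in phases:
        if frac <= boundary:
            return bits
    return phases[-1][1]
```

Because decoding is memory-bound, each drop in bit-width directly shrinks the weight traffic per token, which is the source of the reported speedups.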
arXiv Detail & Related papers (2024-10-17T11:46:33Z) - SLaNC: Static LayerNorm Calibration [1.2016264781280588]
Quantization to lower precision formats naturally poses a number of challenges caused by the limited range of the available value representations.
In this article, we propose a computationally efficient scaling technique that can be easily applied to Transformer models during inference.
Our method suggests a straightforward way of scaling the LayerNorm inputs based on the static weights of the immediately preceding linear layers.
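The key property being exploited is that LayerNorm is invariant to a positive rescaling of its input, so a compile-time scale derived from the preceding linear layer's static weights can shrink the activation range into a low-precision format without changing the output. The sketch below uses the largest row norm as the amplification bound; that particular norm choice is an assumption, not necessarily SLaNC's formula.

```python
import numpy as np

def static_ln_scale(prev_weight, target=1.0):
    """Static calibration sketch: bound how much the preceding linear
    layer can amplify activations, then scale to a target range."""
    amp = np.linalg.norm(prev_weight, axis=1).max()
    return target / amp

def scaled_layernorm(x, s, eps=1e-5):
    # LayerNorm output is (up to eps) unchanged by a positive input scale s,
    # so s only moves the intermediate values into a representable range
    x = x * s
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)
```

Since the scale depends only on static weights, it can be folded in at compile time with no runtime calibration cost.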
arXiv Detail & Related papers (2024-10-14T14:32:55Z) - LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context-length limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
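The divide-reduce-merge loop can be sketched on plain token lists: split the input into chunks, prune a fraction of tokens in each chunk, then merge adjacent chunks pairwise and repeat until one chunk remains. Real HOMER prunes by attention-derived importance inside a transformer; the toy score below (the token's value) is a stand-in assumption.

```python
def hierarchical_merge(tokens, chunk_size=4, keep_ratio=0.5):
    """HOMER-style sketch: hierarchical pairwise merging of chunks, with
    token reduction before each merge so memory stays bounded."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    while len(chunks) > 1:
        # token reduction precedes each merge (toy importance = token value)
        reduced = []
        for c in chunks:
            k = max(1, int(len(c) * keep_ratio))
            keep = sorted(sorted(range(len(c)), key=lambda j: -c[j])[:k])
            reduced.append([c[j] for j in keep])  # keep original order
        # merge adjacent chunks pairwise, halving the chunk count
        chunks = [reduced[i] + (reduced[i + 1] if i + 1 < len(reduced) else [])
                  for i in range(0, len(reduced), 2)]
    return chunks[0]
```

Because each level prunes before merging, chunk length stays roughly constant while the tree height grows only logarithmically in the input length.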
arXiv Detail & Related papers (2024-04-16T06:34:08Z) - ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters [14.029865087214436]
The self-attention mechanism sets transformer-based large language models (LLMs) apart from convolutional and recurrent neural networks.
However, achieving real-time LLM inference on silicon remains challenging due to the extensive use of Softmax in self-attention.
We propose Constant Softmax (ConSmax), a software-hardware co-design that serves as an efficient alternative to Softmax.
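What makes a constant-parameter softmax hardware-friendly is that the two row-wide reductions of standard softmax (the max for stability, the sum for normalization) become learned constants, so every output element can be computed independently. The sketch below shows this structure; the beta/gamma values are trained in ConSmax, and any resemblance to its exact formulation is an assumption from the abstract.

```python
import numpy as np

def consmax(x, beta, gamma):
    """ConSmax-style sketch: softmax's per-row max and sum reductions are
    replaced by learned constants beta (shift) and gamma (normalizer),
    making each output element independent of the others."""
    return np.exp(x - beta) / gamma
```

If beta and gamma happened to equal the true row max and the true exponential sum, this reduces exactly to standard softmax; the design bet is that fixed learned values are close enough in practice.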
arXiv Detail & Related papers (2024-01-31T17:52:52Z) - A Speed Odyssey for Deployable Quantization of LLMs [19.12232212257625]
We introduce a hardware-centric approach in the construction of quantization algorithms.
Our method, OdysseyLLM, comes with a novel W4A8 kernel implementation called FastGEMM and a combined recipe of quantization strategies.
Experiments demonstrate the superiority of our W4A8 method, which delivers an actual speedup of up to 4x over Hugging Face FP16 and 2.23x over the state-of-the-art inference engine.
arXiv Detail & Related papers (2023-11-16T04:11:19Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.