LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration
        - URL: http://arxiv.org/abs/2408.06003v1
 - Date: Mon, 12 Aug 2024 08:52:14 GMT
 - Title: LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration
 - Authors: Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang
 - Abstract summary: Mixed-precision matrix multiplication (mpGEMM) is a crucial yet under-explored operation that involves multiplying lower-precision weights with higher-precision activations.
Current hardware does not support mpGEMM, resulting in indirect and inefficient dequantization-based implementations.
We introduce LUT Tensor Core, a software-hardware co-design optimized for low-bit LLM inference.
 - Score: 10.608817382813786
 - License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
 - Abstract: As large language model (LLM) inference demands ever-greater resources, there is a rapidly growing trend of using low-bit weights to shrink memory usage and boost inference efficiency. However, these low-bit LLMs introduce the need for mixed-precision matrix multiplication (mpGEMM), a crucial yet under-explored operation that involves multiplying lower-precision weights with higher-precision activations. Unfortunately, current hardware does not natively support mpGEMM, resulting in indirect and inefficient dequantization-based implementations. To address the mpGEMM requirements in low-bit LLMs, we explored the lookup table (LUT)-based approach for mpGEMM. However, a conventional LUT implementation falls short of its potential. To fully harness the power of LUT-based mpGEMM, we introduce LUT Tensor Core, a software-hardware co-design optimized for low-bit LLM inference. Specifically, we introduce software-based operator fusion and table symmetrization techniques to optimize table precomputation and table storage, respectively. LUT Tensor Core then adopts a hardware design featuring an elongated tiling shape to enhance table reuse and a bit-serial design to support various precision combinations in mpGEMM. Moreover, we design an end-to-end compilation stack with new instructions for LUT-based mpGEMM, enabling efficient LLM compilation and optimization. Evaluation on low-bit LLMs (e.g., BitNet, LLaMA) shows that LUT Tensor Core achieves more than an order of magnitude improvement in both compute density and energy efficiency.
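To make the LUT-based mpGEMM idea concrete, here is a minimal NumPy sketch (my own illustration, not the paper's kernel; `lut_mpgemm_1bit` and all shapes are hypothetical): partial sums of each small group of activations are precomputed for every possible low-bit weight pattern, so the inner loop of the matrix multiply becomes table lookups and additions instead of multiply-accumulates.

```python
import numpy as np

def lut_mpgemm_1bit(A, W_bits, group=4):
    """Illustrative LUT-based mpGEMM for 1-bit (+/-1) weights.

    A      : (K,) float activations (higher precision)
    W_bits : (N, K) array of 0/1 weight bits, bit b encoding weight (2*b - 1)
    group  : activations per table; each table has 2**group entries
    """
    K = A.shape[0]
    assert K % group == 0
    num_groups = K // group

    # Precompute: for every activation group, the partial dot product with
    # every possible +/-1 pattern of length `group` (2**group table entries).
    patterns = np.array([[1 if (p >> i) & 1 else -1 for i in range(group)]
                         for p in range(2 ** group)], dtype=A.dtype)   # (2**g, g)
    tables = A.reshape(num_groups, group) @ patterns.T                 # (num_groups, 2**g)

    # mpGEMM: each output element reduces to table lookups plus additions,
    # with no multiplications in the inner loop.
    out = np.zeros(W_bits.shape[0], dtype=A.dtype)
    for n in range(W_bits.shape[0]):
        bits = W_bits[n].reshape(num_groups, group)
        idx = (bits << np.arange(group)).sum(axis=1)      # g-bit table index per group
        out[n] = tables[np.arange(num_groups), idx].sum()
    return out

# Reference check against a plain dot product.
A = np.random.randn(16).astype(np.float32)
W_bits = np.random.randint(0, 2, size=(3, 16))
W = 2 * W_bits - 1
assert np.allclose(lut_mpgemm_1bit(A, W_bits), W @ A, atol=1e-5)
```

Multi-bit weights can be handled by repeating the lookups per bit plane and scaling by powers of two, which is the kind of precision flexibility the paper's bit-serial hardware design targets; the software techniques above (operator fusion, table symmetrization) attack the cost of building and storing these tables.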
 
       
      
        Related papers
        - Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints [14.341123057506827]
Large Language Models (LLMs) are indispensable in today's applications, but their inference procedure demands significant computational resources.
This paper formulates LLM inference optimization as a multi-stage online scheduling problem.
We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design.
arXiv  Detail & Related papers  (2025-04-15T16:00:21Z) - NeuraLUT-Assemble: Hardware-aware Assembling of Sub-Neural Networks for Efficient LUT Inference [2.7086888205833968]
Efficient neural networks (NNs) leveraging lookup tables (LUTs) have demonstrated significant potential for emerging AI applications. Existing LUT-based designs suffer accuracy degradation because the large fan-in required by neurons is limited by the exponential scaling of LUT resources with input width. We present NeuraLUT-Assemble, a novel framework that addresses these limitations by combining mixed-precision techniques with the assembly of larger neurons from smaller units.
arXiv  Detail & Related papers  (2025-04-01T09:52:38Z) - SparseLUT: Sparse Connectivity Optimization for Lookup Table-based Deep Neural Networks [0.0]
This paper introduces SparseLUT, a connectivity-centric training technique tailored for LUT-based deep neural networks (DNNs). Experimental results show consistent accuracy improvements across benchmarks, including up to a 2.13% increase on MNIST. This is done without any hardware overhead and achieves state-of-the-art results for LUT-based DNNs.
arXiv  Detail & Related papers  (2025-03-17T05:21:54Z) - Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels [12.77187564450236]
We introduce XY-Serve, a versatile, Ascend native, end-to-end production large language model (LLM) serving system.
The core idea is an abstraction mechanism that smooths out the workload variability by decomposing computations into fine-grained meta primitives.
For GEMM, we introduce a virtual padding scheme that adapts to dynamic shape changes while using highly efficient GEMM primitives with assorted fixed tile sizes.
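As a rough illustration of the virtual-padding idea (a sketch under my own assumptions, not XY-Serve's mechanism; `FIXED_TILE_M` and `gemm_with_virtual_padding` are hypothetical names), a dynamic dimension can be rounded up to the nearest supported tile size so that a small set of fixed-shape GEMM primitives covers arbitrary inputs:

```python
import numpy as np

FIXED_TILE_M = (16, 32, 64, 128)  # hypothetical set of supported tile heights

def gemm_with_virtual_padding(A, B):
    """Pad the dynamic M dimension up to the nearest fixed tile size,
    run the fixed-shape kernel, then slice the padding away."""
    m = A.shape[0]
    m_pad = next((t for t in FIXED_TILE_M if t >= m), None)
    if m_pad is None:  # fall back to a multiple of the largest tile
        m_pad = -(-m // FIXED_TILE_M[-1]) * FIXED_TILE_M[-1]
    A_pad = np.zeros((m_pad, A.shape[1]), dtype=A.dtype)
    A_pad[:m] = A
    C_pad = A_pad @ B          # stands in for the fixed-shape GEMM primitive
    return C_pad[:m]

A = np.random.randn(37, 64).astype(np.float32)
B = np.random.randn(64, 128).astype(np.float32)
assert np.allclose(gemm_with_virtual_padding(A, B), A @ B, atol=1e-4)
```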
arXiv  Detail & Related papers  (2024-12-24T02:27:44Z) - Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs [0.8217552831952]
Large language models (LLMs) have transformed the way we think about language understanding and generation.
Group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process.
We present a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization of LLMs to better match non-uniform patterns in their weight distributions.
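A minimal sketch of what group-wise, non-uniform codebook quantization can look like (my own toy version, not the paper's Arm kernels or codebook construction): each group of weights stores small indices into a per-group codebook, so dequantization becomes a gather instead of an affine rescale.

```python
import numpy as np

def quantize_groupwise_codebook(W, group=32, codebook_bits=3):
    """Illustrative non-uniform quantization: per-group codebooks chosen
    from weight quantiles, weights stored as small codebook indices."""
    levels = 2 ** codebook_bits
    Wg = W.reshape(-1, group)
    # Non-uniform codebook per group: levels follow the weight distribution.
    q = np.linspace(0, 1, levels)
    codebooks = np.quantile(Wg, q, axis=1).T                          # (num_groups, levels)
    idx = np.abs(Wg[:, :, None] - codebooks[:, None, :]).argmin(-1)   # nearest level
    return idx.astype(np.uint8), codebooks

def dequantize(idx, codebooks, shape):
    # Dequantization is a table lookup (gather); no scale multiply is needed.
    return np.take_along_axis(codebooks, idx.astype(np.int64), axis=1).reshape(shape)

W = np.random.randn(4, 64).astype(np.float32)
idx, cb = quantize_groupwise_codebook(W)
W_hat = dequantize(idx, cb, W.shape)
print("mean abs error:", np.abs(W - W_hat).mean())
```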
arXiv  Detail & Related papers  (2024-12-23T03:44:29Z) - MixPE: Quantization and Hardware Co-design for Efficient LLM Inference [16.42907854119748]
MixPE is a specialized mixed-precision processing element designed for efficient low-bit quantization in large language models.
We show that MixPE surpasses state-of-the-art quantization accelerators with a $2.6\times$ speedup and $1.4\times$ energy reduction.
arXiv  Detail & Related papers  (2024-11-25T07:34:53Z) - Expanding Sparse Tuning for Low Memory Usage [103.43560327427647]
We propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage.
To achieve low memory usage, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices.
A competition-based sparsification mechanism is further proposed to avoid the storage of tunable weight indexes.
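A rough sketch of this recipe as I read it (hypothetical code, not SNELL's implementation; the magnitude top-k below merely stands in for the competition-based mechanism): the tunable update is kept as two low-rank factors, and the dense update they produce is sparsified on the fly, so no per-weight index tensors need to be stored.

```python
import numpy as np

def snell_style_update(W, A, B, density=0.05):
    """Sketch of sparse tuning with low memory.

    W : (d_out, d_in) frozen pretrained weight
    A : (d_out, r), B : (r, d_in) learnable low-rank factors (the only
        tunable state kept in memory; no per-weight index tensors)
    density : fraction of entries of the merged update that are kept
    """
    delta = A @ B                                  # dense candidate update
    k = max(1, int(density * delta.size))
    # Stand-in for the competition-based sparsification: keep only the
    # largest-magnitude entries, recomputed from A and B on the fly.
    thresh = np.partition(np.abs(delta).ravel(), -k)[-k]
    mask = np.abs(delta) >= thresh
    return W + delta * mask

d_out, d_in, r = 64, 128, 4
W = np.random.randn(d_out, d_in)
A = 0.01 * np.random.randn(d_out, r)
B = 0.01 * np.random.randn(r, d_in)
W_tuned = snell_style_update(W, A, B)
```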
arXiv  Detail & Related papers  (2024-11-04T04:58:20Z) - EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation [84.70637613266835]
EoRA is a fine-tuning-free method that augments compressed Large Language Models with low-rank matrices. EoRA consistently outperforms prior training-free low-rank methods in recovering the accuracy of compressed LLMs.
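A hedged sketch of the general recipe (illustrative only; EoRA's eigenspace-weighted projection is more involved than the plain truncated SVD used here): approximate the compression error of a weight matrix with a low-rank term and add it back at inference time, without any fine-tuning.

```python
import numpy as np

def low_rank_compensation(W, W_compressed, rank=16):
    """Return factors (L, R) such that W_compressed + L @ R better
    approximates the original W. Plain truncated SVD of the residual is
    used here as a stand-in for EoRA's eigenspace-weighted variant."""
    residual = W - W_compressed
    U, S, Vt = np.linalg.svd(residual, full_matrices=False)
    L = U[:, :rank] * S[:rank]          # (d_out, rank)
    R = Vt[:rank]                       # (rank, d_in)
    return L, R

W = np.random.randn(256, 256)
W_4bit = np.round(W * 2) / 2            # toy stand-in for a compressed weight
L, R = low_rank_compensation(W, W_4bit, rank=16)
err_before = np.linalg.norm(W - W_4bit)
err_after = np.linalg.norm(W - (W_4bit + L @ R))
print(f"residual norm: {err_before:.2f} -> {err_after:.2f}")
```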
arXiv  Detail & Related papers  (2024-10-28T17:59:03Z) - Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv  Detail & Related papers  (2024-10-24T19:48:51Z) - Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding. PMPD achieves a $1.4\times$-$12.2\times$ speedup in matrix-vector multiplications over fp16 models. Our approach delivers a throughput gain of $3.8\times$-$8.0\times$ over fp16 models and up to $1.54\times$ over uniform quantization approaches.
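As I understand the high-level idea, precision is lowered progressively as decoding proceeds because later tokens tolerate coarser weights; the toy schedule below is my own illustration (phase boundaries and bit-widths are invented), not PMPD's actual policy.

```python
def precision_schedule(step, schedule=((0, 8), (64, 4), (256, 2))):
    """Return the weight bit-width to use at a given decoding step.
    `schedule` maps a starting step to a bit-width; later phases use
    progressively lower precision."""
    bits = schedule[0][1]
    for start, b in schedule:
        if step >= start:
            bits = b
    return bits

# Early tokens decode with 8-bit weights, later ones drop to 4 and then 2 bits.
assert [precision_schedule(s) for s in (0, 10, 100, 500)] == [8, 8, 4, 2]
```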
arXiv  Detail & Related papers  (2024-10-17T11:46:33Z) - Designing Efficient LLM Accelerators for Edge Devices [1.4128048241287314]
Large Language Models (LLMs) can be deployed on resource-constrained edge devices to reduce reliance on network connections and provide more privacy.
However, edge hardware offers only limited compute and memory for LLM inference. To address this issue, designing new and efficient edge accelerators for LLM inference is crucial.
We propose SECDA-LLM, that utilizes the SECDA methodology to streamline the process of designing, integrating, and deploying efficient FPGA-based LLM accelerators.
arXiv  Detail & Related papers  (2024-08-01T11:06:05Z) - Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs.
At batch sizes below 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
arXiv  Detail & Related papers  (2024-07-15T17:55:42Z) - T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge [11.305778938818937]
We introduce T-MAC, an innovative lookup table(LUT)-based method for efficient low-bit LLM (i.e., weight-quantized LLM) inference on CPUs.
T-MAC directly supports mpGEMM without dequantization, while simultaneously eliminating multiplications and reducing the additions required.
Evaluated on low-bit Llama and BitNet models, T-MAC demonstrates up to a 4x increase in throughput and a 70% reduction in energy consumption.
arXiv  Detail & Related papers  (2024-06-25T08:38:38Z) - EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting [12.006890185810322]
We introduce a computation- and memory-efficient LLM tuning framework, called Edge-LLM, to facilitate affordable and effective LLM adaptation on edge devices.
Specifically, Edge-LLM features three core components: (1) a layer-wise unified compression (LUC) technique to reduce the computation overhead by generating layer-wise pruning sparsity and quantization bit-width policies, (2) an adaptive layer tuning and voting scheme to reduce the memory overhead by reducing the backpropagation depth, and (3) a complementary hardware scheduling strategy to handle the irregular computation patterns introduced by LUC and adaptive layer tuning.
arXiv  Detail & Related papers  (2024-06-22T06:51:47Z) - SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique investigated in large language models (LLMs).
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
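The core allocation idea can be sketched loosely as follows (my own toy allocator; SliM-LLM's salience metric and optimization are more sophisticated): rank weight groups by a salience score and give the most salient groups more bits while holding the average bit-width at the target.

```python
import numpy as np

def allocate_bits(salience, target_bits=4, hi=8, lo=2):
    """Give the most salient groups `hi` bits and the rest `lo` bits so the
    average bit-width stays near `target_bits`. `salience` holds one score
    per weight group (e.g., derived from activation statistics)."""
    n = len(salience)
    n_hi = int(round(n * (target_bits - lo) / (hi - lo)))
    order = np.argsort(salience)[::-1]         # most salient groups first
    bits = np.full(n, lo)
    bits[order[:n_hi]] = hi
    return bits

salience = np.abs(np.random.randn(128))        # hypothetical per-group scores
bits = allocate_bits(salience)
print("average bit-width:", bits.mean())       # ~4.0 by construction
```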
arXiv  Detail & Related papers  (2024-05-23T16:21:48Z) - Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [166.40879020706151]
This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during fine-tuning.
Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques.
Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance.
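For readers unfamiliar with the mechanics, zeroth-order methods replace backpropagation with gradient estimates built from forward passes only; a minimal two-point random-perturbation estimator looks like the sketch below (illustrative, not the benchmark's code).

```python
import numpy as np

def zo_gradient(loss_fn, theta, eps=1e-3, num_samples=4):
    """Two-point zeroth-order gradient estimate:
    g ~ E_u[(L(theta + eps*u) - L(theta - eps*u)) / (2*eps) * u],
    needing only forward evaluations of the loss (no backprop memory)."""
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        u = np.random.randn(*theta.shape)
        delta = loss_fn(theta + eps * u) - loss_fn(theta - eps * u)
        grad += (delta / (2 * eps)) * u
    return grad / num_samples

# Toy check on a quadratic, where the true gradient is 2 * theta.
theta = np.array([1.0, -2.0, 0.5])
g = zo_gradient(lambda t: float(np.sum(t ** 2)), theta, num_samples=2000)
print(g, "vs", 2 * theta)   # estimates concentrate around the true gradient
```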
arXiv  Detail & Related papers  (2024-02-18T14:08:48Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves, for the first time, high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv  Detail & Related papers  (2024-02-06T09:26:34Z) - Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
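To give a sense of the representation (a sketch only; AQLM's codebook learning and encoding are far more elaborate), additive quantization stores each weight group as a handful of codebook indices whose codewords are summed at decode time.

```python
import numpy as np

def aq_decode(indices, codebooks):
    """Additive-quantization decode: each group of weights is the sum of one
    codeword from each of M codebooks.

    indices   : (num_groups, M) integer codes
    codebooks : (M, codebook_size, group_dim) learned codewords
    """
    M = codebooks.shape[0]
    return sum(codebooks[m][indices[:, m]] for m in range(M))  # (num_groups, group_dim)

# Toy shapes: 2 codebooks of 256 entries, weight groups of 8 values each.
M, size, group_dim, num_groups = 2, 256, 8, 512
codebooks = np.random.randn(M, size, group_dim).astype(np.float32)
indices = np.random.randint(0, size, size=(num_groups, M))
W_groups = aq_decode(indices, codebooks)    # reconstructed weight groups
print(W_groups.shape)                       # (512, 8)
```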
arXiv  Detail & Related papers  (2024-01-11T18:54:44Z) - Sparse Universal Transformer [64.78045820484299]
The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers.
This paper proposes the Sparse Universal Transformer (SUT), which leverages Sparse Mixture of Experts (SMoE) and a new stick-breaking-based dynamic halting mechanism.
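The stick-breaking halting idea can be sketched as follows (illustrative NumPy, not SUT's implementation): each application of the shared layer emits a halting probability, the remaining "stick" of probability shrinks accordingly, and the output is the expectation over layers.

```python
import numpy as np

def stick_breaking_halting(layer_outputs, halt_probs):
    """Combine per-layer outputs with stick-breaking weights.

    layer_outputs : (L, d) hidden states after each of L shared layers
    halt_probs    : (L,) halting probability beta_l emitted at each layer
    weight_l = beta_l * prod_{j<l} (1 - beta_j); leftover mass goes to the last layer.
    """
    L = len(halt_probs)
    remaining = 1.0
    weights = np.zeros(L)
    for l in range(L):
        weights[l] = halt_probs[l] * remaining
        remaining *= 1.0 - halt_probs[l]
    weights[-1] += remaining            # assign whatever stick is left to the final layer
    return weights @ layer_outputs      # expected output under the halting distribution

outs = np.random.randn(6, 16)                      # 6 applications of the shared layer
betas = np.array([0.1, 0.2, 0.3, 0.5, 0.7, 0.9])
print(stick_breaking_halting(outs, betas).shape)   # (16,)
```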
arXiv  Detail & Related papers  (2023-10-11T00:38:57Z) - LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models [9.727062803700264]
We introduce LUT-GEMM, an efficient kernel for quantized matrix multiplication.
LUT-GEMM eliminates the resource-intensive dequantization process and reduces computational costs.
We show experimentally that when applied to the OPT-175B model with 3-bit quantization, LUT-GEMM substantially accelerates token generation latency.
arXiv  Detail & Related papers  (2022-06-20T03:48:17Z) - Logic Shrinkage: Learned FPGA Netlist Sparsity for Efficient Neural Network Inference [3.2296078260106174]
We propose the learned optimization of such LUT-based topologies, resulting in higher-efficiency designs.
Existing implementations of this class of architecture require the manual specification of the number of inputs per LUT, K.
We propose logic shrinkage, a fine-grained netlist pruning methodology enabling K to be automatically learned for every LUT in a neural network targeted for FPGA inference.
arXiv  Detail & Related papers  (2021-12-04T14:23:24Z) 
        This list is automatically generated from the titles and abstracts of the papers on this site.
       
     
           This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.