The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference
- URL: http://arxiv.org/abs/2506.11728v1
- Date: Fri, 13 Jun 2025 12:40:16 GMT
- Title: The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference
- Authors: Héctor Martínez, Adrián Castelló, Francisco D. Igual, Enrique S. Quintana-Ortí
- Abstract summary: Deep learning (DL) has led to a shift from traditional 64-bit floating point (FP64) computations toward reduced-precision formats. This paper revisits traditional high-performance gemm and describes strategies for adapting it to mixed-precision integer arithmetic.
- Score: 0.9954176833299684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in deep learning (DL) have led to a shift from traditional 64-bit floating point (FP64) computations toward reduced-precision formats, such as FP16, BF16, and 8- or 16-bit integers, combined with mixed-precision arithmetic. This transition enhances computational throughput, reduces memory and bandwidth usage, and improves energy efficiency, offering significant advantages for resource-constrained edge devices. To support this shift, hardware architectures have evolved accordingly, now including adapted ISAs (Instruction Set Architectures) that expose mixed-precision vector units and matrix engines tailored for DL workloads. At the heart of many DL and scientific computing tasks is the general matrix-matrix multiplication gemm, a fundamental kernel historically optimized using axpy vector instructions on SIMD (single instruction, multiple data) units. However, as hardware moves toward mixed-precision dot-product-centric operations optimized for quantized inference, these legacy approaches are being phased out. In response to this, our paper revisits traditional high-performance gemm and describes strategies for adapting it to mixed-precision integer (MIP) arithmetic across modern ISAs, including x86_64, ARM, and RISC-V. Concretely, we illustrate novel micro-kernel designs and data layouts that better exploit today's specialized hardware and demonstrate significant performance gains from MIP arithmetic over floating-point implementations across three representative CPU architectures. These contributions highlight a new era of gemm optimization, driven by the demands of DL inference on heterogeneous architectures, marking what we term the "Cambrian period" for matrix multiplication.
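To make the dot-product-centric idea concrete, below is a minimal scalar sketch (not taken from the paper) of an int8-by-int8 GEMM micro-kernel with int32 accumulation. The constants MR, NR, KU and the names pack_A_k4, pack_B_k4, ukernel_mip are illustrative assumptions: real MIP kernels would replace the innermost KU-loop with a 4-way dot-product instruction such as AVX512-VNNI's VPDPBUSD, Arm's SDOT/UDOT, or an equivalent RISC-V vector sequence, and would use much larger register tiles. The packing routines only hint at the kind of KU-grouped data layout such instructions expect.

```c
/* Minimal scalar sketch of a mixed-precision-integer (MIP) GEMM micro-kernel.
 * The innermost KU-loop is the scalar analogue of one 4-way dot-product
 * instruction (VPDPBUSD / SDOT); all names and tile sizes are illustrative. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum { MR = 4, NR = 4, KU = 4 };          /* micro-tile and dot-product group */

/* Round kc up to a multiple of KU; the pack routines zero-pad the tail. */
static int round_up(int kc) { return (kc + KU - 1) / KU * KU; }

/* Pack an MR x kc block of A (row-major, leading dim lda) so that, for each
 * k-group, the KU values of one row are stored contiguously. */
static void pack_A_k4(int kc, const int8_t *A, int lda, int8_t *Ap) {
    for (int k0 = 0; k0 < kc; k0 += KU)
        for (int i = 0; i < MR; ++i)
            for (int k = 0; k < KU; ++k)
                *Ap++ = (k0 + k < kc) ? A[i * lda + k0 + k] : 0;
}

/* Pack a kc x NR block of B (row-major, leading dim ldb) the same way,
 * grouped per output column j. */
static void pack_B_k4(int kc, const int8_t *B, int ldb, int8_t *Bp) {
    for (int k0 = 0; k0 < kc; k0 += KU)
        for (int j = 0; j < NR; ++j)
            for (int k = 0; k < KU; ++k)
                *Bp++ = (k0 + k < kc) ? B[(k0 + k) * ldb + j] : 0;
}

/* MR x NR micro-kernel: C(int32) += A(int8) * B(int8). */
static void ukernel_mip(int kc, const int8_t *Ap, const int8_t *Bp,
                        int32_t *C, int ldc) {
    for (int k0 = 0; k0 < round_up(kc); k0 += KU) {
        const int8_t *a = Ap + k0 * MR;   /* one MR x KU group */
        const int8_t *b = Bp + k0 * NR;   /* one NR x KU group */
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j) {
                int32_t acc = 0;
                for (int k = 0; k < KU; ++k)      /* one "dot-product" step */
                    acc += (int32_t)a[i * KU + k] * (int32_t)b[j * KU + k];
                C[i * ldc + j] += acc;            /* wide accumulator: no int8 overflow */
            }
    }
}

int main(void) {
    enum { KC = 6 };                       /* deliberately not a multiple of KU */
    int8_t A[MR][KC], B[KC][NR];
    int32_t C[MR][NR];
    memset(C, 0, sizeof C);
    for (int i = 0; i < MR; ++i) for (int k = 0; k < KC; ++k) A[i][k] = (int8_t)(i - k);
    for (int k = 0; k < KC; ++k) for (int j = 0; j < NR; ++j) B[k][j] = (int8_t)(k + j);

    int8_t Ap[MR * 8], Bp[NR * 8];         /* 8 = round_up(KC) */
    pack_A_k4(KC, &A[0][0], KC, Ap);
    pack_B_k4(KC, &B[0][0], NR, Bp);
    ukernel_mip(KC, Ap, Bp, &C[0][0], NR);

    printf("C[0][0]=%d  C[%d][%d]=%d\n", (int)C[0][0], MR - 1, NR - 1, (int)C[MR - 1][NR - 1]);
    return 0;
}
```

Compiling with `cc -O2` and running prints two entries of the 4x4 result tile; a full gemm would wrap such a micro-kernel in the usual cache-blocking loops over M, N and K, which is where the data layouts mentioned in the abstract come into play.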
Related papers
- Orthogonal Finetuning Made Scalable [87.49040247077389]
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. We propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance.
arXiv Detail & Related papers (2025-06-24T17:59:49Z) - Scaling Probabilistic Circuits via Monarch Matrices [109.65822339230853]
Probabilistic Circuits (PCs) are tractable representations of probability distributions. We propose a novel sparse and structured parameterization for the sum blocks in PCs.
arXiv Detail & Related papers (2025-06-14T07:39:15Z) - gECC: A GPU-based high-throughput framework for Elliptic Curve Cryptography [15.39096542261856]
Elliptic Curve Cryptography (ECC) is an encryption method that provides security comparable to traditional techniques like Rivest-Shamir-Adleman (RSA). ECC is still hindered by the significant performance overhead associated with elliptic curve (EC) operations. This paper presents gECC, a versatile framework for ECC optimized for GPU architectures.
arXiv Detail & Related papers (2024-12-22T01:50:50Z) - SMM-Conv: Scalar Matrix Multiplication with Zero Packing for Accelerated Convolution [4.14360329494344]
We present a novel approach for accelerating convolutions during inference for CPU-based architectures.
Our experiments with commonly used network architectures demonstrate a significant speedup compared to existing indirect methods.
arXiv Detail & Related papers (2024-11-23T21:43:38Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes. Our results demonstrate up to a 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z) - Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores [3.6385567224218556]
Large language models (LLMs) have been widely applied but face challenges in efficient inference.
We introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization.
We implement an arbitrary precision matrix multiplication scheme that decomposes and recovers at the bit level, enabling flexible precision.
arXiv Detail & Related papers (2024-09-26T14:17:58Z) - Mixed-precision Neural Networks on RISC-V Cores: ISA extensions for Multi-Pumped Soft SIMD Operations [5.847997723738113]
Modern embedded microprocessors provide very limited support for mixed-precision NNs.
We introduce a hardware-software co-design framework that enables cooperative hardware design, mixed-precision quantization, ISA extensions and inference.
Our framework can achieve, on average, 15x energy reduction for less than 1% accuracy loss and outperforms the ISA-agnostic state-of-the-art RISC-V cores.
arXiv Detail & Related papers (2024-07-19T12:54:04Z) - Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z) - DAISM: Digital Approximate In-SRAM Multiplier-based Accelerator for DNN Training and Inference [4.718504401468233]
PIM solutions rely either on novel memory technologies that have yet to mature or bit-serial computations that have significant performance overhead and scalability issues.
Our work proposes an in-SRAM digital multiplier, that uses a conventional memory to perform bit-parallel computations, leveraging multiple wordlines activation.
We then introduce DAISM, an architecture leveraging this multiplier, which achieves up to two orders of magnitude higher area efficiency compared to the SOTA counterparts, with competitive energy efficiency.
arXiv Detail & Related papers (2023-05-12T10:58:21Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
arXiv Detail & Related papers (2022-08-15T17:08:50Z) - Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration [14.958793135751149]
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM).
Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead.
We address a key architectural challenge with structural sparsity: how to provide support for a range of sparsity levels while maintaining high utilization of the hardware.
arXiv Detail & Related papers (2020-09-04T20:17:42Z) - Scalable Second Order Optimization for Deep Learning [34.12384996822749]
We present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad).
Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units.
We demonstrate superior performance compared to state-of-the-art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.
arXiv Detail & Related papers (2020-02-20T20:51:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.