Fast Inference with Kronecker-Sparse Matrices
- URL: http://arxiv.org/abs/2405.15013v3
- Date: Fri, 13 Jun 2025 09:00:28 GMT
- Title: Fast Inference with Kronecker-Sparse Matrices
- Authors: Antoine Gonon, Léon Zheng, Pascal Carrivain, Quoc-Tung Le,
- Abstract summary: Existing GPU kernels for KS matrix multiplication suffer from high data movement costs.<n>We propose a fused, output-stationary GPU kernel that eliminates these overheads.<n>We demonstrate in FP32 end-to-end latency reductions of up to 22% in ViT-S/16 and 16% in GPT-2 medium.
- Score: 4.387337528923525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Kronecker-sparse (KS) matrices -- whose supports are Kronecker products of identity and all-ones blocks -- underpin the structure of Butterfly and Monarch matrices and offer the promise of more efficient models. However, existing GPU kernels for KS matrix multiplication suffer from high data movement costs, with up to 50% of time spent on memory-bound tensor permutations. We propose a fused, output-stationary GPU kernel that eliminates these overheads, reducing global memory traffic threefold. Across 600 KS patterns, our kernel achieves in FP32 a median speedup of x1.4 and lowers energy consumption by 15%. A simple heuristic based on KS pattern parameters predicts when our method outperforms existing ones. We release all code at github.com/PascalCarrivain/ksmm, including a PyTorch-compatible KSLinear layer, and demonstrate in FP32 end-to-end latency reductions of up to 22% in ViT-S/16 and 16% in GPT-2 medium.
Related papers
- SegMate: Asymmetric Attention-Based Lightweight Architecture for Efficient Multi-Organ Segmentation [18.60155862524957]
State-of-the-art models for medical image segmentation achieve excellent accuracy but require substantial computational resources.<n>We present SegMate, an efficient 2.5D framework that achieves state-of-the-art accuracy, while considerably reducing computational requirements.
arXiv Detail & Related papers (2026-02-27T10:50:55Z) - MonarchRT: Efficient Attention for Real-Time Video Generation [36.624688008552546]
We propose Monarch-RT, a structured a sparse attention parameterization for video diffusion models.<n>We achieve high expressivity while preserving computational efficiency.<n>Monarch-RT attains up to 95% attention sparsity with no loss in quality when applied to the state-of-the-art model Self-Forcing.
arXiv Detail & Related papers (2026-02-12T18:56:53Z) - Memory-Efficient Acceleration of Block Low-Rank Foundation Models on Resource Constrained GPUs [11.45717904490388]
Recent advances in transformer-based foundation models have made them the default choice for many tasks.<n>Their rapidly growing size makes fitting a full model on a single GPU increasingly difficult and their computational cost prohibitive.<n>Block low-rank (BLR) compression techniques address this challenge by learning compact representations of weight matrices.
arXiv Detail & Related papers (2025-12-24T00:41:13Z) - Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication [6.557224606759151]
Modern accelerators are commonly equipped with cores and cores to accelerate sparse operators.<n>We show that utilizing one resource alone leads to inferior performance for sparse matrix multiplication, due to their respective limitations.<n>We propose a 2D-aware workload computation strategy find out the sweet point of task mapping operators, leveraging both the high performance of 2.9 cores and the low redundancy on cores.
arXiv Detail & Related papers (2025-06-28T01:50:13Z) - CommVQ: Commutative Vector Quantization for KV Cache Compression [50.37946553931796]
We propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference.<n>We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache.<n>Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook.
arXiv Detail & Related papers (2025-06-23T17:50:11Z) - A Simple Linear Patch Revives Layer-Pruned Large Language Models [58.056251480151104]
Layer pruning has emerged as a widely used technique for compressing large language models (LLMs)<n>textscLinearPatch fuses two operations into one matrix multiply at the pruning interface.<n>The patch can be further refined with 5K unlabeled samples via memory-efficient offline distillation, pushing the retention to 95.16% within only 30 minutes on a single GPU.
arXiv Detail & Related papers (2025-05-30T15:06:08Z) - Speedy MASt3R [68.47052557089631]
MASt3R redefines image matching as a 3D task by leveraging DUSt3R and introducing a fast reciprocal matching scheme.<n>Fast MASt3R achieves a 54% reduction in inference time (198 ms to 91 ms per image pair) without sacrificing accuracy.<n>This advancement enables real-time 3D understanding, benefiting applications like mixed reality navigation and large-scale 3D scene reconstruction.
arXiv Detail & Related papers (2025-03-13T03:56:22Z) - KurTail : Kurtosis-based LLM Quantization [51.24081396305435]
KurTail is a new post-training quantization scheme that mitigates outliers in the activations of large language models.<n>It offers a 13.3% boost in MMLU accuracy and a 15.5% drop in Wiki perplexity compared to QuaRot.<n>It also outperforms SpinQuant with a 2.6% MMLU gain and reduces perplexity by 2.9%, all while reducing the training cost.
arXiv Detail & Related papers (2025-03-03T12:43:06Z) - Unified Kernel-Segregated Transpose Convolution Operation [3.4558311080267954]
We introduce a unified kernel segregation approach that limits the usage of memory and computational resources.<n>The proposed method for the transpose convolution layers in the EB-GAN model demonstrates significant memory savings of up to 35 MB.
arXiv Detail & Related papers (2025-02-27T19:56:25Z) - FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores [6.404201720333765]
We propose FlashSparse, a novel approach to bridge the gap between sparse workloads and the TCU architecture.
Specifically, FlashSparse minimizes the sparse granularity for SpMM and SDDMM on TCUs through a novel swap-and-transpose matrix multiplication strategy.
We show that FlashSparse sets a new state-of-the-art for sparse matrix multiplications (geometric mean 5.5x speedup over DTC-SpMM and 3.22x speedup over RoDe)
arXiv Detail & Related papers (2024-12-15T01:12:33Z) - SMM-Conv: Scalar Matrix Multiplication with Zero Packing for Accelerated Convolution [4.14360329494344]
We present a novel approach for accelerating convolutions during inference for CPU-based architectures.
Our experiments with commonly used network architectures demonstrate a significant speedup compared to existing indirect methods.
arXiv Detail & Related papers (2024-11-23T21:43:38Z) - Expanding Sparse Tuning for Low Memory Usage [103.43560327427647]
We propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage.
To achieve low memory usage, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices.
A competition-based sparsification mechanism is further proposed to avoid the storage of tunable weight indexes.
arXiv Detail & Related papers (2024-11-04T04:58:20Z) - Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block-Train, which we show performs better than dense for the same compute on multiple tasks.
arXiv Detail & Related papers (2024-06-10T13:25:43Z) - Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with
SplitK work decomposition [0.44998333629984877]
We propose an implementation of an efficient fused matrix multiplication kernel for W4A16 quantized inference.
Our implementation shows improvement for the type of skinny matrix-matrix multiplications found in foundation model inference workloads.
arXiv Detail & Related papers (2024-01-05T19:17:55Z) - An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z) - ACNPU: A 4.75TOPS/W 1080P@30FPS Super Resolution Accelerator with
Decoupled Asymmetric Convolution [0.0502254944841629]
Deep learning-driven superresolution (SR) outperforms traditional techniques but also faces the challenge of high complexity and memory bandwidth.
This paper proposes an energy-efficient SR accelerator, ACNPU, to tackle this challenge.
The ACNPU enhances image quality by 0.34dB with a 27-layer model, but needs 36% less complexity than FSRCNN.
arXiv Detail & Related papers (2023-08-30T07:23:32Z) - KrADagrad: Kronecker Approximation-Domination Gradient Preconditioned
Stochastic Optimization [69.47358238222586]
Second orderimations allow parameter update step size and direction to adapt to loss curvature.
Recently, Shampoo introduced a Kronecker factored preconditioner to reduce these requirements.
It takes inverse matrix roots of ill-conditioned matrices.
This requires 64-bit precision, imposing strong hardware constraints.
arXiv Detail & Related papers (2023-05-30T21:15:45Z) - You Only Segment Once: Towards Real-Time Panoptic Segmentation [68.91492389185744]
YOSO is a real-time panoptic segmentation framework.
YOSO predicts masks via dynamic convolutions between panoptic kernels and image feature maps.
YOSO achieves 46.4 PQ, 45.6 FPS on COCO; 52.5 PQ, 22.6 FPS on Cityscapes; 38.0 PQ, 35.4 FPS on ADE20K.
arXiv Detail & Related papers (2023-03-26T07:55:35Z) - Optimized Sparse Matrix Operations for Reverse Mode Automatic
Differentiation [3.72826300260966]
We present an implementation of a CSR-based sparse matrix wrapper for PyTorch with acceleration for basic matrix operations, as well as automatic differentiability.
We also present several applications of the resulting sparse kernels to optimization problems, demonstrating ease of implementation and performance measurements versus their dense counterparts.
arXiv Detail & Related papers (2022-12-10T00:46:51Z) - Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z) - Monarch: Expressive Structured Matrices for Efficient and Accurate
Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z) - Robust 1-bit Compressive Sensing with Partial Gaussian Circulant
Matrices and Generative Priors [54.936314353063494]
We provide recovery guarantees for a correlation-based optimization algorithm for robust 1-bit compressive sensing.
We make use of a practical iterative algorithm, and perform numerical experiments on image datasets to corroborate our results.
arXiv Detail & Related papers (2021-08-08T05:28:06Z) - VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose textitVersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator.
textitVersaGNN achieves on average 3712$times$ speedup with 1301.25$times$ energy reduction on CPU, and 35.4$times$ speedup with 17.66$times$ energy reduction on GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z) - Direct Spatial Implementation of Sparse Matrix Multipliers for Reservoir
Computing [0.0]
Reservoir computing systems rely on the recurrent multiplication of a very large, sparse, fixed matrix.
We argue that direct implementation of these fixed matrices minimizes the work performed in the computation.
We present the structure of our bit-serial matrix multiplier, and evaluate using canonical signed digit representation to further reduce logic utilization.
arXiv Detail & Related papers (2021-01-21T23:16:22Z) - Sparse GPU Kernels for Deep Learning [24.94153856081836]
Deep learning applications have relatively moderate levels of sparsity that are not sufficient for existing sparse kernels to outperform their dense counterparts.
We develop high-performance GPU kernels for two sparse matrix operations widely applicable in neural networks.
Using our kernels, we demonstrate sparse Transformer and MobileNet models that achieve 1.2-2.1x speedups and up to 12.8x memory savings without sacrificing accuracy.
arXiv Detail & Related papers (2020-06-18T23:59:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.