Related papers: Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration

Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration

URL: http://arxiv.org/abs/2511.18674v1
Date: Mon, 24 Nov 2025 01:13:52 GMT
Title: Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration
Authors: Alfredo Metere,
Abstract summary: Low-Rank GEMM is a novel approach that leverages low-rank matrix approximations to achieve sub-quadratic complexity.<n>System automatically adapts to hardware capabilities, selecting optimal decomposition methods.
Score: 0.0
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large matrix multiplication is a cornerstone of modern machine learning workloads, yet traditional approaches suffer from cubic computational complexity (e.g., $\mathcal{O}(n^3)$ for a matrix of size $n\times n$). We present Low-Rank GEMM, a novel approach that leverages low-rank matrix approximations to achieve sub-quadratic complexity while maintaining hardware-accelerated performance through FP8 precision and intelligent kernel selection. On a NVIDIA RTX 4090, our implementation achieves up to 378 TFLOPS on matrices up to $N=20480$, providing 75\% memory savings and $7.8\times$ speedup over PyTorch FP32 for large matrices. The system automatically adapts to hardware capabilities, selecting optimal decomposition methods (SVD, randomized SVD) and precision levels based on matrix characteristics and available accelerators. Comprehensive benchmarking on NVIDIA RTX 4090 demonstrates that Low-Rank GEMM becomes the fastest approach for matrices $N\geq10240$, surpassing traditional cuBLAS implementations through memory bandwidth optimization rather than computational shortcuts.

Related papers

Evolution Strategies at the Hyperscale [57.75314521465674]
We introduce EGGROLL, an evolution strategies (ES) algorithm designed to scale backprop-free optimization to large population sizes.<n>ES is a set of powerful blackbox optimisation methods that can handle non-differentiable or noisy objectives.<n>EGGROLL overcomes these bottlenecks by generating random matrices $Ain mathbbRmtimes r, Bin mathbbRntimes r$ with $rll min(m,n)$ to form a low-rank matrix perturbation $A Btop$
arXiv Detail & Related papers (2025-11-20T18:56:05Z)
APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration [5.075697428779204]
Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance.<n>This is primarily due to the limited support for the GPU Cores, inefficient memory management, and inflexible kernel optimizations.<n>We propose a comprehensive acceleration scheme for arbitrary precision LLMs, namely APT-LLM.
arXiv Detail & Related papers (2025-08-26T14:48:29Z)
FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models [49.397861654088636]
We propose a two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces.<n>We show that our strategy achieves faster runtime and reduced memory usage by up to $25%$ across different model sizes.
arXiv Detail & Related papers (2025-05-23T14:37:00Z)
A Nonlinear Hash-based Optimization Method for SpMV on GPUs [19.6395697341071]
We highlight the effectiveness of hash-based techniques in optimizing sparse matrix reordering.<n>In this paper, we introduce the Hash-based Partition (HBP) format, a lightweight SpMV approach.<n>In experiments, our method offers an average speedup of 3.53 times compared to the sorting approach and 3.67 times compared to the dynamic programming method employed in Regu2D.
arXiv Detail & Related papers (2025-04-11T08:31:44Z)
KrADagrad: Kronecker Approximation-Domination Gradient Preconditioned Stochastic Optimization [69.47358238222586]
Second orderimations allow parameter update step size and direction to adapt to loss curvature. Recently, Shampoo introduced a Kronecker factored preconditioner to reduce these requirements. It takes inverse matrix roots of ill-conditioned matrices. This requires 64-bit precision, imposing strong hardware constraints.
arXiv Detail & Related papers (2023-05-30T21:15:45Z)
Optimizing Sparse Linear Algebra Through Automatic Format Selection and Machine Learning [0.0]
We introduce Morpheus-Oracle, a library that provides a lightweight ML auto-tuner capable of accurately predicting the optimal format across multiple backends. We achieve an average classification accuracy and balanced accuracy of 92.63% and 80.22% respectively across the available systems.
arXiv Detail & Related papers (2023-03-09T08:17:26Z)
Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications. We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune. A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones. We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z)
Direct Spatial Implementation of Sparse Matrix Multipliers for Reservoir Computing [0.0]
Reservoir computing systems rely on the recurrent multiplication of a very large, sparse, fixed matrix. We argue that direct implementation of these fixed matrices minimizes the work performed in the computation. We present the structure of our bit-serial matrix multiplier, and evaluate using canonical signed digit representation to further reduce logic utilization.
arXiv Detail & Related papers (2021-01-21T23:16:22Z)
What if Neural Networks had SVDs? [66.91160214071088]
Various Neural Networks employ time-consuming matrix operations like matrix inversion. We present an algorithm that is fast enough to speed up several matrix operations.
arXiv Detail & Related papers (2020-09-29T12:58:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.