AutoSAGE: Input-Aware CUDA Scheduling for Sparse GNN Aggregation (SpMM/SDDMM) and CSR Attention
- URL: http://arxiv.org/abs/2511.17594v1
- Date: Mon, 17 Nov 2025 18:25:51 GMT
- Title: AutoSAGE: Input-Aware CUDA Scheduling for Sparse GNN Aggregation (SpMM/SDDMM) and CSR Attention
- Authors: Aleksandar Stankovic
- Abstract summary: AutoSAGE is an input-aware scheduler that chooses tiling and mapping per input. On Reddit and OGBN-Products it matches vendor baselines at bandwidth-bound feature widths; on synthetic sparsity and skew stress tests it achieves up to 4.7x kernel-level speedups.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparse GNN aggregations (CSR SpMM/SDDMM) vary widely in performance with degree skew, feature width, and GPU micro-architecture. We present AutoSAGE, an input-aware CUDA scheduler that chooses tiling and mapping per input using a lightweight estimate refined by on-device micro-probes, with a guardrail that safely falls back to vendor kernels and a persistent cache for deterministic replay. AutoSAGE covers SpMM and SDDMM and composes into a CSR attention pipeline (SDDMM -> row-softmax -> SpMM). On Reddit and OGBN-Products, it matches vendor baselines at bandwidth-bound feature widths and finds gains at small widths; on synthetic sparsity and skew stress tests it achieves up to 4.7x kernel-level speedups. We release CUDA sources, Python bindings, a reproducible harness, and replayable cache logs.
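To make the composed pipeline concrete, here is a minimal CUDA sketch of one CSR attention step (SDDMM -> row-softmax -> SpMM) written as a single row-parallel kernel. It is an illustration under assumed layouts (row-major Q/K/V, per-row degree at most MAX_DEG), not the released AutoSAGE kernels, whose tiling and mapping are chosen per input.

```cuda
// Hypothetical fused CSR attention step: SDDMM -> row-softmax -> SpMM.
// One thread block per CSR row; illustration only, not the AutoSAGE code.
#include <cuda_runtime.h>
#include <math.h>

#define MAX_DEG 1024  // assumed cap on row degree for the shared buffer

__global__ void csr_attention_row(const int *rowptr, const int *colidx,
                                  const float *Q, const float *K,
                                  const float *V, float *out, int d) {
    int row = blockIdx.x;                  // one thread block per CSR row
    int beg = rowptr[row], end = rowptr[row + 1];
    int deg = end - beg;                   // assumes deg <= MAX_DEG
    __shared__ float score[MAX_DEG];

    // SDDMM: one attention logit per neighbor, threads strided over edges.
    for (int e = threadIdx.x; e < deg; e += blockDim.x) {
        int j = colidx[beg + e];
        float s = 0.f;
        for (int t = 0; t < d; ++t) s += Q[row * d + t] * K[j * d + t];
        score[e] = s;
    }
    __syncthreads();

    // Row softmax, serialized in thread 0 for clarity (a real kernel
    // would use a parallel reduction).
    if (threadIdx.x == 0) {
        float m = -INFINITY, z = 0.f;
        for (int e = 0; e < deg; ++e) m = fmaxf(m, score[e]);
        for (int e = 0; e < deg; ++e) { score[e] = expf(score[e] - m); z += score[e]; }
        for (int e = 0; e < deg; ++e) score[e] /= z;
    }
    __syncthreads();

    // SpMM: weighted aggregation of neighbor value rows, threads over d.
    for (int t = threadIdx.x; t < d; t += blockDim.x) {
        float acc = 0.f;
        for (int e = 0; e < deg; ++e) acc += score[e] * V[colidx[beg + e] * d + t];
        out[row * d + t] = acc;
    }
}
```

A scheduler in the spirit of the abstract would pick among variants of such a kernel (warp-per-row vs. block-per-row, different feature tiles) using a lightweight cost estimate refined by on-device micro-probes, and fall back to the vendor kernel when probing shows no gain.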
Related papers
- Beyond GEMM-Centric NPUs: Enabling Efficient Diffusion LLM Sampling
Diffusion Large Language Models (dLLMs) introduce iterative denoising to enable parallel token generation.
Our design employs lightweight non-GEMM vector primitives, in-place memory reuse strategies, and a decoupled mixed-precision memory hierarchy.
arXiv Detail & Related papers (2026-01-28T15:37:50Z) - PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs
Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames.
We present Phase Aggregated Smoothing (PAS), a training-free mechanism that applies small opposed phase offsets across heads and then aggregates their outputs.
Our analysis shows that the RoPE-rotated logit can be approximated as a content dot product scaled by a time kernel; smoothing this kernel yields Lipschitz stability of attention to small temporal shifts, and multi-phase averaging attenuates high-frequency ripples while preserving per-head spectra under Nyquist-valid sampling.
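Read literally, the summary suggests the following shape for the analysis; the notation below is assumed for illustration and is not taken from the paper.

```latex
% Sketch of the claimed decomposition (notation assumed for illustration).
% The RoPE-rotated logit between a query at time t and a key at time s is
% approximated as a content term scaled by a time kernel:
\[
  a(\Delta t) \approx \langle q, k \rangle \, \kappa(\Delta t),
  \qquad \Delta t = t - s .
\]
% Aggregating heads with small opposed phase offsets $\pm\delta_h$ acts as
% a low-pass filter on $\kappa$, attenuating high-frequency ripples while
% keeping the slowly varying envelope:
\[
  \bar{a}(\Delta t) = \frac{1}{2H} \sum_{h=1}^{H}
    \bigl( a(\Delta t + \delta_h) + a(\Delta t - \delta_h) \bigr)
  \approx \langle q, k \rangle \, \bar{\kappa}(\Delta t).
\]
```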
arXiv Detail & Related papers (2025-11-14T05:56:47Z) - Why Should the Server Do It All?: A Scalable, Versatile, and Model-Agnostic Framework for Server-Light DNN Inference over Massively Distributed Clients via Training-Free Intermediate Feature Compression
We introduce SLICER, a retraining-free, architecture-agnostic framework that compresses intermediate features (IFs) to reduce both communication and server load in split computing.
Across standard vision and LLM workloads, SLICER reduces uplink volume by up to 10x and server GPU time by up to 4.4x.
arXiv Detail & Related papers (2025-11-03T08:44:13Z) - Efficient Low Rank Attention for Long-Context Inference in Large Language Models
Low Rank Query and Key attention (LRQK) is a framework that decomposes the full-precision query and key matrices into compact rank-r factors during the prefill stage.
By selecting only the top-k tokens and a small fixed set of recent tokens, LRQK employs a mixed GPU-CPU cache with a hit-and-miss mechanism that transfers only missing full-precision KV pairs.
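A minimal sketch of the score pass this implies: logits are estimated from the rank-r factors, so full-precision keys and values are fetched only for the selected tokens. The kernel shape and names are assumptions, not LRQK's actual API.

```cuda
// Hypothetical low-rank score pass: approximate attention logits from
// rank-r factors q_r (1 x r) and K_r (n x r). Illustration only.
#include <cuda_runtime.h>

__global__ void lowrank_scores(const float *qr,   // [r] rank-r query factor
                               const float *Kr,   // [n x r] rank-r key factors
                               float *score,      // [n] approximate logits
                               int n, int r) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;
    float s = 0.f;
    for (int t = 0; t < r; ++t) s += qr[t] * Kr[j * r + t];
    score[j] = s;  // cheap proxy for the full-precision logit of token j
}
```

The host would then take the top-k scores plus the fixed recent window and fetch only the missing full-precision KV pairs through the GPU-CPU cache's hit-and-miss path.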
arXiv Detail & Related papers (2025-10-25T11:43:27Z) - LiVOS: Light Video Object Segmentation with Gated Linear Matching
LiVOS is a lightweight memory network that employs linear matching via linear attention.
For longer and higher-resolution videos, it matches STM-based methods while using 53% less GPU memory and supports 4096p inference on a 32GB consumer-grade GPU.
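The constant-memory behavior follows from the generic linear-attention identity sketched below (notation assumed; the gating used by LiVOS is omitted).

```latex
% Generic linear-attention identity behind a constant-size matching state.
% A feature map $\phi$ replaces the softmax:
\[
  \mathrm{Attn}(Q, K, V) \approx
  \frac{\phi(Q) \left( \phi(K)^{\top} V \right)}
       {\phi(Q) \left( \phi(K)^{\top} \mathbf{1} \right)} .
\]
% The state $\phi(K)^{\top} V$ stays d x d regardless of how many frames
% are memorized, instead of growing with the number of stored keys/values.
```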
arXiv Detail & Related papers (2024-11-05T05:36:17Z) - MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
This paper describes the design of Mixed-precision Auto-Regressive LINear kernels (MARLIN).
It shows that batch sizes of up to 16-32 can be supported with close to the maximum (4x) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling, and pipelining.
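As a rough picture of where the near-4x ceiling comes from, the sketch below fuses 4-bit weight dequantization into a GEMV inner loop, so memory traffic is dominated by the packed quantized weights. The packing and per-row scales are generic assumptions, not MARLIN's layout, and MARLIN's asynchronous copies and pipelining are omitted.

```cuda
// Generic mixed-precision GEMV: weights stay packed as int4 codes in
// memory and are expanded to float only in registers. Illustration only.
#include <cuda_runtime.h>

__global__ void int4_gemv(const unsigned *wq,  // [n x k/8] packed int4 codes
                          const float *scale,  // [n] per-row scales
                          const float *x,      // [k] activations
                          float *y, int n, int k) {  // assumes k % 8 == 0
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    float acc = 0.f;
    for (int c = 0; c < k / 8; ++c) {
        unsigned packed = wq[row * (k / 8) + c];  // 8 codes per 32-bit word
        for (int t = 0; t < 8; ++t) {
            int code = (int)((packed >> (4 * t)) & 0xFu);  // unpack one int4
            float w = scale[row] * (float)(code - 8);      // dequantize
            acc += w * x[c * 8 + t];
        }
    }
    y[row] = acc;
}
```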
arXiv Detail & Related papers (2024-08-21T16:10:41Z) - Tensor Slicing and Optimization for Multicore NPUs
This paper proposes a compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO).
Results show that TSO identifies the best tensor slicing that minimizes execution time for a set of CNN models.
arXiv Detail & Related papers (2023-04-06T12:03:03Z) - Efficient Dataset Distillation Using Random Feature Approximation
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
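The speedup comes from replacing exact kernel evaluations with a Monte Carlo estimate; a generic random-feature sketch (notation assumed, bias terms and scaling omitted) is:

```latex
% Generic random-feature approximation of a kernel defined as an
% expectation over a weight prior, as the NNGP kernel of a
% one-hidden-layer network is:
\[
  K(x, x') = \mathbb{E}_{w \sim \mathcal{N}(0, I)}
    \bigl[ \sigma(w^{\top} x) \, \sigma(w^{\top} x') \bigr]
  \approx \frac{1}{m} \sum_{i=1}^{m}
    \sigma(w_i^{\top} x) \, \sigma(w_i^{\top} x'), \qquad
  w_i \sim \mathcal{N}(0, I),
\]
% which turns each kernel evaluation into a dot product of two
% m-dimensional feature maps, cheap enough to run on a single GPU.
```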
arXiv Detail & Related papers (2022-10-21T15:56:13Z) - NumS: Scalable Array Programming for the Cloud
We present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z) - SMASH: Sparse Matrix Atomic Scratchpad Hashing
In this thesis, we introduce a novel SpGEMM kernel implementation based on the row-wise product approach.
We leverage atomic instructions to merge intermediate partial products as they are generated.
Our kernel achieves a 9.4x speedup compared to competing approaches.
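A minimal CUDA sketch of the row-wise product approach with atomic merging is below; for brevity, a dense zero-initialized output row stands in for the thesis's hashed scratchpad, and all names are illustrative.

```cuda
// Hypothetical row-wise-product SpGEMM: for each nonzero A(i,k), scale
// row k of B and merge the partial products into row i of C with atomics.
#include <cuda_runtime.h>

__global__ void spgemm_row_product(const int *a_rowptr, const int *a_colidx,
                                   const float *a_val,
                                   const int *b_rowptr, const int *b_colidx,
                                   const float *b_val,
                                   float *c_dense,  // [m x n], zero-initialized
                                   int n) {
    int i = blockIdx.x;  // one thread block per row of A
    for (int e = a_rowptr[i] + threadIdx.x; e < a_rowptr[i + 1];
         e += blockDim.x) {
        int k = a_colidx[e];
        float a = a_val[e];
        // Merge the partial products a * B(k,:) into C(i,:) as they are
        // generated; atomics resolve collisions between edges of the row.
        for (int f = b_rowptr[k]; f < b_rowptr[k + 1]; ++f)
            atomicAdd(&c_dense[(size_t)i * n + b_colidx[f]], a * b_val[f]);
    }
}
```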
arXiv Detail & Related papers (2021-05-29T00:22:50Z) - FusedMM: A Unified SDDMM-SpMM Kernel for Graph Embedding and Graph Neural Networks
We develop a fused matrix multiplication kernel that unifies sampled dense-dense matrix multiplication and sparse-dense matrix multiplication under a single operation called FusedMM.
By using user-defined functions, FusedMM can capture almost all computational patterns needed by popular graph embedding and GNN approaches.
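In that spirit, the sketch below fuses the SDDMM and SpMM halves in one pass over CSR edges, with the user-defined function supplied as a functor; the kernel shape and names are assumptions for illustration, not FusedMM's API.

```cuda
// Hypothetical fused SDDMM+SpMM pass: per-edge value from endpoint
// feature rows (SDDMM), user-defined op, then scaled aggregation (SpMM),
// without materializing edge values in global memory.
#include <cuda_runtime.h>

struct ReluOp {  // example user-defined function on the edge value
    __device__ float operator()(float x) const { return x > 0.f ? x : 0.f; }
};

template <typename EdgeOp>
__global__ void fused_sddmm_spmm(const int *rowptr, const int *colidx,
                                 const float *X,   // [n x d] node features
                                 float *out,       // [n x d], zero-initialized
                                 int d, EdgeOp op) {
    int i = blockIdx.x;  // one thread block per destination row
    __shared__ float s;  // current edge value, shared across the block
    for (int e = rowptr[i]; e < rowptr[i + 1]; ++e) {
        int j = colidx[e];
        // SDDMM half: dot product of the endpoint feature rows
        // (serialized in thread 0 for brevity; a real kernel reduces in
        // parallel).
        if (threadIdx.x == 0) {
            float acc = 0.f;
            for (int t = 0; t < d; ++t) acc += X[i * d + t] * X[j * d + t];
            s = op(acc);  // user-defined function applied to the edge value
        }
        __syncthreads();
        // SpMM half: accumulate the scaled neighbor row into the output.
        for (int k = threadIdx.x; k < d; k += blockDim.x)
            out[i * d + k] += s * X[j * d + k];
        __syncthreads();
    }
}
```

A launch such as fused_sddmm_spmm<<<n, 128>>>(rowptr, colidx, X, out, d, ReluOp{}) covers one destination row per block, with out assumed zero-initialized.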
arXiv Detail & Related papers (2020-11-07T18:06:57Z)