AutoSAGE: Input-Aware CUDA Scheduling for Sparse GNN Aggregation (SpMM/SDDMM) and CSR Attention
- URL: http://arxiv.org/abs/2511.17594v1
- Date: Mon, 17 Nov 2025 18:25:51 GMT
- Title: AutoSAGE: Input-Aware CUDA Scheduling for Sparse GNN Aggregation (SpMM/SDDMM) and CSR Attention
- Authors: Aleksandar Stankovic
- Abstract summary: AutoSAGE is an input-aware scheduler that chooses tiling and mapping per input. On Reddit and OGBN-Products it matches vendor baselines at bandwidth-bound feature widths; on synthetic sparsity and skew stress tests it achieves up to 4.7x kernel-level speedups.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparse GNN aggregations (CSR SpMM/SDDMM) vary widely in performance with degree skew, feature width, and GPU micro-architecture. We present AutoSAGE, an input-aware CUDA scheduler that chooses tiling and mapping per input using a lightweight estimate refined by on-device micro-probes, with a guardrail that safely falls back to vendor kernels and a persistent cache for deterministic replay. AutoSAGE covers SpMM and SDDMM and composes into a CSR attention pipeline (SDDMM -> row-softmax -> SpMM). On Reddit and OGBN-Products, it matches vendor baselines at bandwidth-bound feature widths and finds gains at small widths; on synthetic sparsity and skew stress tests it achieves up to 4.7x kernel-level speedups. We release CUDA sources, Python bindings, a reproducible harness, and replayable cache logs.
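To make the composed pipeline concrete, here is a minimal CUDA sketch of one CSR attention step (SDDMM -> row-softmax -> SpMM) written as a single row-parallel kernel. It is an illustration under assumed layouts (row-major Q/K/V, per-row degree at most MAX_DEG), not the released AutoSAGE kernels, whose tiling and mapping are chosen per input.

```cuda
// Hypothetical fused CSR attention step: SDDMM -> row-softmax -> SpMM.
// One thread block per CSR row; illustration only, not the AutoSAGE code.
#include <cuda_runtime.h>
#include <math.h>

#define MAX_DEG 1024  // assumed cap on row degree for the shared buffer

__global__ void csr_attention_row(const int *rowptr, const int *colidx,
                                  const float *Q, const float *K,
                                  const float *V, float *out, int d) {
    int row = blockIdx.x;                  // one thread block per CSR row
    int beg = rowptr[row], end = rowptr[row + 1];
    int deg = end - beg;                   // assumes deg <= MAX_DEG
    __shared__ float score[MAX_DEG];

    // SDDMM: one attention logit per neighbor, threads strided over edges.
    for (int e = threadIdx.x; e < deg; e += blockDim.x) {
        int j = colidx[beg + e];
        float s = 0.f;
        for (int t = 0; t < d; ++t) s += Q[row * d + t] * K[j * d + t];
        score[e] = s;
    }
    __syncthreads();

    // Row softmax, serialized in thread 0 for clarity (a real kernel
    // would use a parallel reduction).
    if (threadIdx.x == 0) {
        float m = -INFINITY, z = 0.f;
        for (int e = 0; e < deg; ++e) m = fmaxf(m, score[e]);
        for (int e = 0; e < deg; ++e) { score[e] = expf(score[e] - m); z += score[e]; }
        for (int e = 0; e < deg; ++e) score[e] /= z;
    }
    __syncthreads();

    // SpMM: weighted aggregation of neighbor value rows, threads over d.
    for (int t = threadIdx.x; t < d; t += blockDim.x) {
        float acc = 0.f;
        for (int e = 0; e < deg; ++e) acc += score[e] * V[colidx[beg + e] * d + t];
        out[row * d + t] = acc;
    }
}
```

A scheduler in the spirit of the abstract would pick among variants of such a kernel (warp-per-row vs. block-per-row, different feature tiles) using a lightweight cost estimate refined by on-device micro-probes, and fall back to the vendor kernel when probing shows no gain.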
Related papers
- Beyond GEMM-Centric NPUs: Enabling Efficient Diffusion LLM Sampling
Diffusion Large Language Models (dLLMs) introduce iterative denoising to enable parallel token generation.
Our design employs lightweight non-GEMM vector primitives, in-place memory reuse strategies, and a decoupled mixed-precision memory hierarchy.
arXiv Detail & Related papers (2026-01-28T15:37:50Z) - PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs
Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames.
We present Phase Aggregated Smoothing (PAS), a training-free mechanism that applies small opposed phase offsets across heads and then aggregates their outputs.
Our analysis shows that the RoPE-rotated logit can be approximated as a content dot product scaled by a time kernel; smoothing this kernel yields Lipschitz stability of attention to small temporal shifts, and multi-phase averaging attenuates high-frequency ripples while preserving per-head spectra under Nyquist-valid sampling.
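Read literally, the summary suggests the following shape for the analysis; the notation below is assumed for illustration and is not taken from the paper.

```latex
% Sketch of the claimed decomposition (notation assumed for illustration).
% The RoPE-rotated logit between a query at time t and a key at time s is
% approximated as a content term scaled by a time kernel:
\[
  a(\Delta t) \approx \langle q, k \rangle \, \kappa(\Delta t),
  \qquad \Delta t = t - s .
\]
% Aggregating heads with small opposed phase offsets $\pm\delta_h$ acts as
% a low-pass filter on $\kappa$, attenuating high-frequency ripples while
% keeping the slowly varying envelope:
\[
  \bar{a}(\Delta t) = \frac{1}{2H} \sum_{h=1}^{H}
    \bigl( a(\Delta t + \delta_h) + a(\Delta t - \delta_h) \bigr)
  \approx \langle q, k \rangle \, \bar{\kappa}(\Delta t).
\]
```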
arXiv Detail & Related papers (2025-11-14T05:56:47Z) - Why Should the Server Do It All?: A Scalable, Versatile, and Model-Agnostic Framework for Server-Light DNN Inference over Massively Distributed Clients via Training-Free Intermediate Feature Compression
We introduce SLICER, a retraining-free, architecture-agnostic framework that compresses intermediate features (IFs) to reduce both communication and server load in split computing.
Across standard vision and LLM workloads, SLICER reduces uplink volume by up to 10x and server GPU time by up to 4.4x.
arXiv Detail & Related papers (2025-11-03T08:44:13Z) - Efficient Low Rank Attention for Long-Context Inference in Large Language Models
Low Rank Query and Key attention (LRQK) is a framework that decomposes the full-precision query and key matrices into compact rank-r factors during the prefill stage.
By selecting only the top-k tokens and a small fixed set of recent tokens, LRQK employs a mixed GPU-CPU cache with a hit-and-miss mechanism that transfers only missing full-precision KV pairs.
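A minimal sketch of the score pass this implies: logits are estimated from the rank-r factors, so full-precision keys and values are fetched only for the selected tokens. The kernel shape and names are assumptions, not LRQK's actual API.

```cuda
// Hypothetical low-rank score pass: approximate attention logits from
// rank-r factors q_r (1 x r) and K_r (n x r). Illustration only.
#include <cuda_runtime.h>

__global__ void lowrank_scores(const float *qr,   // [r] rank-r query factor
                               const float *Kr,   // [n x r] rank-r key factors
                               float *score,      // [n] approximate logits
                               int n, int r) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;
    float s = 0.f;
    for (int t = 0; t < r; ++t) s += qr[t] * Kr[j * r + t];
    score[j] = s;  // cheap proxy for the full-precision logit of token j
}
```

The host would then take the top-k scores plus the fixed recent window and fetch only the missing full-precision KV pairs through the GPU-CPU cache's hit-and-miss path.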
arXiv Detail & Related papers (2025-10-25T11:43:27Z) - LiVOS: Light Video Object Segmentation with Gated Linear Matching
LiVOS is a lightweight memory network that employs linear matching via linear attention.
For longer and higher-resolution videos, it matches STM-based methods while using 53% less GPU memory and supports 4096p inference on a 32GB consumer-grade GPU.
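The constant-memory behavior follows from the generic linear-attention identity sketched below (notation assumed; the gating used by LiVOS is omitted).

```latex
% Generic linear-attention identity behind a constant-size matching state.
% A feature map $\phi$ replaces the softmax:
\[
  \mathrm{Attn}(Q, K, V) \approx
  \frac{\phi(Q) \left( \phi(K)^{\top} V \right)}
       {\phi(Q) \left( \phi(K)^{\top} \mathbf{1} \right)} .
\]
% The state $\phi(K)^{\top} V$ stays d x d regardless of how many frames
% are memorized, instead of growing with the number of stored keys/values.
```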
arXiv Detail & Related papers (2024-11-05T05:36:17Z) - MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
This paper describes the design of Mixed-precision Auto-Regressive LINear kernels (MARLIN).
It shows that batch sizes of up to 16-32 can be supported with close to the maximum (4x) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling, and pipelining.
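As a rough picture of where the near-4x ceiling comes from, the sketch below fuses 4-bit weight dequantization into a GEMV inner loop, so memory traffic is dominated by the packed quantized weights. The packing and per-row scales are generic assumptions, not MARLIN's layout, and MARLIN's asynchronous copies and pipelining are omitted.

```cuda
// Generic mixed-precision GEMV: weights stay packed as int4 codes in
// memory and are expanded to float only in registers. Illustration only.
#include <cuda_runtime.h>

__global__ void int4_gemv(const unsigned *wq,  // [n x k/8] packed int4 codes
                          const float *scale,  // [n] per-row scales
                          const float *x,      // [k] activations
                          float *y, int n, int k) {  // assumes k % 8 == 0
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    float acc = 0.f;
    for (int c = 0; c < k / 8; ++c) {
        unsigned packed = wq[row * (k / 8) + c];  // 8 codes per 32-bit word
        for (int t = 0; t < 8; ++t) {
            int code = (int)((packed >> (4 * t)) & 0xFu);  // unpack one int4
            float w = scale[row] * (float)(code - 8);      // dequantize
            acc += w * x[c * 8 + t];
        }
    }
    y[row] = acc;
}
```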
arXiv Detail & Related papers (2024-08-21T16:10:41Z) - Tensor Slicing and Optimization for Multicore NPUs
This paper proposes a compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO).
Results show that TSO identifies the best tensor slicing that minimizes execution time for a set of CNN models.
arXiv Detail & Related papers (2023-04-06T12:03:03Z) - Efficient Dataset Distillation Using Random Feature Approximation
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
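The speedup comes from replacing exact kernel evaluations with a Monte Carlo estimate; a generic random-feature sketch (notation assumed, bias terms and scaling omitted) is:

```latex
% Generic random-feature approximation of a kernel defined as an
% expectation over a weight prior, as the NNGP kernel of a
% one-hidden-layer network is:
\[
  K(x, x') = \mathbb{E}_{w \sim \mathcal{N}(0, I)}
    \bigl[ \sigma(w^{\top} x) \, \sigma(w^{\top} x') \bigr]
  \approx \frac{1}{m} \sum_{i=1}^{m}
    \sigma(w_i^{\top} x) \, \sigma(w_i^{\top} x'), \qquad
  w_i \sim \mathcal{N}(0, I),
\]
% which turns each kernel evaluation into a dot product of two
% m-dimensional feature maps, cheap enough to run on a single GPU.
```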
arXiv Detail & Related papers (2022-10-21T15:56:13Z) - NumS: Scalable Array Programming for the Cloud
We present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z) - SMASH: Sparse Matrix Atomic Scratchpad Hashing
In this thesis, we introduce a novel SpGEMM kernel implementation based on the row-wise product approach.
We leverage atomic instructions to merge intermediate partial products as they are generated.
Our kernel achieves a 9.4x speedup compared to competing approaches.
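A minimal CUDA sketch of the row-wise product approach with atomic merging is below; for brevity, a dense zero-initialized output row stands in for the thesis's hashed scratchpad, and all names are illustrative.

```cuda
// Hypothetical row-wise-product SpGEMM: for each nonzero A(i,k), scale
// row k of B and merge the partial products into row i of C with atomics.
#include <cuda_runtime.h>

__global__ void spgemm_row_product(const int *a_rowptr, const int *a_colidx,
                                   const float *a_val,
                                   const int *b_rowptr, const int *b_colidx,
                                   const float *b_val,
                                   float *c_dense,  // [m x n], zero-initialized
                                   int n) {
    int i = blockIdx.x;  // one thread block per row of A
    for (int e = a_rowptr[i] + threadIdx.x; e < a_rowptr[i + 1];
         e += blockDim.x) {
        int k = a_colidx[e];
        float a = a_val[e];
        // Merge the partial products a * B(k,:) into C(i,:) as they are
        // generated; atomics resolve collisions between edges of the row.
        for (int f = b_rowptr[k]; f < b_rowptr[k + 1]; ++f)
            atomicAdd(&c_dense[(size_t)i * n + b_colidx[f]], a * b_val[f]);
    }
}
```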
arXiv Detail & Related papers (2021-05-29T00:22:50Z) - FusedMM: A Unified SDDMM-SpMM Kernel for Graph Embedding and Graph Neural Networks
We develop a fused matrix multiplication kernel that unifies sampled dense-dense matrix multiplication and sparse-dense matrix multiplication under a single operation called FusedMM.
By using user-defined functions, FusedMM can capture almost all computational patterns needed by popular graph embedding and GNN approaches.
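In that spirit, the sketch below fuses the SDDMM and SpMM halves in one pass over CSR edges, with the user-defined function supplied as a functor; the kernel shape and names are assumptions for illustration, not FusedMM's API.

```cuda
// Hypothetical fused SDDMM+SpMM pass: per-edge value from endpoint
// feature rows (SDDMM), user-defined op, then scaled aggregation (SpMM),
// without materializing edge values in global memory.
#include <cuda_runtime.h>

struct ReluOp {  // example user-defined function on the edge value
    __device__ float operator()(float x) const { return x > 0.f ? x : 0.f; }
};

template <typename EdgeOp>
__global__ void fused_sddmm_spmm(const int *rowptr, const int *colidx,
                                 const float *X,   // [n x d] node features
                                 float *out,       // [n x d], zero-initialized
                                 int d, EdgeOp op) {
    int i = blockIdx.x;  // one thread block per destination row
    __shared__ float s;  // current edge value, shared across the block
    for (int e = rowptr[i]; e < rowptr[i + 1]; ++e) {
        int j = colidx[e];
        // SDDMM half: dot product of the endpoint feature rows
        // (serialized in thread 0 for brevity; a real kernel reduces in
        // parallel).
        if (threadIdx.x == 0) {
            float acc = 0.f;
            for (int t = 0; t < d; ++t) acc += X[i * d + t] * X[j * d + t];
            s = op(acc);  // user-defined function applied to the edge value
        }
        __syncthreads();
        // SpMM half: accumulate the scaled neighbor row into the output.
        for (int k = threadIdx.x; k < d; k += blockDim.x)
            out[i * d + k] += s * X[j * d + k];
        __syncthreads();
    }
}
```

A launch such as fused_sddmm_spmm<<<n, 128>>>(rowptr, colidx, X, out, d, ReluOp{}) covers one destination row per block, with out assumed zero-initialized.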
arXiv Detail & Related papers (2020-11-07T18:06:57Z)