Related papers: FuseSampleAgg: Fused Neighbor Sampling and Aggregation for Mini-batch GNNs

URL: http://arxiv.org/abs/2511.13645v1
Date: Mon, 17 Nov 2025 17:57:18 GMT
Title: FuseSampleAgg: Fused Neighbor Sampling and Aggregation for Mini-batch GNNs
Authors: Aleksandar Stanković
Abstract summary: FuseSampleAgg fuses neighbor sampling and mean aggregation into a single pass for GraphSAGE. The operator is deterministic, integrates with standard PyTorch optimizers, and ships with scripts that reproduce all tables and figures from CSV logs.
Score: 51.56484100374058
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present FuseSampleAgg, a CUDA operator that fuses neighbor sampling and mean aggregation into a single pass for one and two hop GraphSAGE. By eliminating block materialization and extra kernel launches, FuseSampleAgg reduces memory traffic and overhead while preserving GraphSAGE mean semantics via saved index replay. Across the Reddit, ogbn-arxiv, and ogbn-products benchmarks (batch size 1024, automatic mixed precision enabled), we observe step time speedups up to 51x on ogbn-products, about 4x on Reddit with fanouts 10-10 and 15-10, and about 3.3x on ogbn-arxiv at larger fanouts, with peak GPU memory reductions up to 100x, 36x, and about 3.5x, respectively. The operator is deterministic, integrates with standard PyTorch optimizers, and ships with scripts that reproduce all tables and figures from CSV logs. Code and scripts are available at https://github.com/SV25-22/FuseSampleAgg.
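The fusion described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' CUDA operator: the function names (`fused_sample_mean`, `replay_backward`), the CSR inputs (`indptr`, `indices`), and the rule of taking every neighbor when the degree does not exceed the fanout are all assumptions made for illustration. It shows the one-hop case only: sampling and mean aggregation share one loop, no sampled block is built, and the sampled indices are saved so a backward pass can replay them.

```python
import random

def fused_sample_mean(indptr, indices, feats, seeds, fanout, rng_seed=0):
    """One-hop sketch: sample up to `fanout` neighbors per seed node and
    average their features in the same loop, without building a block."""
    rng = random.Random(rng_seed)  # fixed seed keeps the sketch deterministic
    dim = len(feats[0])
    out, saved = [], []
    for v in seeds:
        nbrs = indices[indptr[v]:indptr[v + 1]]  # CSR neighbor slice of v
        # Assumption: take all neighbors when degree <= fanout.
        picked = list(nbrs) if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
        acc = [0.0] * dim
        for u in picked:
            for d in range(dim):
                acc[d] += feats[u][d]
        n = len(picked)
        out.append([a / n for a in acc] if n else acc)  # zeros if isolated
        saved.append(picked)  # saved indices, replayed in the backward pass
    return out, saved

def replay_backward(grad_out, saved, num_nodes):
    """Replay the saved indices to scatter mean gradients back to features."""
    dim = len(grad_out[0])
    grad_feats = [[0.0] * dim for _ in range(num_nodes)]
    for g, picked in zip(grad_out, saved):
        for u in picked:
            for d in range(dim):
                grad_feats[u][d] += g[d] / len(picked)
    return grad_feats

# Toy graph: node 0 -> {1, 2}, node 1 -> {0}, node 2 has no neighbors.
indptr = [0, 2, 3, 3]
indices = [1, 2, 0]
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, saved = fused_sample_mean(indptr, indices, feats, [0, 1, 2], fanout=2)
```

Because the saved indices fully determine which neighbors contributed to each mean, the backward pass needs no second sampling step, which is one plausible reading of "saved index replay" in the abstract.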

Related papers

RFX: High-Performance Random Forests with GPU Acceleration and QLORA Compression [0.0]
RFX (Random Forests X), where X stands for compression or quantization, presents a production-ready implementation of Breiman and Cutler's Random Forest classification methodology in Python. RFX v1.0 provides complete classification: out-of-bag error estimation, overall and local importance measures, proximity matrices with QLORA compression, case-wise analysis, and interactive visualization (rfviz). Regression, unsupervised learning, CLIQUE importance, and RF-GAP proximity are planned for v2.0.
arXiv Detail & Related papers (2025-11-23T12:00:33Z)
AutoSAGE: Input-Aware CUDA Scheduling for Sparse GNN Aggregation (SpMM/SDDMM) and CSR Attention [52.20940151628735]
AutoSAGE is an input-aware scheduler that chooses tiling and mapping per input. On Reddit and OGBN-Products it achieves up to 4.7x kernel-level speedups.
arXiv Detail & Related papers (2025-11-17T18:25:51Z)
CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference [4.832840038837715]
CSV-Decode is a novel approach that uses geometric upper bounds to construct small sub-vocabularies for each decoding step. Our method clusters vocabulary embeddings offline and uses centroid-plus-radius bounds to identify which tokens can be safely omitted from vocabularies.
arXiv Detail & Related papers (2025-11-16T14:02:41Z)
Auto-scaling Continuous Memory for GUI Agent [35.84598737971337]
Prior GUI agents compress past trajectories into text tokens, which balloons context length and misses decisive visual cues. We propose a continuous memory that encodes each GUI trajectory into a fixed-length sequence of continuous embeddings. As memory size and retrieval depth increase, performance improves monotonically, unlike text memories that degrade with long prompts.
arXiv Detail & Related papers (2025-10-10T06:16:45Z)
STAT: Shrinking Transformers After Training [72.0726371426711]
We present STAT, a simple algorithm to prune transformer models without any fine-tuning. STAT eliminates both attention heads and neurons from the network, while preserving accuracy by calculating a correction to the weights of the next layer. Our entire algorithm takes minutes to compress BERT, and less than three hours to compress models with 7B parameters using a single GPU.
arXiv Detail & Related papers (2024-05-29T22:59:11Z)
Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [23.633481089469836]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. We propose a novel parallel prompt decoding that requires only $0.0002$% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Our approach demonstrates up to 2.49$\times$ speedup and maintains a minimal memory overhead of just $0.0004$%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z)
Distributed Matrix-Based Sampling for Graph Neural Network Training [0.0]
We propose a matrix-based bulk sampling approach that expresses sampling as a sparse matrix multiplication (SpGEMM) and samples multiple minibatches at once. When the input graph topology does not fit on a single device, our method distributes the graph and uses communication-avoiding SpGEMM algorithms to scale GNN minibatch sampling. In addition to new methods for sampling, we introduce a pipeline that uses our matrix-based bulk sampling approach to provide end-to-end training results.
arXiv Detail & Related papers (2023-11-06T06:40:43Z)
Towards Memory-Efficient Training for Extremely Large Output Spaces -- Learning with 500k Labels on a Single Commodity GPU [2.3224617218247134]
In classification problems with large output spaces (up to millions of labels), the last layer can require an enormous amount of memory. Using sparse connectivity would drastically reduce the memory requirements, but it can result in much diminished predictive performance of the model. We show that a proposed approach can scale to datasets with 670,000 labels on a single GPU with only 4GB memory.
arXiv Detail & Related papers (2023-06-06T14:44:52Z)
NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems. This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS). We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
Fine-Grained Scene Graph Generation with Data Transfer [127.17675443137064]
Scene graph generation (SGG) aims to extract (subject, predicate, object) triplets in images. Recent works have made steady progress on SGG, and provide useful tools for high-level vision and language understanding. We propose a novel Internal and External Data Transfer (IETrans) method, which can be applied in a play-and-plug fashion and expanded to large SGG with 1,807 predicate classes.
arXiv Detail & Related papers (2022-03-22T12:26:56Z)
Learning Tracking Representations via Dual-Branch Fully Transformer Networks [82.21771581817937]
We present a Siamese-like dual-branch network based solely on Transformers for tracking. We extract a feature vector for each patch based on its matching results with others within an attention window. The method achieves results better than or comparable to the best-performing methods.
arXiv Detail & Related papers (2021-12-05T13:44:33Z)
Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement. We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment. We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler. We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
arXiv Detail & Related papers (2021-10-16T02:41:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.