TiledAttention: a CUDA Tile SDPA Kernel for PyTorch
- URL: http://arxiv.org/abs/2603.01960v1
- Date: Mon, 02 Mar 2026 15:11:00 GMT
- Title: TiledAttention: a CUDA Tile SDPA Kernel for PyTorch
- Authors: Taimur Khan
- Abstract summary: TiledAttention is a scaled dot-product attention (SDPA) forward operator for SDPA research on NVIDIA GPUs. It is easier to modify than low-level templates while retaining realistic behavior via online softmax and tiled $K,V$ streaming. We benchmark TiledAttention on an NVIDIA DGX GB10 node with a reproducible harness and compare against PyTorch SDPA.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: TiledAttention is a scaled dot-product attention (SDPA) forward operator for SDPA research on NVIDIA GPUs. Implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function, it is easier to modify than low-level CUDA templates while retaining realistic behavior via online softmax and tiled $K,V$ streaming. The approach is both performant and directly editable at the schedule level from Python (tile shapes, staging, shared-memory layout), enabling rapid, reproducible kernel research without template-heavy CUDA/CUTLASS rewrites. We benchmark TiledAttention on an NVIDIA DGX GB10 node with a reproducible harness and compare against PyTorch SDPA (auto-dispatch) and explicit unfused baselines across sequence length, head dimension, and precision (FP16/BF16). While production fused baselines remain stronger overall, TiledAttention delivers large speedups over standard eager attention paths and is available for direct use within PyTorch workflows, providing a practical balance between performance and customizability.
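The online-softmax plus tiled $K,V$ streaming pattern the abstract describes can be sketched in a few lines of NumPy. This is an illustrative reference only; the tile size, variable names, and single-query-block simplification are assumptions, not the kernel's actual cuTile schedule:

```python
import numpy as np

def sdpa_online_softmax(q, k, v, tile=128):
    """SDPA forward with online softmax, streaming K/V in tiles.
    Illustrative sketch of the technique, not TiledAttention's code."""
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    m = np.full(q.shape[0], -np.inf)      # running row-wise max of logits
    l = np.zeros(q.shape[0])              # running softmax denominator
    acc = np.zeros_like(q)                # running weighted sum of V rows
    for start in range(0, k.shape[0], tile):
        kt = k[start:start + tile]
        vt = v[start:start + tile]
        s = (q @ kt.T) * scale            # logits for this K/V tile
        m_new = np.maximum(m, s.max(axis=-1))
        p = np.exp(s - m_new[:, None])    # unnormalized tile probabilities
        correction = np.exp(m - m_new)    # rescale stale accumulator state
        l = l * correction + p.sum(axis=-1)
        acc = acc * correction[:, None] + p @ vt
        m = m_new
    return acc / l[:, None]
```

A real kernel keeps `m`, `l`, and `acc` in registers or shared memory per query tile; the rescaling by `exp(m - m_new)` is what makes streaming over $K,V$ tiles numerically safe without materializing the full attention matrix.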
Related papers
- SoftDTW-CUDA-Torch: Memory-Efficient GPU-Accelerated Soft Dynamic Time Warping for PyTorch [11.845589863914851]
We present softdtw-cuda-torch, an open-source PyTorch library for computing Soft Dynamic Time Warping on GPUs. Our implementation addresses three key limitations of existing GPU implementations of Soft-DTW. The library supports arbitrary sequence lengths, full PyTorch autograd integration, and Soft-DTW barycenter computation.
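The dynamic program this library accelerates is the Soft-DTW recursion of Cuturi and Blondel. A minimal CPU reference, shown here only to make the math concrete (the squared-Euclidean cost and 1-D inputs are simplifying assumptions):

```python
import numpy as np

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW distance via the smoothed-min DP recursion.
    CPU reference sketch, not the library's GPU implementation."""
    def softmin(*vals, g=gamma):
        a = -np.array(vals) / g
        mx = a.max()
        return -g * (mx + np.log(np.exp(a - mx).sum()))
    n, m = len(x), len(y)
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2   # squared Euclidean, 1-D case
            R[i, j] = cost + softmin(R[i - 1, j - 1], R[i - 1, j], R[i, j - 1])
    return R[n, m]
```

As `gamma` approaches zero, `softmin` approaches the hard minimum and the value converges to the classic DTW distance, which is why the smoothing parameter controls a differentiability/fidelity trade-off.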
arXiv Detail & Related papers (2026-02-19T09:53:03Z)
- VibeTensor: System Software for Deep Learning, Fully Generated by AI Agents [42.56489784841984]
"Fully generated" refers to code provenance: implementation changes were produced and applied as agent-proposed diffs. We describe the architecture, summarize the workflow used to produce and validate the system, and evaluate the artifact.
arXiv Detail & Related papers (2026-01-21T19:29:00Z)
- AutoSAGE: Input-Aware CUDA Scheduling for Sparse GNN Aggregation (SpMM/SDDMM) and CSR Attention [52.20940151628735]
AutoSAGE is an input-aware scheduler that chooses tiling and mapping per input. On Reddit and OGBN-Products it achieves up to 4.7x kernel-level speedups.
arXiv Detail & Related papers (2025-11-17T18:25:51Z)
- Stroke Lesion Segmentation in Clinical Workflows: A Modular, Lightweight, and Deployment-Ready Tool [0.08699280339422537]
Deep learning frameworks such as nnU-Net achieve state-of-the-art performance in brain lesion segmentation but remain difficult to deploy clinically. We introduce StrokeSeg, a modular and lightweight framework that translates research-grade stroke lesion segmentation models into deployable applications.
arXiv Detail & Related papers (2025-10-28T12:56:48Z)
- pySigLib -- Fast Signature-Based Computations on CPU and GPU [9.126976857662084]
We present pySigLib, a high-performance Python library offering optimised implementations of signatures and signature kernels on CPU and GPU. We introduce a novel differentiation scheme for signature kernels that delivers accurate gradients at a fraction of the runtime of existing libraries.
arXiv Detail & Related papers (2025-09-12T18:00:14Z)
- Online Dense Point Tracking with Streaming Memory [54.22820729477756]
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video. Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one. We present a lightweight and fast model with streaming memory for dense point tracking and online video processing.
arXiv Detail & Related papers (2025-03-09T06:16:49Z)
- Two-stream Beats One-stream: Asymmetric Siamese Network for Efficient Visual Tracking [54.124445709376154]
We propose a novel asymmetric Siamese tracker named AsymTrack for efficient tracking. Building on this architecture, we devise an efficient template modulation mechanism to inject crucial cues into the search features. Experiments demonstrate that AsymTrack offers superior speed-precision trade-offs across different platforms.
arXiv Detail & Related papers (2025-03-01T14:44:54Z) - KernelBench: Can LLMs Write Efficient GPU Kernels? [36.4117525096377]
KernelBench is an open-source framework for evaluating language models' ability to write fast and correct kernels. We introduce a new evaluation metric, fast_p, which measures the percentage of generated kernels that are functionally correct and faster than the baseline by at least a threshold p. Our experiments show that frontier reasoning models perform the best out of the box but still fall short overall.
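Based on the abstract's description, the fast_p metric can be sketched as follows. The result-record field names here are assumptions for illustration, not the benchmark's actual schema:

```python
def fast_p(results, p=1.0):
    """Fraction of generated kernels that are functionally correct AND
    achieve a speedup over the baseline of at least p.
    Sketch of the metric as described; field names are hypothetical."""
    if not results:
        return 0.0
    hits = sum(
        1 for r in results
        if r["correct"] and r["baseline_ms"] / r["kernel_ms"] >= p
    )
    return hits / len(results)
```

Setting `p = 1.0` counts correct kernels that are at least as fast as the baseline; raising `p` tightens the bar toward genuinely faster kernels.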
arXiv Detail & Related papers (2025-02-14T19:30:53Z) - FlexiViT: One Model for All Patch Sizes [100.52574011880571]
Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost.
We show that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes.
arXiv Detail & Related papers (2022-12-15T18:18:38Z) - Stochastic Gradient Descent without Full Data Shuffle [65.97105896033815]
CorgiPile is a hierarchical data shuffling strategy that avoids a full data shuffle while maintaining a convergence rate comparable to SGD with a full shuffle. Our results show that CorgiPile achieves a convergence rate comparable to full-shuffle SGD for both deep learning and generalized linear models.
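The two-level shuffle idea can be sketched as follows: shuffle the order of contiguous blocks, then shuffle tuples inside a small buffer of blocks, so a full-dataset shuffle is never materialized. The parameter names and buffer scheme here are illustrative, not the paper's exact design:

```python
import random

def corgipile_order(n_examples, block_size=4, buffer_blocks=2, seed=0):
    """Hierarchical shuffle in the spirit of CorgiPile (sketch only).
    Level 1 shuffles block order; level 2 shuffles within a buffer."""
    rng = random.Random(seed)
    blocks = [list(range(i, min(i + block_size, n_examples)))
              for i in range(0, n_examples, block_size)]
    rng.shuffle(blocks)                       # level 1: block-level shuffle
    order = []
    for i in range(0, len(blocks), buffer_blocks):
        buffer = [x for b in blocks[i:i + buffer_blocks] for x in b]
        rng.shuffle(buffer)                   # level 2: within-buffer shuffle
        order.extend(buffer)
    return order
```

Only `buffer_blocks * block_size` examples ever need to sit in memory at once, which is the point: near-random access patterns at sequential-I/O cost.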
arXiv Detail & Related papers (2022-06-12T20:04:31Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE). Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
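The core trick behind the FFT-based acceleration above is that a Toeplitz matrix-vector product can be computed in O(n log n) by embedding the matrix in a circulant and multiplying in the frequency domain. A minimal NumPy sketch of that standard technique (not the paper's code):

```python
import numpy as np

def toeplitz_matvec_fft(first_col, first_row, x):
    """Multiply a Toeplitz matrix T by vector x in O(n log n) via
    circulant embedding. T[i, j] = first_col[i - j] for i >= j and
    first_row[j - i] for j > i (first_row[0] must equal first_col[0])."""
    n = len(x)
    # Generator of the (2n-1)-point circulant that embeds T.
    c = np.concatenate([first_col, first_row[1:][::-1]])
    xp = np.concatenate([x, np.zeros(n - 1)])
    # Circular convolution = pointwise product of FFTs.
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(xp)).real
    return y[:n]
```

For kernelized attention with RPE, this replaces an explicit n-by-n matrix multiply with a length-O(n) FFT, which is where the asymptotic savings come from.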
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.