TiledAttention: a CUDA Tile SDPA Kernel for PyTorch
- URL: http://arxiv.org/abs/2603.01960v1
- Date: Mon, 02 Mar 2026 15:11:00 GMT
- Title: TiledAttention: a CUDA Tile SDPA Kernel for PyTorch
- Authors: Taimur Khan
- Abstract summary: TiledAttention is a scaled dot-product attention (SDPA) forward operator for SDPA research on NVIDIA GPUs. It is easier to modify than low-level templates while retaining realistic behavior via online softmax and tiled $K,V$ streaming. We benchmark TiledAttention on an NVIDIA DGX GB10 node with a reproducible harness and compare against PyTorch SDPA.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: TiledAttention is a scaled dot-product attention (SDPA) forward operator for SDPA research on NVIDIA GPUs. Implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function, it is easier to modify than low-level CUDA templates while retaining realistic behavior via online softmax and tiled $K,V$ streaming. The approach is both performant and directly editable at the schedule level from Python (tile shapes, staging, shared-memory layout), enabling rapid, reproducible kernel research without template-heavy CUDA/CUTLASS rewrites. We benchmark TiledAttention on an NVIDIA DGX GB10 node with a reproducible harness and compare against PyTorch SDPA (auto-dispatch) and explicit unfused baselines across sequence length, head dimension, and precision (FP16/BF16). While production fused baselines remain stronger overall, TiledAttention delivers large speedups over standard eager attention paths and is available for direct use within PyTorch workflows, providing a practical balance between performance and customizability.
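The online-softmax plus tiled $K,V$ streaming pattern the abstract describes can be sketched in a few lines of NumPy. This is an illustrative reference only; the tile size, variable names, and single-query-block simplification are assumptions, not the kernel's actual cuTile schedule:

```python
import numpy as np

def sdpa_online_softmax(q, k, v, tile=128):
    """SDPA forward with online softmax, streaming K/V in tiles.
    Illustrative sketch of the technique, not TiledAttention's code."""
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    m = np.full(q.shape[0], -np.inf)      # running row-wise max of logits
    l = np.zeros(q.shape[0])              # running softmax denominator
    acc = np.zeros_like(q)                # running weighted sum of V rows
    for start in range(0, k.shape[0], tile):
        kt = k[start:start + tile]
        vt = v[start:start + tile]
        s = (q @ kt.T) * scale            # logits for this K/V tile
        m_new = np.maximum(m, s.max(axis=-1))
        p = np.exp(s - m_new[:, None])    # unnormalized tile probabilities
        correction = np.exp(m - m_new)    # rescale stale accumulator state
        l = l * correction + p.sum(axis=-1)
        acc = acc * correction[:, None] + p @ vt
        m = m_new
    return acc / l[:, None]
```

A real kernel keeps `m`, `l`, and `acc` in registers or shared memory per query tile; the rescaling by `exp(m - m_new)` is what makes streaming over $K,V$ tiles numerically safe without materializing the full attention matrix.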
Related papers
- SoftDTW-CUDA-Torch: Memory-Efficient GPU-Accelerated Soft Dynamic Time Warping for PyTorch [11.845589863914851]
We present softdtw-cuda-torch, an open-source PyTorch library for computing Soft Dynamic Time Warping on GPUs. Our implementation addresses three key limitations of existing GPU implementations of Soft-DTW. The library supports arbitrary sequence lengths, full PyTorch autograd integration, and Soft-DTW barycenter computation.
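The dynamic program this library accelerates is the Soft-DTW recursion of Cuturi and Blondel. A minimal CPU reference, shown here only to make the math concrete (the squared-Euclidean cost and 1-D inputs are simplifying assumptions):

```python
import numpy as np

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW distance via the smoothed-min DP recursion.
    CPU reference sketch, not the library's GPU implementation."""
    def softmin(*vals, g=gamma):
        a = -np.array(vals) / g
        mx = a.max()
        return -g * (mx + np.log(np.exp(a - mx).sum()))
    n, m = len(x), len(y)
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2   # squared Euclidean, 1-D case
            R[i, j] = cost + softmin(R[i - 1, j - 1], R[i - 1, j], R[i, j - 1])
    return R[n, m]
```

As `gamma` approaches zero, `softmin` approaches the hard minimum and the value converges to the classic DTW distance, which is why the smoothing parameter controls a differentiability/fidelity trade-off.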
arXiv Detail & Related papers (2026-02-19T09:53:03Z)
- VibeTensor: System Software for Deep Learning, Fully Generated by AI Agents [42.56489784841984]
"Fully generated" refers to code provenance: implementation changes were produced and applied as agent-proposed diffs. We describe the architecture, summarize the workflow used to produce and validate the system, and evaluate the artifact.
arXiv Detail & Related papers (2026-01-21T19:29:00Z)
- AutoSAGE: Input-Aware CUDA Scheduling for Sparse GNN Aggregation (SpMM/SDDMM) and CSR Attention [52.20940151628735]
AutoSAGE is an input-aware scheduler that chooses tiling and mapping per input. On Reddit and OGBN-Products it achieves up to 4.7x kernel-level speedups.
arXiv Detail & Related papers (2025-11-17T18:25:51Z)
- Stroke Lesion Segmentation in Clinical Workflows: A Modular, Lightweight, and Deployment-Ready Tool [0.08699280339422537]
Deep learning frameworks such as nnU-Net achieve state-of-the-art performance in brain lesion segmentation but remain difficult to deploy clinically. We introduce StrokeSeg, a modular and lightweight framework that translates research-grade stroke lesion segmentation models into deployable applications.
arXiv Detail & Related papers (2025-10-28T12:56:48Z)
- pySigLib -- Fast Signature-Based Computations on CPU and GPU [9.126976857662084]
We present pySigLib, a high-performance Python library offering optimised implementations of signatures and signature kernels on CPU and GPU. We introduce a novel differentiation scheme for signature kernels that delivers accurate gradients at a fraction of the runtime of existing libraries.
arXiv Detail & Related papers (2025-09-12T18:00:14Z)
- Online Dense Point Tracking with Streaming Memory [54.22820729477756]
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video. Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one. We present a lightweight and fast model with streaming memory for dense point tracking and online video processing.
arXiv Detail & Related papers (2025-03-09T06:16:49Z)
- Two-stream Beats One-stream: Asymmetric Siamese Network for Efficient Visual Tracking [54.124445709376154]
We propose a novel asymmetric Siamese tracker named AsymTrack for efficient tracking. Building on this architecture, we devise an efficient template modulation mechanism to inject crucial cues into the search features. Experiments demonstrate that AsymTrack offers superior speed-precision trade-offs across different platforms.
arXiv Detail & Related papers (2025-03-01T14:44:54Z) - KernelBench: Can LLMs Write Efficient GPU Kernels? [36.4117525096377]
KernelBench is an open-source framework for evaluating language models' ability to write fast and correct kernels. We introduce a new evaluation metric, fast_p, which measures the percentage of generated kernels that are functionally correct and faster than the baseline by at least a threshold p. Our experiments show that frontier reasoning models perform the best out of the box but still fall short overall.
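Based on the abstract's description, the fast_p metric can be sketched as follows. The result-record field names here are assumptions for illustration, not the benchmark's actual schema:

```python
def fast_p(results, p=1.0):
    """Fraction of generated kernels that are functionally correct AND
    achieve a speedup over the baseline of at least p.
    Sketch of the metric as described; field names are hypothetical."""
    if not results:
        return 0.0
    hits = sum(
        1 for r in results
        if r["correct"] and r["baseline_ms"] / r["kernel_ms"] >= p
    )
    return hits / len(results)
```

Setting `p = 1.0` counts correct kernels that are at least as fast as the baseline; raising `p` tightens the bar toward genuinely faster kernels.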
arXiv Detail & Related papers (2025-02-14T19:30:53Z) - FlexiViT: One Model for All Patch Sizes [100.52574011880571]
Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost.
We show that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes.
arXiv Detail & Related papers (2022-12-15T18:18:38Z) - Stochastic Gradient Descent without Full Data Shuffle [65.97105896033815]
CorgiPile is a hierarchical data shuffling strategy that avoids a full data shuffle while maintaining a convergence rate comparable to SGD with a full shuffle. Our results show that CorgiPile achieves a convergence rate comparable to full-shuffle SGD for both deep learning and generalized linear models.
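The two-level shuffle idea can be sketched as follows: shuffle the order of contiguous blocks, then shuffle tuples inside a small buffer of blocks, so a full-dataset shuffle is never materialized. The parameter names and buffer scheme here are illustrative, not the paper's exact design:

```python
import random

def corgipile_order(n_examples, block_size=4, buffer_blocks=2, seed=0):
    """Hierarchical shuffle in the spirit of CorgiPile (sketch only).
    Level 1 shuffles block order; level 2 shuffles within a buffer."""
    rng = random.Random(seed)
    blocks = [list(range(i, min(i + block_size, n_examples)))
              for i in range(0, n_examples, block_size)]
    rng.shuffle(blocks)                       # level 1: block-level shuffle
    order = []
    for i in range(0, len(blocks), buffer_blocks):
        buffer = [x for b in blocks[i:i + buffer_blocks] for x in b]
        rng.shuffle(buffer)                   # level 2: within-buffer shuffle
        order.extend(buffer)
    return order
```

Only `buffer_blocks * block_size` examples ever need to sit in memory at once, which is the point: near-random access patterns at sequential-I/O cost.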
arXiv Detail & Related papers (2022-06-12T20:04:31Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE). Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
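The core trick behind the FFT-based acceleration above is that a Toeplitz matrix-vector product can be computed in O(n log n) by embedding the matrix in a circulant and multiplying in the frequency domain. A minimal NumPy sketch of that standard technique (not the paper's code):

```python
import numpy as np

def toeplitz_matvec_fft(first_col, first_row, x):
    """Multiply a Toeplitz matrix T by vector x in O(n log n) via
    circulant embedding. T[i, j] = first_col[i - j] for i >= j and
    first_row[j - i] for j > i (first_row[0] must equal first_col[0])."""
    n = len(x)
    # Generator of the (2n-1)-point circulant that embeds T.
    c = np.concatenate([first_col, first_row[1:][::-1]])
    xp = np.concatenate([x, np.zeros(n - 1)])
    # Circular convolution = pointwise product of FFTs.
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(xp)).real
    return y[:n]
```

For kernelized attention with RPE, this replaces an explicit n-by-n matrix multiply with a length-O(n) FFT, which is where the asymptotic savings come from.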
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.