SoftDTW-CUDA-Torch: Memory-Efficient GPU-Accelerated Soft Dynamic Time Warping for PyTorch
- URL: http://arxiv.org/abs/2602.17206v1
- Date: Thu, 19 Feb 2026 09:53:03 GMT
- Title: SoftDTW-CUDA-Torch: Memory-Efficient GPU-Accelerated Soft Dynamic Time Warping for PyTorch
- Authors: Ron Shapira Weber, Oren Freifeld
- Abstract summary: We present softdtw-cuda-torch, an open-source PyTorch library for computing Soft Dynamic Time Warping on GPUs. Our implementation addresses three key limitations of existing GPU implementations of SoftDTW. The library supports arbitrary sequence lengths, full PyTorch autograd integration, and Soft-DTW Barycenter.
- Score: 11.845589863914851
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present softdtw-cuda-torch, an open-source PyTorch library for computing Soft Dynamic Time Warping (SoftDTW) on GPUs. Our implementation addresses three key limitations of existing GPU implementations of SoftDTW: a hard sequence-length cap of 1024, numerical instability in the backward pass for small smoothing parameters, and excessive GPU memory consumption from materializing pairwise distance tensors. We introduce (1) tiled anti-diagonal kernel execution that removes the sequence-length constraint, (2) a log-space backward pass that prevents floating-point overflow, and (3) a fused distance-computation mode that eliminates the O(BNM) intermediate distance tensor, achieving up to 98% memory reduction compared to prior work. The library supports arbitrary sequence lengths, full PyTorch autograd integration, and Soft-DTW Barycenter computation. Code is available at https://github.com/BGU-CS-VIL/sdtw-cuda-torch.
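Since the abstract names its techniques concretely, the following minimal pure-PyTorch sketch may help illustrate the log-space softmin idea. It is not the library's CUDA kernel or public API; the function and argument names (softmin_logspace, soft_dtw_logspace, gamma) are illustrative assumptions, and the sequential double loop stands in for the tiled anti-diagonal kernels described in the abstract.

```python
# Minimal pure-PyTorch sketch of the SoftDTW recurrence with a log-space
# softmin, illustrating the numerical-stability idea from the abstract.
# Not the library's CUDA kernel; names here are illustrative assumptions.
import torch


def softmin_logspace(a, b, c, gamma):
    # softmin_gamma(x_1..x_3) = -gamma * log(sum_i exp(-x_i / gamma)).
    # Routing the computation through torch.logsumexp keeps intermediate
    # exponentials bounded, which is where naive exp() overflows for small gamma.
    stacked = torch.stack([a, b, c]) / -gamma
    return -gamma * torch.logsumexp(stacked, dim=0)


def soft_dtw_logspace(x, y, gamma=0.1):
    """x: (N, d), y: (M, d) -> scalar SoftDTW discrepancy."""
    n, m = x.shape[0], y.shape[0]
    # Pairwise squared-Euclidean costs. The paper's fused mode avoids ever
    # materializing this N x M tensor on the GPU; it is kept here for clarity.
    dist = torch.cdist(x, y) ** 2
    r = torch.full((n + 1, m + 1), float("inf"), dtype=x.dtype)
    r[0, 0] = 0.0
    # Sequential reference loop. The CUDA kernels instead sweep the matrix
    # along anti-diagonals, whose cells are mutually independent and can be
    # processed as parallel tiles.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = dist[i - 1, j - 1] + softmin_logspace(
                r[i - 1, j], r[i, j - 1], r[i - 1, j - 1], gamma
            )
    return r[n, m]


# Small usage example with a deliberately small smoothing parameter.
x, y = torch.randn(16, 2), torch.randn(20, 2)
print(soft_dtw_logspace(x, y, gamma=0.01))
```

The anti-diagonal parallelism rests on the fact that every cell r[i, j] with i + j constant depends only on cells from the two previous anti-diagonals, so each diagonal can be processed in parallel and, with tiling, is not limited to a fixed maximum length.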
Related papers
- TiledAttention: a CUDA Tile SDPA Kernel for PyTorch [0.0]
TiledAttention is a scaled dot-product attention (SDPA) forward operator for SDPA research on NVIDIA GPUs. It is easier to modify than low-level templates while retaining realistic behavior via online softmax and tiled $K,V$ streaming. We benchmark TiledAttention on an NVIDIA DGX GB10 node with a reproducible harness and compare against PyTorch SDPA.
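As a side note on the mechanism mentioned above: "online softmax" refers to accumulating a softmax-weighted sum tile by tile while tracking a running maximum and normalizer, so the full row of attention scores is never materialized. The sketch below is a generic illustration of that trick, not TiledAttention's kernel or API; all names and shapes are assumptions.

```python
# Generic illustration of the online (streaming) softmax accumulation used by
# tiled attention kernels; not TiledAttention's actual code or interface.
import torch


def online_softmax_attention(q, K, V, tile=128):
    """q: (d,), K: (L, d), V: (L, dv). Returns softmax(K @ q) @ V, computed tile by tile."""
    m = torch.tensor(float("-inf"))   # running maximum of the scores seen so far
    s = torch.tensor(0.0)             # running sum of exp(score - m)
    acc = torch.zeros(V.shape[1])     # running softmax-weighted sum of V rows
    for start in range(0, K.shape[0], tile):
        k_tile, v_tile = K[start:start + tile], V[start:start + tile]
        scores = k_tile @ q                      # (tile,)
        m_new = torch.maximum(m, scores.max())
        correction = torch.exp(m - m_new)        # rescale old accumulators to the new max
        p = torch.exp(scores - m_new)
        s = s * correction + p.sum()
        acc = acc * correction + p @ v_tile
        m = m_new
    return acc / s


# Agrees with the dense reference up to floating-point error.
q, K, V = torch.randn(64), torch.randn(1000, 64), torch.randn(1000, 32)
dense = torch.softmax(K @ q, dim=0) @ V
print(torch.max(torch.abs(online_softmax_attention(q, K, V) - dense)))
```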
arXiv Detail & Related papers (2026-03-02T15:11:00Z)
- Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention [63.69228529380251]
Spava is a sequence-parallel framework with optimized attention for long-video inference. Spava delivers speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss.
arXiv Detail & Related papers (2026-01-29T09:23:13Z)
- torch-sla: Differentiable Sparse Linear Algebra with Adjoint Solvers and Sparse Tensor Parallelism for PyTorch [0.2960141730774496]
We present torch-sla, an open-source PyTorch library that enables GPU-accelerated, scalable, and differentiable sparse linear algebra. torch-sla supports multiple backends (SciPy, cuDSS, PyTorch-native) and seamlessly integrates with PyTorch autograd for end-to-end differentiable simulations.
arXiv Detail & Related papers (2026-01-20T14:06:01Z)
- Memory-Efficient Acceleration of Block Low-Rank Foundation Models on Resource Constrained GPUs [11.45717904490388]
Recent advances in transformer-based foundation models have made them the default choice for many tasks. Their rapidly growing size makes fitting a full model on a single GPU increasingly difficult and their computational cost prohibitive. Block low-rank (BLR) compression techniques address this challenge by learning compact representations of weight matrices.
arXiv Detail & Related papers (2025-12-24T00:41:13Z)
- From Theory to Throughput: CUDA-Optimized APML for Large-Batch 3D Learning [8.063701386493289]
Chamfer Distance is efficient but permits many-to-one correspondences, while Earth Mover Distance better reflects one-to-one transport at high computational cost. APML is a sparse implementation that thresholds negligible assignments and runs adaptive softmax, bidirectional symmetrization, and Sinkhorn normalization directly in COO form.
arXiv Detail & Related papers (2025-12-17T23:18:51Z)
- 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float [52.079202872069835]
Large-scale AI models, such as Large Language Models (LLMs) and Diffusion Models (DMs), have grown rapidly in size. We introduce Dynamic-Length Float (DFloat11), a compression framework that reduces LLM and DM size by 30% while preserving outputs that are bit-for-bit identical to the original model.
arXiv Detail & Related papers (2025-04-15T22:38:38Z)
- Keras Sig: Efficient Path Signature Computation on GPU in Keras 3 [0.0]
Keras Sig is a high-performance Python library designed to compute path signatures for deep learning applications. Entirely built in Keras 3, Keras Sig leverages seamless integration with the most widely used deep learning backends such as PyTorch, JAX and TensorFlow.
arXiv Detail & Related papers (2025-01-14T22:00:01Z)
- Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrarily small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
arXiv Detail & Related papers (2024-10-22T17:59:30Z)
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method tailored to computer vision application scenarios.
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
- Kernel Operations on the GPU, with Autodiff, without Memory Overflows [5.669790037378094]
The KeOps library provides fast and memory-efficient GPU support for tensors whose entries are given by a mathematical formula.
KeOps alleviates the major bottleneck of tensor-centric libraries for kernel and geometric applications: memory consumption.
KeOps combines optimized C++/CUDA schemes with binders for high-level languages: Python (NumPy and PyTorch), Matlab and R.
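To make the KeOps entry concrete, here is a usage sketch in the spirit of its documented PyTorch LazyTensor interface (sizes and variable names are illustrative; consult the KeOps documentation for exact signatures): the kernel matrix is only defined symbolically and is reduced on the fly, so the N x M matrix never occupies memory.

```python
# Usage sketch based on KeOps' PyTorch LazyTensor interface; the kernel matrix
# K_ij below is symbolic and is reduced on the fly rather than materialized.
import torch
from pykeops.torch import LazyTensor

x = torch.randn(100_000, 3)        # N source points
y = torch.randn(200_000, 3)        # M target points
b = torch.randn(200_000, 1)        # signal attached to the target points

x_i = LazyTensor(x[:, None, :])    # (N, 1, 3) symbolic variable
y_j = LazyTensor(y[None, :, :])    # (1, M, 3) symbolic variable
D_ij = ((x_i - y_j) ** 2).sum(-1)  # symbolic (N, M) squared distances
K_ij = (-D_ij).exp()               # symbolic Gaussian kernel matrix

a = K_ij @ b                       # kernel matrix-vector product; the reduction
print(a.shape)                     # triggers the fused C++/CUDA computation, result (N, 1)
```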
arXiv Detail & Related papers (2020-03-27T08:54:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.