FLASH-D: FlashAttention with Hidden Softmax Division
- URL: http://arxiv.org/abs/2505.14201v1
- Date: Tue, 20 May 2025 11:01:33 GMT
- Title: FLASH-D: FlashAttention with Hidden Softmax Division
- Authors: Kosmas Alexandridis, Vasileios Titopoulos, Giorgos Dimitrakopoulos
- Abstract summary: Building on online softmax computation, FlashAttention integrates softmax calculation with matrix arithmetic. This work presents FLASH-D, a mathematically equivalent, yet simplified, formulation that achieves: (a) hiding softmax division within other non-linear function evaluations; (b) inherently numerically stable computation of exponentials; and (c) a reduction in computational cost without introducing numerical approximations to the FlashAttention kernel.
- Score: 3.668018928502405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The transformer's attention mechanism has revolutionized AI and machine learning, with its efficient computation being crucial to its performance. However, calculating attention involves matrix operations interspersed with softmax rescaling, which inherently slows down computation and requires processing the entire input sequence. Building on online softmax computation, FlashAttention integrates softmax calculation with matrix arithmetic, enabling tiled computation independent of sequence length. While optimized for GPUs, FlashAttention's simplicity makes it amenable to direct hardware acceleration. This work re-evaluates the core FlashAttention kernel, presenting FLASH-D, a mathematically equivalent, yet simplified, formulation that achieves: (a) hiding softmax division within other non-linear function evaluations; (b) inherently numerically stable computation of exponentials, eliminating the need for maximum value subtraction; and (c) a reduction in computational cost without introducing numerical approximations to the FlashAttention kernel. Importantly, the essential FlashAttention properties that facilitate efficient tiled implementation are fully preserved. Hardware implementation results at 28 nm demonstrate that the proposed formulation achieves a 22.8% reduction in area and a 20.3% reduction in power, on average, compared to state-of-the-art parallel hardware architectures without any performance penalty.
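FLASH-D's own reformulation is not reproduced in this listing, but the online-softmax recurrence it builds on is easy to sketch. The NumPy fragment below (illustrative only; function and variable names are not from the paper) processes keys and values in blocks while maintaining a running maximum, running sum, and running output, and performs at the end the softmax division that FLASH-D hides inside the kernel's non-linear updates:

```python
import numpy as np

def attention_online_softmax(q, K, V, block=4):
    """Tiled attention for one query vector using the online-softmax
    recurrence that FlashAttention builds on (illustrative sketch).
    q: (d,), K: (N, d), V: (N, dv). The full N x N score matrix is
    never materialized."""
    d = q.shape[0]
    m = -np.inf                # running maximum of scores seen so far
    l = 0.0                    # running sum of exp(score - m)
    o = np.zeros(V.shape[1])   # running (unnormalized) weighted sum of values

    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q / np.sqrt(d)        # scores for this block
        m_new = max(m, s.max())        # updated running maximum
        p = np.exp(s - m_new)          # unnormalized block probabilities
        scale = np.exp(m - m_new)      # rescale old accumulators to new max
        l = l * scale + p.sum()
        o = o * scale + p @ Vb
        m = m_new

    return o / l  # the final softmax division FLASH-D folds into the recurrence

def attention_reference(q, K, V):
    s = K @ q / np.sqrt(q.shape[0])
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
assert np.allclose(attention_online_softmax(q, K, V), attention_reference(q, K, V))
```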
Related papers
- Orthogonal Finetuning Made Scalable [87.49040247077389]
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. We propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance.
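The weight-centric versus input-centric distinction comes down to associativity: applying an orthogonal factor to the weight matrix requires a matrix-matrix product, while applying it to the already-projected activation needs only matrix-vector products. A hedged NumPy sketch of that idea (the orthogonal factor here is a plain random rotation, not OFT's actual parameterization):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 256
W = rng.normal(size=(d, d))   # frozen pretrained weight
x = rng.normal(size=d)        # input activation

# Orthogonal factor; OFT parameterizes this more cheaply (e.g. block-diagonal,
# Cayley transform), a random rotation is used here purely for illustration.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Weight-centric view: materialize the adapted weight Q @ W (matrix-matrix,
# cubic in d), then apply it to the input.
y_weight_centric = (Q @ W) @ x

# Input-centric view (the OFTv2-style reformulation, by associativity):
# never form Q @ W, only two matrix-vector products (quadratic in d).
y_input_centric = Q @ (W @ x)

assert np.allclose(y_weight_centric, y_input_centric)
```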
arXiv Detail & Related papers (2025-06-24T17:59:49Z) - Low-Cost FlashAttention with Fused Exponential and Multiplication Hardware Operators [3.668018928502405]
We focus on optimizing the kernel of floating-point-based FlashAttention using new hardware operators that fuse the computation of exponentials and vector multiplications. The proposed ExpMul hardware operators significantly reduce the area and power costs of FlashAttention-based hardware accelerators.
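Functionally, a fused exponentiate-and-multiply operator produces exp(s) * v in one step instead of a separate exponential unit feeding a multiplier. The short sketch below (names are illustrative, not the ExpMul interface) shows the two-step sequence such an operator would replace in a FlashAttention-style update:

```python
import numpy as np

def expmul(s, v):
    """Functional view of a fused exponentiate-and-multiply operator:
    one unit producing exp(s) * v, instead of an exponential followed
    by a separate multiply (illustrative naming, not the paper's API)."""
    return np.exp(s) * v

# In a FlashAttention-style update, this fuses p = exp(s - m); acc += p * v.
s, m, v = 1.3, 2.0, np.array([0.5, -1.0, 2.0])
acc = expmul(s - m, v)
assert np.allclose(acc, np.exp(s - m) * v)
```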
arXiv Detail & Related papers (2025-05-20T13:00:59Z) - VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers [13.984340807378457]
Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as the most demanding step. We design a custom arithmetic block for Bfloat16 exponentiation leveraging a novel approximation algorithm based on Schraudolph's method. We execute Softmax with 162.7$\times$ less latency and 74.3$\times$ less energy compared to the baseline cluster.
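Schraudolph's method approximates exp by writing a scaled-and-shifted copy of the argument directly into the bit pattern of a floating-point number, so the exponent field handles powers of two and the mantissa linearly interpolates in between. A minimal float64 sketch of the idea (not the paper's Bfloat16 unit, and accurate only to a few percent without a correction constant):

```python
import numpy as np

def schraudolph_exp(x):
    """Schraudolph-style fast exp approximation (float64 variant).
    Writes a*x + b into the bit pattern of an IEEE-754 double, so the
    exponent field yields 2**(x/ln 2) and the mantissa gives a
    piecewise-linear interpolation in between. Up to roughly 6% relative
    error without a correction term; illustrative only."""
    a = 2.0**52 / np.log(2.0)   # maps x into the double's exponent field
    b = 1023.0 * 2.0**52        # IEEE-754 exponent bias, shifted into place
    i = (a * np.asarray(x, dtype=np.float64) + b).astype(np.int64)
    return i.view(np.float64)   # reinterpret the integer bits as a double

x = np.linspace(-3.0, 3.0, 1001)
print(np.max(np.abs(schraudolph_exp(x) / np.exp(x) - 1.0)))  # ~0.06 max relative error
```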
arXiv Detail & Related papers (2025-04-15T14:28:48Z) - FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores [6.404201720333765]
We propose FlashSparse, a novel approach to bridge the gap between sparse workloads and the TCU architecture. Specifically, FlashSparse minimizes the sparse granularity for SpMM and SDDMM on TCUs through a novel swap-and-transpose matrix multiplication strategy. We show that FlashSparse sets a new state-of-the-art for sparse matrix multiplications (geometric mean 5.5x speedup over DTC-SpMM and 3.22x speedup over RoDe).
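The swap-and-transpose strategy rests on the identity A @ B = (B.T @ A.T).T, which lets the sparse operand be fed to the tensor core from the other side of the MMA; the hardware-mapping benefit is the paper's contribution, but the algebra can be checked in a few lines of NumPy (sketch only):

```python
import numpy as np

rng = np.random.default_rng(2)
# S: sparse operand (a dense array with many zeros, for illustration),
# D: dense operand, as in SpMM  C = S @ D.
S = rng.normal(size=(16, 32)) * (rng.random(size=(16, 32)) < 0.1)
D = rng.normal(size=(32, 8))

# Direct product, with S as the left MMA operand.
C_direct = S @ D

# Swap-and-transpose: compute D^T @ S^T and transpose the result.
# Algebraically identical, but it moves the sparse operand to the other
# side of the tensor-core MMA, which is what allows the smaller sparse
# granularity (hardware mapping details are in the paper).
C_swapped = (D.T @ S.T).T

assert np.allclose(C_direct, C_swapped)
```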
arXiv Detail & Related papers (2024-12-15T01:12:33Z) - FlashMask: Efficient and Rich Mask Extension of FlashAttention [22.810595298076866]
FlashMask is an extension of FlashAttention that introduces a column-wise sparse representation of attention masks. By adopting this novel representation, FlashMask achieves linear memory complexity $O(N)$, suitable for modeling long-context sequences. We evaluate FlashMask's performance in LLM fine-tuning and alignment training, including SFT, LoRA, DPO, and RM.
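A column-wise sparse mask stores, for each key column, the range of query rows it blocks, so the mask costs O(N) integers instead of an O(N^2) dense matrix. The sketch below is a simplified illustration of that representation (FlashMask's actual encoding stores several ranges per column); it expands a causal mask from per-column intervals and checks it against the dense version:

```python
import numpy as np

def mask_from_column_ranges(n, start, end):
    """Expand a column-wise interval representation into a dense n x n
    boolean mask (True = attention blocked). For key column j, query rows
    in [start[j], end[j]) are masked. Illustrative simplification of a
    column-wise sparse mask: O(N) integers instead of an O(N^2) array."""
    rows = np.arange(n)[:, None]
    return (rows >= start[None, :]) & (rows < end[None, :])

n = 6
# Causal mask expressed column-wise: in column j, rows 0..j-1 (queries that
# precede the key position) are blocked.
start = np.zeros(n, dtype=int)
end = np.arange(n)
col_mask = mask_from_column_ranges(n, start, end)

dense_causal = np.triu(np.ones((n, n), dtype=bool), k=1)
assert np.array_equal(col_mask, dense_causal)
```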
arXiv Detail & Related papers (2024-10-02T09:17:26Z) - Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
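A random feature approximation replaces an exact kernel evaluation with an inner product of finite random feature maps. As a hedged illustration of the idea (not RFAD's exact recipe), random ReLU features give a Monte Carlo estimate of the first-order arc-cosine kernel, which corresponds, up to scaling conventions, to the NNGP kernel of a one-hidden-layer ReLU network:

```python
import numpy as np

def arccos_kernel_order1(x, y):
    """Exact first-order arc-cosine kernel, i.e. (up to scaling conventions)
    the NNGP kernel of a one-hidden-layer ReLU network."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    theta = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))
    return nx * ny * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi

def random_relu_features(X, n_features, rng):
    """Random-feature map whose inner products approximate the kernel above;
    more features -> lower Monte Carlo variance. Illustrative of the random
    feature approximation idea behind RFAD, not its exact construction."""
    W = rng.normal(size=(X.shape[1], n_features))
    return np.maximum(X @ W, 0.0) * np.sqrt(2.0 / n_features)

rng = np.random.default_rng(3)
x, y = rng.normal(size=5), rng.normal(size=5)
phi = random_relu_features(np.stack([x, y]), n_features=200_000, rng=rng)
print(arccos_kernel_order1(x, y), phi[0] @ phi[1])  # the two values should be close
```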
arXiv Detail & Related papers (2022-10-21T15:56:13Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts on approximating the self-attention with linear complexity have been made in Natural Language Processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer or SOFT is proposed.
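The generic payoff of removing softmax is that attention becomes a product of three matrices, and associativity lets the key-value product be formed once and reused for every query, giving cost linear in sequence length. The sketch below shows that generic linearization in NumPy; SOFT's specific Gaussian-kernel construction is more involved and is not reproduced here:

```python
import numpy as np

def softmax_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Generic softmax-free (kernelized) attention: with softmax removed,
    associativity lets phi(K)^T V be computed once (d x dv) and reused for
    every query, so the cost is linear in sequence length N. This sketches
    the general linearization idea only, not SOFT's exact method."""
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                 # (d, dv): independent of N once formed
    Z = Qp @ Kp.sum(axis=0)       # per-query normalizer
    return (Qp @ KV) / Z[:, None]

rng = np.random.default_rng(4)
Q, K, V = (rng.normal(size=(32, 16)) for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)  # both (32, 16)
```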
arXiv Detail & Related papers (2021-10-22T17:57:29Z) - Minimax Optimization with Smooth Algorithmic Adversaries [59.47122537182611]
We propose a new algorithm for the min-player against smooth algorithms deployed by an adversary.
Our algorithm is guaranteed to make monotonic progress (having no limit cycles) and to find an appropriate number of gradient ascent steps.
arXiv Detail & Related papers (2021-06-02T22:03:36Z) - Memristive Stochastic Computing for Deep Learning Parameter Optimization [1.6344851071810071]
Stochastic Computing (SC) is a computing paradigm that allows for the low-cost and low-power computation of various arithmetic operations using bit streams and digital logic.
We demonstrate that, using a 40-nm Complementary Metal Oxide Semiconductor (CMOS) process, our scalable architecture occupies 1.55 mm$^2$ and consumes approximately 167 $\mu$W when optimizing the parameters of a Convolutional Neural Network (CNN) while it is being trained for a character recognition task, observing no notable reduction in accuracy post-training.
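In stochastic computing, a value in [0, 1] is carried by the mean of a random bit stream, and multiplication reduces to a single AND gate per bit. The snippet below illustrates only this basic SC primitive (not the paper's memristive optimizer architecture):

```python
import numpy as np

def to_stream(p, length, rng):
    """Encode a value p in [0, 1] as a stochastic bit stream: each bit is 1
    with probability p, so the value is carried by the stream's mean."""
    return (rng.random(length) < p).astype(np.uint8)

rng = np.random.default_rng(5)
length = 100_000
a, b = 0.6, 0.3

sa, sb = to_stream(a, length, rng), to_stream(b, length, rng)

# Multiplication in stochastic computing is a single AND gate per bit:
# P(sa AND sb = 1) = a * b for independent streams.
product_stream = sa & sb
print(product_stream.mean(), a * b)  # ~0.18, up to Monte Carlo noise
```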
arXiv Detail & Related papers (2021-03-11T07:10:32Z) - Efficient Learning of Generative Models via Finite-Difference Score Matching [111.55998083406134]
We present a generic strategy to efficiently approximate any-order directional derivative with finite difference.
Our approximation only involves function evaluations, which can be executed in parallel, and no gradient computations.
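The basic building block is a central finite difference: the directional derivative of f at x along v needs only two function evaluations, which can be issued in parallel, and no gradient computation. A small NumPy sketch of that first-order case (the paper generalizes it to any order):

```python
import numpy as np

def directional_derivative_fd(f, x, v, eps=1e-5):
    """First-order directional derivative v . grad f(x) via a central finite
    difference: two function evaluations (parallelizable) and no gradients.
    Illustrative of the generic strategy; the paper extends it to any order."""
    return (f(x + eps * v) - f(x - eps * v)) / (2.0 * eps)

f = lambda x: np.sum(x ** 3)          # toy scalar function
x = np.array([1.0, -2.0, 0.5])
v = np.array([0.0, 1.0, 2.0])

approx = directional_derivative_fd(f, x, v)
exact = (3.0 * x ** 2) @ v            # analytic gradient dotted with v
print(approx, exact)                  # the two values should agree closely
```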
arXiv Detail & Related papers (2020-07-07T10:05:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.