Flexible Operator Fusion for Fast Sparse Transformer with Diverse Masking on GPU
- URL: http://arxiv.org/abs/2506.06095v1
- Date: Fri, 06 Jun 2025 13:54:34 GMT
- Title: Flexible Operator Fusion for Fast Sparse Transformer with Diverse Masking on GPU
- Authors: Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Weifeng Liu, Qingxiao Sun
- Abstract summary: We propose STOF, a framework that incorporates optimizations for Sparse Transformer via flexible masking and operator fusion on GPU. We show that STOF achieves maximum speedups of 1.7x in MHA computation and 1.5x in end-to-end inference.
- Score: 18.470239387359094
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models are popular around the world due to their powerful understanding capabilities. Since the Transformer is the core component of LLMs, accelerating it through parallelization has become a hot research topic. Mask layers introduce sparsity into the Transformer to reduce computation. However, previous works rarely focus on the performance optimization of the sparse Transformer. Moreover, rule-based mechanisms ignore the fusion opportunities of mixed-type operators and fail to adapt to various sequence lengths. To address these problems, we propose STOF, a framework that incorporates optimizations for the Sparse Transformer via flexible masking and operator fusion on GPU. We first unify the storage format and kernel implementation for multi-head attention. Then, we map fusion schemes to compilation templates and determine the optimal parameter settings through a two-stage search engine. The experimental results show that, compared to the state-of-the-art work, STOF achieves maximum speedups of 1.7x in MHA computation and 1.5x in end-to-end inference.
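As a concrete reference point, the sketch below shows a plain, unfused masked multi-head attention in PyTorch with a sliding-window sparsity mask, i.e. the kind of mixed dense/sparse computation that STOF's unified storage format and fused kernels target. The mask builder, tensor shapes, and function names are illustrative assumptions and are not part of STOF.

```python
# Minimal sketch, not STOF: a reference (unfused) masked multi-head attention.
import torch

def build_sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask keeping only key positions within `window` of each query."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window  # (seq_len, seq_len)

def masked_mha(q, k, v, mask, num_heads: int):
    """Unfused masked multi-head attention.

    q, k, v: (batch, seq_len, d_model); mask: (seq_len, seq_len) boolean.
    """
    b, s, d_model = q.shape
    d_head = d_model // num_heads
    split = lambda x: x.view(b, s, num_heads, d_head).transpose(1, 2)  # (b, h, s, d_head)
    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(-2, -1) / d_head**0.5          # (b, h, s, s)
    scores = scores.masked_fill(~mask, float("-inf"))       # inject the sparsity pattern
    out = torch.softmax(scores, dim=-1) @ v                 # (b, h, s, d_head)
    return out.transpose(1, 2).reshape(b, s, d_model)       # merge heads

if __name__ == "__main__":
    b, s, d, h = 2, 128, 256, 8
    q, k, v = (torch.randn(b, s, d) for _ in range(3))
    mask = build_sliding_window_mask(s, window=16)
    print(masked_mha(q, k, v, mask, num_heads=h).shape)     # torch.Size([2, 128, 256])
```

A fused GPU kernel would avoid materializing the dense score tensor and would exploit the structure of the mask; this reference version only illustrates the computation being optimized.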
Related papers
- Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. This paper first attributes the inefficiency of Transformers to the attention sink phenomenon. We replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention.
arXiv Detail & Related papers (2025-02-26T05:31:44Z) - LLM Inference Acceleration via Efficient Operation Fusion [1.350507740574158]
Transformer-based Large Language Models (LLMs) contain hundreds of billions of parameters and require dedicated hardware resources for training and inference. One of the key challenges inherent to the Transformer architecture is the requirement to support numerous non-linear transformations. We propose an extremely efficient technique that can completely hide the overhead caused by such collective operations.
arXiv Detail & Related papers (2025-02-24T23:42:37Z) - MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter-count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z) - Transformer Fusion with Optimal Transport [25.022849817421964]
Fusion is a technique for merging multiple independently-trained neural networks in order to combine their capabilities.
This paper presents a systematic approach for fusing two or more transformer-based networks by exploiting Optimal Transport to (soft-)align the various architectural components.
arXiv Detail & Related papers (2023-10-09T13:40:31Z) - Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with a lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z) - Efficient Mixed Transformer for Single Image Super-Resolution [1.7740376367999706]
The Mixed Transformer Block (MTB) consists of multiple consecutive transformer layers.
Pixel Mixer (PM) is used to replace Self-Attention (SA).
PM enhances local knowledge aggregation with pixel-shifting operations.
arXiv Detail & Related papers (2023-05-19T03:19:38Z) - AMOM: Adaptive Masking over Masking for Conditional Masked Language Model [81.55294354206923]
A conditional masked language model (CMLM) is one of the most versatile frameworks.
We introduce a simple yet effective adaptive masking over masking strategy to enhance the refinement capability of the decoder.
Our proposed model yields state-of-the-art performance on neural machine translation.
arXiv Detail & Related papers (2023-03-13T20:34:56Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT); a minimal Toeplitz-via-FFT sketch follows after this list.
arXiv Detail & Related papers (2021-06-23T17:51:26Z) - Easy and Efficient Transformer: Scalable Inference Solution For Large NLP Model [14.321889138798072]
This paper introduces a series of ultra-large-scale pre-training model optimization methods.
An inference engine, Easy and Efficient Transformer (EET), is proposed.
EET achieves a 1.5-15x speedup over the state of the art, varying with context length.
arXiv Detail & Related papers (2021-04-26T11:00:56Z) - Glancing Transformer for Non-Autoregressive Neural Machine Translation [58.87258329683682]
We propose a method to learn word interdependency for single-pass parallel generation models.
With only single-pass parallel decoding, GLAT is able to generate high-quality translations with an 8-15x speedup.
arXiv Detail & Related papers (2020-08-18T13:04:03Z)
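As referenced in the Kernelized Attention with Relative Positional Encoding entry above, the key observation is that relative-position biases form a Toeplitz matrix, so the corresponding matrix-vector products can be computed in O(n log n) with the FFT. The NumPy sketch below illustrates only that Toeplitz-times-vector trick under assumed variable names; it is not the paper's kernelized attention implementation.

```python
# Minimal sketch: multiply a Toeplitz matrix (e.g. built from relative-position
# biases) by a vector in O(n log n) via a circulant embedding and the FFT.
import numpy as np

def toeplitz_matvec_fft(col: np.ndarray, row: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Compute T @ x where T is Toeplitz with first column `col` and first row `row`."""
    n = len(x)
    # Embed T in a (2n-1)x(2n-1) circulant matrix whose first column is `c`.
    c = np.concatenate([col, row[1:][::-1]])
    # Circulant multiplication is diagonalized by the DFT: C @ x = ifft(fft(c) * fft(x)).
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x, n=len(c)))
    return y[:n].real

if __name__ == "__main__":
    n = 6
    rel_bias = np.random.randn(2 * n - 1)            # biases for offsets -(n-1)..(n-1)
    col, row = rel_bias[n - 1:], rel_bias[:n][::-1]  # T[i, j] = rel_bias[(i - j) + n - 1]
    T = np.array([[rel_bias[(i - j) + n - 1] for j in range(n)] for i in range(n)])
    x = np.random.randn(n)
    print(np.allclose(T @ x, toeplitz_matvec_fft(col, row, x)))  # True
```

The circulant embedding (padding the Toeplitz matrix into a circulant one of size 2n-1) is what makes the FFT diagonalization applicable; the final check compares the FFT route against the dense Toeplitz product.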