An Efficient Data Reuse with Tile-Based Adaptive Stationary for Transformer Accelerators
- URL: http://arxiv.org/abs/2503.19640v1
- Date: Tue, 25 Mar 2025 13:29:58 GMT
- Title: An Efficient Data Reuse with Tile-Based Adaptive Stationary for Transformer Accelerators
- Authors: Tseng-Jen Li, Tian-Sheuan Chang
- Abstract summary: Transformer-based models have become the de facto backbone across many fields, such as computer vision and natural language processing. External memory access (EMA) for weights and activations becomes a critical bottleneck due to its significantly higher energy consumption compared to internal computations. We propose the Tile-based Adaptive Stationary scheme, which selects input- or weight-stationary dataflow per tile based on the input sequence length.
- Score: 0.0502254944841629
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformer-based models have become the de facto backbone across many fields, such as computer vision and natural language processing. However, as these models scale in size, external memory access (EMA) for weights and activations becomes a critical bottleneck due to its significantly higher energy consumption compared to internal computations. While most prior work has focused on optimizing the self-attention mechanism, little attention has been given to optimizing data transfer during linear projections, where EMA costs are equally important. In this paper, we propose the Tile-based Adaptive Stationary (TAS) scheme, which selects input- or weight-stationary dataflow at tile granularity based on the input sequence length. Our experimental results demonstrate that TAS can reduce EMA by more than 97% compared to traditional stationary schemes, while remaining compatible with various attention optimization techniques and hardware accelerators.
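The abstract does not come with code, but the per-tile decision is easy to sketch. Below is a minimal Python sketch of tile-granularity stationary selection under a simplified EMA cost model; the function names, the single `tile` parameter, and the cost formulas are illustrative assumptions rather than the authors' implementation.

```python
import math

def ema_cost(seq_len, d_in, d_out, tile=64):
    """Rough external-memory-access cost, in elements, of the linear
    projection Y = X @ W with X: (seq_len, d_in) and W: (d_in, d_out).
    Assumes on-chip buffers hold one tile of the stationary operand."""
    # Weight-stationary: W is loaded once; X is re-streamed once per
    # column tile of W.
    ws = d_in * d_out + seq_len * d_in * math.ceil(d_out / tile)
    # Input-stationary: X is loaded once; W is re-streamed once per
    # row tile of X.
    ins = seq_len * d_in + d_in * d_out * math.ceil(seq_len / tile)
    return ws, ins

def choose_stationary(seq_len, d_in, d_out, tile=64):
    """Pick the cheaper dataflow for a tile, mirroring TAS's per-tile choice."""
    ws, ins = ema_cost(seq_len, d_in, d_out, tile)
    return "weight" if ws <= ins else "input"

# Short sequences favor input-stationary; long sequences amortize the
# weight load and favor weight-stationary.
print(choose_stationary(seq_len=32, d_in=768, d_out=768))    # -> input
print(choose_stationary(seq_len=4096, d_in=768, d_out=768))  # -> weight
```

Under this toy model the crossover sits roughly where the activation footprint overtakes the weight footprint, which is why sequence length is the natural switch.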
Related papers
- POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation [57.57816409869894]
We introduce POET-X, a scalable and memory-efficient variant for training large language models. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency.
arXiv Detail & Related papers (2026-03-05T18:59:23Z) - Joint Channel Estimation and Computation Offloading in Fluid Antenna-assisted MEC Networks [81.36647816787713]
We propose an FA-assisted offloading framework to minimize the delay of channel estimation. We show that the proposed system significantly reduces this delay while keeping communication efficient.
arXiv Detail & Related papers (2025-09-16T08:48:44Z) - GENIAL: Generative Design Space Exploration via Network Inversion for Low Power Algorithmic Logic Units [1.5845117761091052]
We introduce GENIAL, a machine learning-based framework for the automatic generation and optimization of arithmetic units. We show that GENIAL is consistently more sample-efficient than other methods and converges faster towards optimized designs. We also demonstrate the versatility of our approach by achieving significant improvements on Finite State Machines.
arXiv Detail & Related papers (2025-07-25T06:34:59Z) - Optimal Transport Adapter Tuning for Bridging Modality Gaps in Few-Shot Remote Sensing Scene Classification [80.83325513157637]
Few-Shot Remote Sensing Scene Classification (FS-RSSC) presents the challenge of classifying remote sensing images with limited labeled samples. We propose a novel Optimal Transport Adapter Tuning (OTAT) framework aimed at constructing an ideal Platonic representational space.
arXiv Detail & Related papers (2025-03-19T07:04:24Z) - Pruning-Based TinyML Optimization of Machine Learning Models for Anomaly Detection in Electric Vehicle Charging Infrastructure [8.29566258132752]
This paper investigates a pruning method for anomaly detection in resource-constrained environments, specifically targeting EVCI. Optimized models achieved significant reductions in model size and inference times, with only a marginal impact on their performance. Notably, our findings indicate that, in the context of EVCI, pruning and FS can enhance computational efficiency while retaining critical anomaly detection capabilities.
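The summary does not specify the pruning procedure, so the sketch below shows one common variant, global magnitude pruning, in NumPy; the function name and the single global quantile threshold are assumptions for illustration, not the paper's method.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.8):
    """Zero out the smallest-magnitude weights across all layers.
    weights: dict mapping layer name -> ndarray; sparsity: fraction removed."""
    flat = np.concatenate([w.ravel() for w in weights.values()])
    threshold = np.quantile(np.abs(flat), sparsity)  # one global cutoff
    return {name: np.where(np.abs(w) < threshold, 0.0, w)
            for name, w in weights.items()}
```

Zeroed weights shrink the stored model once compressed and can be skipped at inference, which is the usual source of the size and latency reductions such papers report.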
arXiv Detail & Related papers (2025-03-19T00:18:37Z) - Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of the softmax operation. Experiments demonstrate that SWAT achieves state-of-the-art performance compared with linear recurrent architectures on eight benchmarks.
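As a rough illustration of the windowed pattern such training relies on, the sketch below builds a causal sliding-window attention mask in NumPy; the shapes and window convention are generic assumptions, not SWAT's exact formulation.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """True where token i may attend to token j, with j restricted to the
    causal window [i - window + 1, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# Usage: disallowed positions are set to -inf before the softmax.
scores = np.random.randn(8, 8)
scores[~sliding_window_mask(8, window=3)] = -np.inf
```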
arXiv Detail & Related papers (2025-02-26T05:31:44Z) - MATEY: multiscale adaptive foundation models for spatiotemporal physical systems [2.7767126393602726]
We propose two adaptive tokenization schemes that dynamically adjust patch sizes based on local features. We evaluate the performance of a proposed multiscale adaptive model, MATEY, in a sequence of experiments. We also demonstrate fine-tuning on tasks featuring different physics using models pretrained on PDE data.
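Since the summary only says patch sizes follow local features, the sketch below uses local variance as a stand-in refinement criterion; the coarse and fine sizes and the threshold are hypothetical, not MATEY's actual scheme.

```python
import numpy as np

def adaptive_patches(field, coarse=16, fine=4, threshold=0.1):
    """Tokenize a 2-D field with coarse patches, refining to fine patches
    wherever the local standard deviation is high."""
    h, w = field.shape
    patches = []
    for y in range(0, h, coarse):
        for x in range(0, w, coarse):
            block = field[y:y + coarse, x:x + coarse]
            if block.std() > threshold:  # feature-rich region: refine
                patches += [block[dy:dy + fine, dx:dx + fine]
                            for dy in range(0, coarse, fine)
                            for dx in range(0, coarse, fine)]
            else:                        # smooth region: keep one coarse patch
                patches.append(block)
    return patches
```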
arXiv Detail & Related papers (2024-12-29T22:13:16Z) - Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves [123.07450481623124]
We propose Skip Tuning as a novel paradigm for adapting vision-language models to downstream tasks. Unlike existing PT or adapter-based methods, Skip Tuning applies Layer-wise Skipping (LSkip) and Class-wise Skipping (CSkip) upon the FT baseline without introducing extra context vectors or adapter modules.
arXiv Detail & Related papers (2024-12-16T07:33:23Z) - PointMT: Efficient Point Cloud Analysis with Hybrid MLP-Transformer Architecture [46.266960248570086]
This study tackles the quadratic complexity of the self-attention mechanism by introducing a linear-complexity local attention mechanism for effective feature aggregation.
We also introduce a parameter-free channel temperature adaptation mechanism that adaptively adjusts the attention weight distribution in each channel.
We show that PointMT achieves performance comparable to state-of-the-art methods while maintaining an optimal balance between accuracy and efficiency.
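One way to read a parameter-free per-channel temperature is to derive it from the scores themselves. The sketch below uses the per-channel standard deviation for that purpose; this particular statistic is an illustrative guess, not PointMT's formulation.

```python
import numpy as np

def channel_temperature_softmax(scores):
    """Softmax over the last axis with a data-derived temperature per
    channel. scores: (channels, n, n) attention logits."""
    t = scores.std(axis=(-2, -1), keepdims=True) + 1e-6  # no learned params
    z = scores / t
    z -= z.max(axis=-1, keepdims=True)                   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```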
arXiv Detail & Related papers (2024-08-10T10:16:03Z) - SWAT: Scalable and Efficient Window Attention-based Transformers Acceleration on FPGAs [3.302913401404089]
Sliding window-based static sparse attention mitigates the problem by limiting the attention scope of the input tokens.
We propose a dataflow-aware FPGA-based accelerator design, SWAT, that efficiently leverages the sparsity to achieve scalable performance for long input.
arXiv Detail & Related papers (2024-05-27T10:25:08Z) - Dimensionality Reduction in Sentence Transformer Vector Databases with Fast Fourier Transform [0.0]
Dimensionality reduction in vector databases is pivotal for streamlining AI data management.
This paper explores the benefits of reducing vector database dimensions, with a focus on computational efficiency and overcoming the curse of dimensionality.
We introduce a novel application of Fast Fourier Transform (FFT) to dimensionality reduction, a method previously underexploited in this context.
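A plain reading of the method is to transform each embedding with a real FFT and keep only the leading coefficients. The sketch below does exactly that; the number of kept coefficients and the real/imaginary packing are assumptions, since the exact selection is not given here.

```python
import numpy as np

def fft_reduce(embeddings, k=64):
    """Reduce (n, d) real embeddings to (n, 2k) by keeping the first k
    FFT coefficients of each row. Requires k <= d // 2 + 1."""
    spectrum = np.fft.rfft(embeddings, axis=1)   # (n, d // 2 + 1), complex
    kept = spectrum[:, :k]
    # Pack real and imaginary parts so downstream similarity search can
    # keep using real-valued vectors.
    return np.concatenate([kept.real, kept.imag], axis=1)
```

Truncating the spectrum keeps the slowly varying structure of each vector, which is the usual rationale for FFT-based reduction.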
arXiv Detail & Related papers (2024-04-09T13:02:22Z) - Accelerator-driven Data Arrangement to Minimize Transformers Run-time on Multi-core Architectures [5.46396577345121]
The increasing complexity of transformer models in artificial intelligence expands their computational costs, memory usage, and energy consumption.
We propose a novel memory arrangement strategy, governed by the hardware accelerator's kernel size, which effectively minimizes off-chip data access.
Our approach can achieve up to a 2.8x speedup when executing inference with state-of-the-art transformers.
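A generic form of a kernel-size-governed arrangement is to store each kernel x kernel tile contiguously so that a tile read becomes one sequential burst. The sketch below shows that transform; the tile size and the pre-padded dimensions are assumptions, not the paper's exact layout.

```python
import numpy as np

def tile_layout(matrix, kernel=8):
    """Rearrange a 2-D array so each kernel x kernel tile is contiguous
    in memory; dimensions must be multiples of `kernel` (pad beforehand)."""
    h, w = matrix.shape
    assert h % kernel == 0 and w % kernel == 0
    tiles = matrix.reshape(h // kernel, kernel, w // kernel, kernel)
    # Axis order (tile_row, tile_col, intra_row, intra_col): each tile's
    # elements now sit back to back in memory.
    return np.ascontiguousarray(tiles.transpose(0, 2, 1, 3))
```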
arXiv Detail & Related papers (2023-12-20T13:01:25Z) - FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
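Generic linear attention replaces softmax(Q K^T) V with a feature map phi and reorders the products to get O(n) cost. The sketch below uses a shifted-ReLU phi as a stand-in; FLatten's focused mapping differs, so treat this as the general template only.

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Computes phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1), exploiting
    associativity so no n x n matrix is ever formed.
    Q, K: (n, d); V: (n, d_v)."""
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                    # (d, d_v), built once in O(n d d_v)
    norm = Qf @ Kf.sum(axis=0)       # (n,) normalizer
    return (Qf @ kv) / norm[:, None]
```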
arXiv Detail & Related papers (2023-08-01T10:37:12Z) - Towards Long-Term Time-Series Forecasting: Feature, Pattern, and Distribution [57.71199089609161]
Long-term time-series forecasting (LTTF) has become a pressing demand in many applications, such as wind power supply planning.
Transformer models have been adopted to deliver high prediction capacity, despite the heavy computational cost of the self-attention mechanism.
We propose an efficient Transformer-based model, named Conformer, which differentiates itself from existing methods for LTTF in three aspects.
arXiv Detail & Related papers (2023-01-05T13:59:29Z) - AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
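The token-reduction step can be illustrated as keeping only the highest-scoring tokens at a given depth. In the sketch below the score vector is a placeholder input; AdaViT learns its halting decisions inside the network rather than taking scores as arguments.

```python
import numpy as np

def drop_tokens(tokens, scores, keep_ratio=0.7):
    """Keep the top keep_ratio fraction of tokens by score.
    tokens: (n, d); scores: (n,) importance per token."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]   # indices of the k best tokens
    return tokens[np.sort(keep)]     # preserve original token order
```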
arXiv Detail & Related papers (2021-12-14T18:56:07Z) - Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformer-based models.
We prove that eliminating the MASK token and considering the whole output during the loss computation are essential choices to improve performance.
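The "whole output" choice can be shown as a loss that averages over every position instead of only the masked ones. The sketch below contrasts the two; the shapes and mask convention are generic assumptions, not the paper's training code.

```python
import numpy as np

def token_loss(logits, targets, mask=None):
    """Mean token-level cross-entropy. logits: (n, vocab); targets: (n,).
    mask=None averages over the whole output (the recommended choice);
    a boolean mask restricts the loss to masked positions only."""
    z = logits - logits.max(axis=1, keepdims=True)  # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    if mask is not None:
        nll = nll[mask]
    return nll.mean()
```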
arXiv Detail & Related papers (2021-04-20T00:09:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.