SPAT: Sensitivity-based Multihead-attention Pruning on Time Series Forecasting Models
- URL: http://arxiv.org/abs/2505.08768v1
- Date: Tue, 13 May 2025 17:39:31 GMT
- Title: SPAT: Sensitivity-based Multihead-attention Pruning on Time Series Forecasting Models
- Authors: Suhan Guo, Jiahong Deng, Mengjun Yi, Furao Shen, Jian Zhao
- Abstract summary: We propose a structured pruning method, SPAT ($\textbf{S}$ensitivity $\textbf{P}$runer for $\textbf{At}$tention), which selectively removes redundant attention mechanisms and yields highly effective models. Experiments on multivariate datasets demonstrate that SPAT-pruned models achieve reductions of 2.842% in MSE, 1.996% in MAE, and 35.274% in FLOPs.
- Score: 8.817690876855728
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Attention-based architectures have achieved superior performance in multivariate time series forecasting but are computationally expensive. Techniques such as patching and adaptive masking have been developed to reduce their sizes and latencies. In this work, we propose a structured pruning method, SPAT ($\textbf{S}$ensitivity $\textbf{P}$runer for $\textbf{At}$tention), which selectively removes redundant attention mechanisms and yields highly effective models. Different from previous approaches, SPAT aims to remove the entire attention module, which reduces the risk of overfitting and enables speed-up without demanding specialized hardware. We propose a dynamic sensitivity metric, $\textbf{S}$ensitivity $\textbf{E}$nhanced $\textbf{N}$ormalized $\textbf{D}$ispersion (SEND) that measures the importance of each attention module during the pre-training phase. Experiments on multivariate datasets demonstrate that SPAT-pruned models achieve reductions of 2.842% in MSE, 1.996% in MAE, and 35.274% in FLOPs. Furthermore, SPAT-pruned models outperform existing lightweight, Mamba-based and LLM-based SOTA methods in both standard and zero-shot inference, highlighting the importance of retaining only the most effective attention mechanisms. We have made our code publicly available at https://anonymous.4open.science/r/SPAT-6042.
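The abstract describes SPAT only at a high level and does not give the exact form of the SEND metric. The following is a minimal, hypothetical PyTorch sketch of the general idea: score each attention module with a normalized-dispersion statistic collected on calibration data during pre-training, then structurally remove the lowest-scoring modules. The names `send_score`, `prune_attention_modules`, and `IdentityAttention` are illustrative and are not the paper's API.

```python
# Minimal sketch of sensitivity-based attention-module pruning (not the paper's code).
# Assumptions: each Transformer block exposes its multi-head attention as `block.attn`,
# and a module is scored by the normalized dispersion of its outputs on calibration data.
import torch
import torch.nn as nn


def send_score(outputs: torch.Tensor) -> float:
    """Hypothetical normalized-dispersion score: per-token feature spread divided by
    the mean absolute activation, averaged over the calibration batch."""
    dispersion = outputs.std(dim=-1)            # spread of each token's features
    scale = outputs.abs().mean(dim=-1) + 1e-8   # normalizer makes the score scale-free
    return (dispersion / scale).mean().item()


class IdentityAttention(nn.Module):
    """Stand-in for a pruned attention module: passes the query through unchanged."""
    def forward(self, query, key, value, need_weights=False):
        return query, None


@torch.no_grad()
def prune_attention_modules(blocks: nn.ModuleList, calib: torch.Tensor, keep_ratio: float = 0.5):
    """Score every attention module, then replace the lowest-scoring ones with identities."""
    scores, x = [], calib
    for block in blocks:
        attn_out, _ = block.attn(x, x, x, need_weights=False)
        scores.append(send_score(attn_out))
        x = x + attn_out                         # residual path, as in a standard Transformer
    n_keep = max(1, int(keep_ratio * len(blocks)))
    keep = set(sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)[:n_keep])
    for i, block in enumerate(blocks):
        if i not in keep:
            block.attn = IdentityAttention()     # structured removal: the whole module goes
    return scores


class Block(nn.Module):
    """Toy block holding only the attention sub-layer relevant to this sketch."""
    def __init__(self, d_model: int = 32, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)


if __name__ == "__main__":
    blocks = nn.ModuleList([Block() for _ in range(4)])
    calib = torch.randn(8, 16, 32)               # (batch, sequence length, model dim)
    print(prune_attention_modules(blocks, calib, keep_ratio=0.5))
```

Because entire modules are removed rather than individual weights being zeroed, the pruned model needs no sparse kernels or specialized hardware to realize the speed-up, which matches the motivation stated in the abstract.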
Related papers
- Attention, Please! Revisiting Attentive Probing for Masked Image Modeling [20.39513629593113]
We introduce efficient probing (EP), a cross-attention mechanism that eliminates redundant projections, reduces the number of trainable parameters, and achieves up to a 10$\times$ speed-up over conventional multi-head attention. EP generalizes well beyond MIM to diverse pre-training paradigms, produces interpretable attention maps, and achieves strong gains in low-shot and layer-wise settings.
arXiv Detail & Related papers (2025-06-11T21:10:26Z) - FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA [61.79405341803085]
Low-Rank Adaptation (LoRA) is widely used for efficient fine-tuning of language models in federated learning (FL).
arXiv Detail & Related papers (2025-05-19T07:32:56Z) - Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance.
arXiv Detail & Related papers (2025-05-10T17:15:49Z) - Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores [13.089178890203652]
This paper presents Samoyeds, an innovative acceleration system for MoE LLMs utilizing Sparse Tensor Cores (SpTCs). It introduces a bespoke sparse data format tailored for MoE computation and develops a specialized sparse-sparse matrix multiplication kernel. Evaluations show that Samoyeds outperforms SOTA works by up to 1.99$\times$ at the kernel level and 1.58$\times$ at the model level.
arXiv Detail & Related papers (2025-03-13T10:34:15Z) - RAM: Replace Attention with MLP for Efficient Multivariate Time Series Forecasting [21.7023262988233]
We propose a novel pruning strategy that approximates the attention mechanism using only feedforward layers, residual connections, and layer normalization (a minimal sketch of this replacement appears after this list). RAM achieves a 62.579% reduction in FLOPs for spatial-temporal models with less than 2.5% performance drop, and a 42.233% FLOPs reduction for temporal models with less than 2% performance drop.
arXiv Detail & Related papers (2024-10-31T15:23:34Z) - Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions [26.025283259518936]
Rodimus is a new type of attention system for Transformer-based large language models (LLMs).
Rodimus employs a data-dependent tempered selection mechanism within a linear attention-based, purely recurrent framework.
Our experiments demonstrate that Rodimus$+$-1.6B, trained on 1 trillion tokens, achieves superior downstream performance against models trained on more tokens.
arXiv Detail & Related papers (2024-10-09T06:22:36Z) - Temporal Feature Matters: A Framework for Diffusion Model Quantization [105.3033493564844]
Diffusion models rely on the time-step for the multi-round denoising. We introduce a novel quantization framework that includes three strategies. This framework preserves most of the temporal information and ensures high-quality end-to-end generation.
arXiv Detail & Related papers (2024-07-28T17:46:15Z) - Exploiting Pre-trained Models for Drug Target Affinity Prediction with Nearest Neighbors [58.661454334877256]
Drug-Target binding Affinity (DTA) prediction is essential for drug discovery.
Despite the application of deep learning methods to DTA prediction, the achieved accuracy remains suboptimal.
We propose $k$NN-DTA, a non-representation embedding-based retrieval method adopted on a pre-trained DTA prediction model.
arXiv Detail & Related papers (2024-07-21T15:49:05Z) - Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models [29.863953001061635]
Diffusion Models (DMs) have exhibited superior performance in generating high-quality and diverse images.
Existing works mainly adopt a retraining process to enhance DM efficiency.
We introduce the Attention-driven Training-free Efficient Diffusion Model (AT-EDM) framework that leverages attention maps to perform run-time pruning of redundant tokens.
arXiv Detail & Related papers (2024-05-08T17:56:47Z) - Advancing the Robustness of Large Language Models through Self-Denoised Smoothing [50.54276872204319]
Large language models (LLMs) have achieved significant success, but their vulnerability to adversarial perturbations has raised considerable concerns.
We propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions.
Unlike previous denoised smoothing techniques in computer vision, which require training a separate model to enhance the robustness of LLMs, our method offers significantly better efficiency and flexibility.
arXiv Detail & Related papers (2024-04-18T15:47:00Z) - Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models [73.48675708831328]
We propose a novel parameter- and computation-efficient tuning method for Multi-modal Large Language Models (MLLMs).
The Efficient Attention Skipping (EAS) method evaluates the attention redundancy and skips the less important MHAs to speed up inference.
The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference.
arXiv Detail & Related papers (2024-03-22T14:20:34Z) - $λ$-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space [61.091910046492345]
$\lambda$-ECLIPSE works in the latent space of a pre-trained CLIP model without relying on the diffusion UNet models.
$\lambda$-ECLIPSE performs multi-subject driven P-T2I with just 34M parameters and is trained on a mere 74 GPU hours.
arXiv Detail & Related papers (2024-02-07T19:07:10Z) - AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning [143.23123791557245]
Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP.
We propose AdaLoRA, which adaptively allocates the parameter budget among weight matrices according to their importance score.
We conduct extensive experiments with several pre-trained models on natural language processing, question answering, and natural language generation to validate the effectiveness of AdaLoRA.
arXiv Detail & Related papers (2023-03-18T22:36:25Z) - Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixture of experts (sparse MoEs).
We present Efficient Ensemble of Experts (E$^3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z)
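As a concrete illustration of the attention-replacement idea summarized in the RAM entry above, here is a minimal, hypothetical sketch that substitutes an attention sub-layer with only feedforward layers, a residual connection, and layer normalization. The module name `FeedForwardMixer` and the dimensions are illustrative and are not taken from the RAM code.

```python
# Hedged sketch of replacing an attention sub-layer with an MLP block (illustrative only).
import torch
import torch.nn as nn


class FeedForwardMixer(nn.Module):
    """Approximates an attention sub-layer using only feedforward layers,
    a residual connection, and layer normalization."""
    def __init__(self, d_model: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.GELU(),
            nn.Linear(hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(self.norm(x))  # residual keeps the modified model stable


if __name__ == "__main__":
    x = torch.randn(8, 96, 64)                 # (batch, lookback window, model dim)
    print(FeedForwardMixer(64, 128)(x).shape)  # same shape as the attention output it replaces
```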
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.