RAM: Replace Attention with MLP for Efficient Multivariate Time Series Forecasting
- URL: http://arxiv.org/abs/2410.24023v2
- Date: Sat, 10 May 2025 08:10:54 GMT
- Title: RAM: Replace Attention with MLP for Efficient Multivariate Time Series Forecasting
- Authors: Suhan Guo, Jiahong Deng, Yi Wei, Hui Dou, Furao Shen, Jian Zhao
- Abstract summary: We propose a novel pruning strategy that approximates the attention mechanism using only feedforward layers, residual connections, and layer normalization. RAM achieves a $62.579\%$ reduction in FLOPs for spatio-temporal models with less than $2.5\%$ performance drop, and a $42.233\%$ FLOPs reduction for LTSF models with less than $2\%$ performance drop.
- Score: 21.7023262988233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based architectures have become ubiquitous in time series forecasting tasks, including spatio-temporal (STF) and long-term time series forecasting (LTSF). Yet, our understanding of the reasons for their effectiveness remains limited. In this work, we propose a novel pruning strategy, $\textbf{R}$eplace $\textbf{A}$ttention with $\textbf{M}$LP (RAM), that approximates the attention mechanism using only feedforward layers, residual connections, and layer normalization for temporal and/or spatial modeling in multivariate time series forecasting. Specifically, the Q, K, and V projections, the attention score calculation, the dot-product between the attention scores and V, and the final projection can be removed from attention-based networks without significantly degrading performance, so that the pruned network remains top-tier compared with other SOTA methods. RAM achieves a $62.579\%$ reduction in FLOPs for spatio-temporal models with less than $2.5\%$ performance drop, and a $42.233\%$ FLOPs reduction for LTSF models with less than $2\%$ performance drop.
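The abstract does not include code, so the following is a minimal PyTorch sketch of the kind of substitution RAM describes: an attention sublayer (Q/K/V projections, score computation, and output projection) swapped for a feedforward block with only a residual connection and layer normalization. Module names, dimensions, and the GELU/hidden-width choices are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the RAM-style substitution described in the abstract:
# drop the attention sublayer and keep only an MLP, a residual connection,
# and layer normalization for temporal/spatial mixing. Not the authors' code.
import torch
import torch.nn as nn


class AttentionSublayer(nn.Module):
    """Standard pre-norm self-attention sublayer (the part RAM removes)."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out


class RAMSublayer(nn.Module):
    """Replacement: feedforward + residual + layer norm only (assumed form)."""

    def __init__(self, d_model: int, d_hidden: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):  # x: (batch, seq_len, d_model)
        return x + self.mlp(self.norm(x))


if __name__ == "__main__":
    x = torch.randn(8, 96, 64)                 # (batch, lookback length, model width)
    print(AttentionSublayer(64)(x).shape)      # torch.Size([8, 96, 64])
    print(RAMSublayer(64)(x).shape)            # torch.Size([8, 96, 64])
```

Both sublayers map the same input shape to the same output shape, which is what allows the replacement to be dropped into an existing attention-based forecaster.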
Related papers
- SPAT: Sensitivity-based Multihead-attention Pruning on Time Series Forecasting Models [8.817690876855728]
We propose a structured pruning method, SPAT ($\textbf{S}$ensitivity $\textbf{P}$runer for $\textbf{At}$tention), which selectively removes attention mechanisms and yields highly effective models. Experiments on multiple datasets demonstrate that SPAT-pruned models achieve reductions of 2.842% in MSE, 1.996% in MAE, and 35.274% in FLOPs.
arXiv Detail & Related papers (2025-05-13T17:39:31Z) - SWIFT: Mapping Sub-series with Wavelet Decomposition Improves Time Series Forecasting [2.6764607949560593]
$\textit{SWIFT}$ is a lightweight model for Long-term Time Series Forecasting that is both powerful and efficient in deployment and inference. We conduct comprehensive experiments, and the results show that $\textit{SWIFT}$ achieves state-of-the-art (SOTA) performance on multiple datasets.
arXiv Detail & Related papers (2025-01-27T16:26:07Z) - Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more efficient metric for performance estimation.
We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources.
We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific losses and downstream performance.
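The two-stage recipe summarized above (power-law fits from FLOPs to per-domain pre-training loss, then a small network from those losses to a downstream score) can be illustrated with a short synthetic sketch. The functional form, library calls, and data below are assumptions for illustration, not the paper's code.

```python
# Illustrative sketch of the two-stage recipe summarized above (synthetic data):
# (1) fit a power law mapping compute (FLOPs) to each domain's pre-training loss,
# (2) feed the predicted per-domain losses into a small two-layer network that
#     predicts a downstream score.
import numpy as np
import torch
import torch.nn as nn
from scipy.optimize import curve_fit


def power_law(flops, a, b, c):
    # L(C) = a * C^{-b} + c, the usual saturating power-law form
    return a * np.power(flops, -b) + c


# Stage 1: fit one power law per data domain from (FLOPs, loss) observations.
flops = np.logspace(18, 22, 10)
domain_losses = np.stack([
    power_law(flops, 2.0e3, 0.12, 1.8) + np.random.normal(0, 0.01, flops.size),
    power_law(flops, 5.0e3, 0.15, 2.1) + np.random.normal(0, 0.01, flops.size),
])
fits = [curve_fit(power_law, flops, y, p0=(1e3, 0.1, 1.0), maxfev=10_000)[0]
        for y in domain_losses]

# Stage 2: a two-layer network from per-domain losses to a downstream metric.
# (Untrained here; in practice it would be fit to observed (losses, score) pairs.)
head = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
losses_at_target = torch.tensor(
    [[power_law(1e23, *p) for p in fits]], dtype=torch.float32)
print("predicted downstream score:", head(losses_at_target).item())
```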
arXiv Detail & Related papers (2024-10-11T04:57:48Z) - MixLinear: Extreme Low Resource Multivariate Time Series Forecasting with 0.1K Parameters [6.733646592789575]
Long-term Time Series Forecasting (LTSF) involves predicting long-term values by analyzing a large amount of historical time-series data to identify patterns and trends.
Transformer-based models offer high forecasting accuracy, but they are often too compute-intensive to be deployed on devices with hardware constraints.
We propose MixLinear, an ultra-lightweight time series forecasting model specifically designed for resource-constrained devices.
arXiv Detail & Related papers (2024-10-02T23:04:57Z) - Boosting MLPs with a Coarsening Strategy for Long-Term Time Series Forecasting [6.481470306093991]
Deep learning methods have demonstrated their strengths in long-term time series forecasting.
However, they often struggle to strike a balance between expressive power and computational efficiency.
Here, we propose a coarsening strategy that alleviates the problems associated with the prototypes by forming information granules in place of solitary temporal points.
The resulting model, CP-Net, is based purely on structurally simple convolutions; it maintains linear computational complexity and low runtime while demonstrating an improvement of 4.1% over the SOTA method on seven forecasting benchmarks.
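One plausible reading of the coarsening idea above is sketched below: short segments of the series are summarized into "information granules" (here, per-segment mean and standard deviation, an assumption of this sketch) and the coarsened sequence is then processed with plain 1D convolutions. This is an illustration of the concept, not the CP-Net architecture.

```python
# Hypothetical coarsening step in the spirit of the summary above: replace solitary
# time points with "granules" that summarize short segments, then process the
# coarsened sequence with ordinary 1D convolutions.
import torch
import torch.nn as nn


def coarsen(x: torch.Tensor, segment: int) -> torch.Tensor:
    # x: (batch, channels, length) -> (batch, 2*channels, length // segment)
    b, c, t = x.shape
    segs = x[:, :, : t - t % segment].reshape(b, c, -1, segment)
    return torch.cat([segs.mean(dim=-1), segs.std(dim=-1)], dim=1)


conv = nn.Conv1d(in_channels=2, out_channels=8, kernel_size=3, padding=1)
series = torch.randn(4, 1, 96)          # 4 univariate series of length 96
granules = coarsen(series, segment=4)   # (4, 2, 24): one granule per 4-step segment
print(conv(granules).shape)             # torch.Size([4, 8, 24])
```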
arXiv Detail & Related papers (2024-05-06T06:47:44Z) - HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and the feedforward layers achieves nearly matching pretraining and downstream accuracy, and speeds up inference by $1.47\times$ on a single TPUv5e device.
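A hedged sketch of the first HiRE component as summarized above: score every row with a cheap compressed copy of the weight matrix, keep a generous candidate set for high recall, then run the exact computation only on those rows. The per-row low-precision compression below is an illustrative stand-in, not the paper's actual scheme, and in a real system the compressed matmul would run in low precision to actually be cheaper.

```python
# Illustrative approximate top-k: cheap scoring pass + exact pass on candidates.
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 512, 8192, 32                       # input dim, output rows, final top-k
W = rng.standard_normal((n, d)) / np.sqrt(d)  # e.g. an FFN or softmax projection
x = rng.standard_normal(d)

# Offline compression: coarse 3-bit-style quantization with a per-row scale.
scale = np.abs(W).max(axis=1, keepdims=True) / 3.0
W_q = np.round(W / scale).clip(-3, 3) * scale

# Cheap pass: approximate scores, then an enlarged top-2k candidate set for recall.
approx = W_q @ x
candidates = np.argpartition(-approx, 2 * k)[: 2 * k]

# Exact pass restricted to the candidates, then the final top-k.
exact = W[candidates] @ x
top_k = candidates[np.argsort(-exact)[:k]]

true_top_k = np.argsort(-(W @ x))[:k]
print("recall of true top-k:", len(set(top_k) & set(true_top_k)) / k)
```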
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - HiMTM: Hierarchical Multi-Scale Masked Time Series Modeling with Self-Distillation for Long-Term Forecasting [17.70984737213973]
HiMTM is a hierarchical multi-scale masked time series modeling framework with self-distillation for long-term forecasting.
HiMTM integrates four key components, including (1) a hierarchical multi-scale transformer (HMT) to capture temporal information at different scales and (2) a decoupled encoder-decoder (DED) that directs the encoder towards feature extraction while the decoder focuses on pretext tasks.
Experiments on seven mainstream datasets show that HiMTM surpasses state-of-the-art self-supervised and end-to-end learning methods by a considerable margin of 3.16-68.54%.
arXiv Detail & Related papers (2024-01-10T09:00:03Z) - Short-Term Multi-Horizon Line Loss Rate Forecasting of a Distribution Network Using Attention-GCN-LSTM [9.460123100630158]
We propose Attention-GCN-LSTM, a novel method that combines Graph Convolutional Networks (GCN), Long Short-Term Memory (LSTM) and a three-level attention mechanism.
Our model enables accurate forecasting of line loss rates across multiple horizons.
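A generic sketch of the GCN + LSTM combination named above (the three-level attention mechanism is omitted for brevity): a graph convolution mixes information across nodes at each time step, then an LSTM models each node's temporal dynamics. The graph, sizes, and layer choices are illustrative assumptions, not the paper's model.

```python
# Generic GCN-then-LSTM pattern for spatio-temporal forecasting (illustrative only).
import torch
import torch.nn as nn


class GCNLSTM(nn.Module):
    def __init__(self, adj: torch.Tensor, in_dim: int, hid: int):
        super().__init__()
        # Symmetrically normalized adjacency with self-loops: D^{-1/2} (A + I) D^{-1/2}
        a = adj + torch.eye(adj.size(0))
        d = a.sum(dim=1).pow(-0.5)
        self.register_buffer("a_norm", d.unsqueeze(1) * a * d.unsqueeze(0))
        self.gcn = nn.Linear(in_dim, hid)
        self.lstm = nn.LSTM(hid, hid, batch_first=True)
        self.head = nn.Linear(hid, 1)

    def forward(self, x):                      # x: (batch, time, nodes, features)
        h = torch.relu(self.gcn(torch.einsum("ij,btjf->btif", self.a_norm, x)))
        b, t, n, f = h.shape
        h, _ = self.lstm(h.permute(0, 2, 1, 3).reshape(b * n, t, f))
        return self.head(h[:, -1]).view(b, n)  # one-step-ahead forecast per node


adj = (torch.rand(10, 10) > 0.7).float()
adj = torch.maximum(adj, adj.T)                # make the illustrative graph symmetric
model = GCNLSTM(adj, in_dim=3, hid=32)
print(model(torch.randn(4, 24, 10, 3)).shape)  # torch.Size([4, 10])
```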
arXiv Detail & Related papers (2023-12-19T06:47:22Z) - TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models [52.454274602380124]
Diffusion models heavily depend on the time-step $t$ to achieve satisfactory multi-round denoising.
We propose a Temporal Feature Maintenance Quantization (TFMQ) framework building upon a Temporal Information Block.
Powered by the pioneering block design, we devise temporal information aware reconstruction (TIAR) and finite set calibration (FSC) to align the full-precision temporal features.
arXiv Detail & Related papers (2023-11-27T12:59:52Z) - Frequency-domain MLPs are More Effective Learners in Time Series Forecasting [67.60443290781988]
Time series forecasting plays a key role in many industrial domains, including finance, traffic, energy, and healthcare.
Most MLP-based forecasting methods suffer from point-wise mappings and an information bottleneck.
We propose FreTS, a simple yet effective architecture built upon Frequency-domain MLPs for Time Series forecasting.
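A minimal, hedged sketch of the frequency-domain MLP idea summarized above: move the lookback window into the frequency domain with an FFT, mix frequency components with a small MLP applied to the real and imaginary parts, transform back, and project to the forecast horizon. This illustrates the general idea only; it is not the FreTS implementation.

```python
# Illustrative frequency-domain MLP forecaster (not the FreTS architecture).
import torch
import torch.nn as nn


class FreqMLPForecaster(nn.Module):
    def __init__(self, lookback: int, horizon: int, hidden: int = 64):
        super().__init__()
        n_freq = lookback // 2 + 1                 # rFFT output length
        self.mix = nn.Sequential(                  # MLP over stacked real/imag parts
            nn.Linear(2 * n_freq, hidden), nn.GELU(), nn.Linear(hidden, 2 * n_freq))
        self.proj = nn.Linear(lookback, horizon)   # time-domain projection to horizon

    def forward(self, x):                          # x: (batch, channels, lookback)
        spec = torch.fft.rfft(x, dim=-1)
        feats = torch.cat([spec.real, spec.imag], dim=-1)
        real, imag = self.mix(feats).chunk(2, dim=-1)
        filtered = torch.fft.irfft(torch.complex(real, imag), n=x.size(-1), dim=-1)
        return self.proj(filtered)                 # (batch, channels, horizon)


model = FreqMLPForecaster(lookback=96, horizon=24)
print(model(torch.randn(8, 7, 96)).shape)          # torch.Size([8, 7, 24])
```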
arXiv Detail & Related papers (2023-11-10T17:05:13Z) - Hierarchical Forecasting at Scale [55.658563862299495]
Existing hierarchical forecasting techniques scale poorly when the number of time series increases.
We propose to learn a coherent forecast for millions of time series with a single bottom-level forecast model.
We implement our sparse hierarchical loss function within an existing forecasting model at bol, a large European e-commerce platform.
arXiv Detail & Related papers (2023-10-19T15:06:31Z) - A Distance Correlation-Based Approach to Characterize the Effectiveness of Recurrent Neural Networks for Time Series Forecasting [1.9950682531209158]
We provide an approach to link time series characteristics with RNN components via the versatile metric of distance correlation.
We empirically show that the RNN activation layers learn the lag structures of time series well.
We also show that the activation layers cannot adequately model moving average and heteroskedastic time series processes.
arXiv Detail & Related papers (2023-07-28T22:32:08Z) - Unlocking the Potential of Deep Learning in Peak-Hour Series Forecasting [19.396667925659507]
This paper presents Seq2Peak, a novel framework designed specifically for Peak-Hour Series Forecasting (PHSF) tasks.
It offers two key components: the CyclicNorm pipeline to mitigate the non-stationarity issue and a simple yet effective trainable-parameter-free peak-hour decoder.
Experiments on publicly available time series datasets demonstrate the effectiveness of the proposed framework.
arXiv Detail & Related papers (2023-07-04T09:38:38Z) - CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting [50.23240107430597]
We design a special Transformer, i.e., Channel Aligned Robust Blend Transformer (CARD for short), that addresses key shortcomings of channel-independent (CI) Transformers in time series forecasting.
First, CARD introduces a channel-aligned attention structure that allows it to capture both temporal correlations among signals and dependencies among multiple variables over time.
Second, in order to efficiently utilize the multi-scale knowledge, we design a token blend module to generate tokens with different resolutions.
Third, we introduce a robust loss function for time series forecasting to alleviate the potential overfitting issue.
arXiv Detail & Related papers (2023-05-20T05:16:31Z) - Short-Term Electricity Price Forecasting based on Graph Convolution Network and Attention Mechanism [5.331757100806177]
This paper tailors a spectral graph convolutional network (GCN) to greatly improve the accuracy of short-term locational marginal price (LMP) forecasting.
A three-branch network structure is then designed to match the composition of LMPs.
Case studies based on the IEEE-118 test system and real-world data from the PJM market validate that the proposed model outperforms existing forecasting models in accuracy.
arXiv Detail & Related papers (2021-07-26T15:44:07Z) - A Novel Approach for Classification and Forecasting of Time Series in Particle Accelerators [52.77024349608834]
A novel time series classification approach is applied to decrease beam time loss in the High Intensity Proton Accelerator complex.
Our best performing interlock-to-stable classifier reaches an Area under the ROC Curve value of $0.71 \pm 0.01$ compared to $0.65 \pm 0.01$ of a Random Forest model.
arXiv Detail & Related papers (2021-02-01T11:53:14Z) - Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling [106.15327903038705]
Monocular visual odometry (VO) suffers severely from error accumulation during frame-to-frame pose estimation.
We present a self-supervised learning method for VO with special consideration for consistency over longer sequences.
We train the networks with purely self-supervised losses, including a cycle consistency loss that mimics the loop closure module in geometric VO.
arXiv Detail & Related papers (2020-07-21T17:59:01Z) - Deep Stock Predictions [58.720142291102135]
We consider the design of a trading strategy that performs portfolio optimization using Long Short Term Memory (LSTM) neural networks.
We then customize the loss function used to train the LSTM to increase the profit earned.
We find the LSTM model with the customized loss function to have an improved performance in the trading bot over a regression baseline such as ARIMA.
arXiv Detail & Related papers (2020-06-08T23:37:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.