Approximate attention with MLP: a pruning strategy for attention-based model in multivariate time series forecasting
- URL: http://arxiv.org/abs/2410.24023v1
- Date: Thu, 31 Oct 2024 15:23:34 GMT
- Title: Approximate attention with MLP: a pruning strategy for attention-based model in multivariate time series forecasting
- Authors: Suhan Guo, Jiahong Deng, Yi Wei, Hui Dou, Furao Shen, Jian Zhao
- Abstract summary: This work proposes a new way to understand self-attention networks.
We show that the entire attention mechanism in the encoder can be reduced to an MLP formed by feedforward, skip-connection, and layer normalization operations.
- Score: 21.7023262988233
- Abstract: Attention-based architectures have become ubiquitous in time series forecasting tasks, including spatio-temporal forecasting (STF) and long-term time series forecasting (LTSF). Yet, our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show empirically that the entire attention mechanism in the encoder can be reduced to an MLP formed by feedforward, skip-connection, and layer normalization operations for temporal and/or spatial modeling in multivariate time series forecasting. Specifically, the Q, K, and V projections, the attention score calculation, the dot-product between the attention scores and V, and the final projection can be removed from attention-based networks without significantly degrading performance, so that the pruned network remains top-tier compared to other SOTA methods. For spatio-temporal networks, the MLP-replace-attention network achieves a reduction in FLOPs of $62.579\%$ with a loss in performance of less than $2.5\%$; for LTSF, a reduction in FLOPs of $42.233\%$ with a loss in performance of less than $2\%$.
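As a minimal sketch of the idea described in the abstract (not the authors' code; the module name, PyTorch framing, and dimensions are illustrative assumptions), the block below shows an encoder layer in which the attention sub-layer has been pruned away, leaving only the feed-forward network, skip-connection, and layer normalization.

```python
# Hedged sketch, not the authors' implementation: an encoder block whose
# attention sub-layer has been removed. Only the feed-forward MLP, the
# skip-connection, and layer normalization remain, as the abstract
# describes. Sizes and names below are illustrative assumptions.
import torch
import torch.nn as nn


class PrunedEncoderBlock(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        # The Q/K/V projections, the attention-score computation, the
        # score-V dot-product, and the output projection are all absent.
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, d_model) for temporal modeling, or
        # (batch, num_variables, d_model) for spatial modeling.
        return self.norm(x + self.dropout(self.ff(x)))  # skip-connection + layer norm
```

Relative to a standard encoder block, everything specific to attention is simply missing, which is where the reported FLOPs reduction comes from; the abstract's claim is that the pruned encoder loses less than roughly 2-2.5% in forecasting performance.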
Related papers
- Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more-efficient metric for performance estimation.
We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources.
We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific loss and downstream performance.
arXiv Detail & Related papers (2024-10-11T04:57:48Z) - Boosting MLPs with a Coarsening Strategy for Long-Term Time Series Forecasting [6.481470306093991]
Deep learning methods have demonstrated their strengths in long-term time series forecasting.
They often struggle to strike a balance between expressive power and computational efficiency.
Here, we propose a coarsening strategy that alleviates the problems associated with the prototypes by forming information granules in place of solitary temporal points.
Based purely on convolutions of structural simplicity, CP-Net maintains linear computational complexity and low runtime, while demonstrating an improvement of 4.1% over the SOTA method on seven forecasting benchmarks.
arXiv Detail & Related papers (2024-05-06T06:47:44Z) - HiMTM: Hierarchical Multi-Scale Masked Time Series Modeling with Self-Distillation for Long-Term Forecasting [17.70984737213973]
HiMTM is a hierarchical multi-scale masked time series modeling framework with self-distillation for long-term forecasting.
HiMTM integrates several key components, including: (1) a hierarchical multi-scale transformer (HMT) to capture temporal information at different scales; and (2) a decoupled encoder-decoder (DED) that directs the encoder towards feature extraction while the decoder focuses on pretext tasks.
Experiments on seven mainstream datasets show that HiMTM surpasses state-of-the-art self-supervised and end-to-end learning methods by a considerable margin of 3.16-68.54%.
arXiv Detail & Related papers (2024-01-10T09:00:03Z) - Short-Term Multi-Horizon Line Loss Rate Forecasting of a Distribution Network Using Attention-GCN-LSTM [9.460123100630158]
We propose Attention-GCN-LSTM, a novel method that combines Graph Convolutional Networks (GCN), Long Short-Term Memory (LSTM) and a three-level attention mechanism.
Our model enables accurate forecasting of line loss rates across multiple horizons.
arXiv Detail & Related papers (2023-12-19T06:47:22Z) - Frequency-domain MLPs are More Effective Learners in Time Series Forecasting [67.60443290781988]
Time series forecasting plays a key role in various industrial domains, including finance, traffic, energy, and healthcare.
Most MLP-based forecasting methods suffer from point-wise mappings and information bottlenecks.
We propose FreTS, a simple yet effective architecture built upon Frequency-domain MLPs for Time Series forecasting.
arXiv Detail & Related papers (2023-11-10T17:05:13Z) - Hierarchical Forecasting at Scale [55.658563862299495]
Existing hierarchical forecasting techniques scale poorly when the number of time series increases.
We propose to learn a coherent forecast for millions of time series with a single bottom-level forecast model.
We implement our sparse hierarchical loss function within an existing forecasting model at bol, a large European e-commerce platform.
arXiv Detail & Related papers (2023-10-19T15:06:31Z) - A Distance Correlation-Based Approach to Characterize the Effectiveness of Recurrent Neural Networks for Time Series Forecasting [1.9950682531209158]
We provide an approach to link time series characteristics with RNN components via the versatile metric of distance correlation.
We empirically show that the RNN activation layers learn the lag structures of time series well.
We also show that the activation layers cannot adequately model moving average and heteroskedastic time series processes.
arXiv Detail & Related papers (2023-07-28T22:32:08Z) - CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting [50.23240107430597]
We design a special Transformer, i.e., Channel Aligned Robust Blend Transformer (CARD for short), that addresses key shortcomings of channel-independent (CI) Transformers in time series forecasting.
First, CARD introduces a channel-aligned attention structure that allows it to capture temporal correlations among signals.
Second, in order to efficiently utilize the multi-scale knowledge, we design a token blend module to generate tokens with different resolutions.
Third, we introduce a robust loss function for time series forecasting to alleviate the potential overfitting issue.
arXiv Detail & Related papers (2023-05-20T05:16:31Z) - Short-Term Electricity Price Forecasting based on Graph Convolution Network and Attention Mechanism [5.331757100806177]
This paper tailors a spectral graph convolutional network (GCN) to greatly improve the accuracy of short-term locational marginal price (LMP) forecasting.
A three-branch network structure is then designed to match the structure of LMPs' compositions.
Case studies based on the IEEE-118 test system and real-world data from the PJM market validate that the proposed model outperforms existing forecasting models in accuracy.
arXiv Detail & Related papers (2021-07-26T15:44:07Z) - Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling [106.15327903038705]
Monocular visual odometry (VO) suffers severely from error accumulation during frame-to-frame pose estimation.
We present a self-supervised learning method for VO with special consideration for consistency over longer sequences.
We train the networks with purely self-supervised losses, including a cycle consistency loss that mimics the loop closure module in geometric VO.
arXiv Detail & Related papers (2020-07-21T17:59:01Z) - Deep Stock Predictions [58.720142291102135]
We consider the design of a trading strategy that performs portfolio optimization using Long Short Term Memory (LSTM) neural networks.
We then customize the loss function used to train the LSTM to increase the profit earned.
We find that the LSTM model with the customized loss function has improved performance in the trading bot over a regression baseline such as ARIMA.
arXiv Detail & Related papers (2020-06-08T23:37:47Z)