Sparse Binary Transformers for Multivariate Time Series Modeling
- URL: http://arxiv.org/abs/2308.04637v1
- Date: Wed, 9 Aug 2023 00:23:04 GMT
- Title: Sparse Binary Transformers for Multivariate Time Series Modeling
- Authors: Matt Gorbett, Hossein Shirazi, Indrakshi Ray
- Abstract summary: We show that lightweight Compressed Neural Networks can achieve accuracy comparable to dense floating-point Transformers.
Our model achieves favorable results across three time series learning tasks: classification, anomaly detection, and single-step forecasting.
We measure the computational savings of our approach over a range of metrics including parameter count, bit size, and floating point operation (FLOPs) count.
- Score: 1.3965477771846404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compressed Neural Networks have the potential to enable deep learning across new applications and smaller computational environments. However, understanding the range of learning tasks in which such models can succeed is not well studied. In this work, we apply sparse and binary-weighted Transformers to multivariate time series problems, showing that the lightweight models achieve accuracy comparable to that of dense floating-point Transformers of the same structure. Our model achieves favorable results across three time series learning tasks: classification, anomaly detection, and single-step forecasting. Additionally, to reduce the computational complexity of the attention mechanism, we apply two modifications, which show little to no decline in model performance: 1) in the classification task, we apply a fixed mask to the query, key, and value activations, and 2) for forecasting and anomaly detection, which rely on predicting outputs at a single point in time, we propose an attention mask to allow computation only at the current time step. Together, each compression technique and attention modification substantially reduces the number of non-zero operations necessary in the Transformer. We measure the computational savings of our approach over a range of metrics including parameter count, bit size, and floating point operation (FLOPs) count, showing up to a 53x reduction in storage size and up to 10.5x reduction in FLOPs.
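Below is a minimal PyTorch sketch of the three ingredients described in the abstract: a linear layer with a fixed sparsity mask and binarized weights, a fixed mask on the query/key/value activations (classification task), and attention computed only at the current time step (forecasting and anomaly detection). The layer names, mask densities, and the straight-through sign estimator are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of sparse binary projections and single-time-step attention.
# Densities, scaling, and module structure are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseBinaryLinear(nn.Module):
    """Linear layer with a fixed pruning mask and +/-alpha binarized weights."""

    def __init__(self, in_features: int, out_features: int, density: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # Fixed, non-trainable sparsity pattern: 1 = keep, 0 = pruned.
        self.register_buffer(
            "mask", (torch.rand(out_features, in_features) < density).float()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight * self.mask
        # Scale alpha = mean |w| over the surviving weights (BinaryConnect-style).
        alpha = w.abs().sum() / self.mask.sum().clamp(min=1.0)
        w_bin = torch.sign(w) * alpha * self.mask
        # Straight-through estimator: binary weights in the forward pass,
        # gradients flow to the underlying real-valued weights.
        return F.linear(x, w + (w_bin - w).detach())


batch, seq_len, d_model = 8, 24, 32
x = torch.randn(batch, seq_len, d_model)            # (batch, time, features)

proj_q = SparseBinaryLinear(d_model, d_model)
proj_k = SparseBinaryLinear(d_model, d_model)
proj_v = SparseBinaryLinear(d_model, d_model)

# Classification variant: a fixed binary mask on the Q/K/V activations.
act_mask = (torch.rand(d_model) < 0.5).float()
q, k, v = proj_q(x) * act_mask, proj_k(x) * act_mask, proj_v(x) * act_mask

# Forecasting / anomaly-detection variant: the abstract phrases this as an
# attention mask; computing attention only for the final query position
# yields the same output for the current time step.
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
out_last, _ = mha(q[:, -1:, :], k, v)               # shape: (batch, 1, d_model)
print(out_last.shape)
```

Restricting the query to the final position computes exactly the attention output that the single-time-step mask keeps, which is one source of the FLOP savings described above.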
Related papers
- Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somewhat surprising observation: the computation of a large proportion of layers in the diffusion transformer, through a caching mechanism, can be readily removed even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver++, alongside prior cache-based methods at the same inference speed.
arXiv Detail & Related papers (2024-06-03T18:49:57Z) - Towards efficient deep autoencoders for multivariate time series anomaly detection [0.8681331155356999]
We propose a novel compression method for deep autoencoders that involves three key factors.
First, pruning reduces the number of weights while preventing catastrophic drops in accuracy by means of a fast search process.
Second, linear and non-linear quantization reduces model complexity by lowering the number of bits used for each weight (a generic pruning-and-quantization sketch appears after this list).
arXiv Detail & Related papers (2024-03-04T19:22:09Z) - Mixed Precision Post Training Quantization of Neural Networks with Sensitivity Guided Search [7.392278887917975]
Mixed-precision quantization allows different tensors to be quantized to varying levels of numerical precision.
We evaluate our method on computer vision and natural language processing tasks and demonstrate latency reductions of up to 27.59% and 34.31%, respectively.
arXiv Detail & Related papers (2023-02-02T19:30:00Z) - Transform Once: Efficient Operator Learning in Frequency Domain [69.74509540521397]
We study deep neural networks designed to harness the structure in frequency domain for efficient learning of long-range correlations in space or time.
This work introduces a blueprint for frequency-domain learning through a single transform: transform once (T1).
arXiv Detail & Related papers (2022-11-26T01:56:05Z) - The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies the curious phenomenon that the activation maps of machine learning models with Transformer architectures are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers (a short sparsity-measurement sketch appears after this list).
arXiv Detail & Related papers (2022-10-12T15:25:19Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost (a rough clustered-attention sketch appears after this list).
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
In this work, we incorporate transformers into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z) - Rethinking Attention Mechanism in Time Series Classification [6.014777261874646]
We improve the efficiency and performance of the attention mechanism by proposing a flexible multi-head linear attention (FMLA).
We propose a simple but effective mask mechanism that helps reduce the influence of noise in time series and decrease the redundancy of the proposed FMLA.
We conduct extensive experiments on 85 UCR2018 datasets to compare our algorithm with 11 well-known ones and the results show that our algorithm has comparable performance in terms of top-1 accuracy.
arXiv Detail & Related papers (2022-07-14T07:15:06Z) - MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore accelerating large-model inference through conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z) - n-hot: Efficient bit-level sparsity for powers-of-two neural network quantization [0.0]
Powers-of-two (PoT) quantization reduces the number of bit operations of deep neural networks on resource-constrained hardware.
However, PoT quantization triggers a severe accuracy drop because of its limited representation ability.
We propose an efficient PoT quantization scheme that balances accuracy and costs in a memory-efficient way (a minimal PoT rounding sketch appears after this list).
arXiv Detail & Related papers (2021-03-22T10:13:12Z)
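For the deep-autoencoder compression entry above, the following is a rough, generic sketch of two of the ingredients it mentions: magnitude pruning and linear (uniform) quantization of a weight tensor. The sparsity level, bit width, and symmetric scheme are assumptions, and the paper's fast search procedure is not reproduced.

```python
# Generic magnitude pruning followed by uniform (linear) quantization.
import torch


def magnitude_prune(w: torch.Tensor, sparsity: float = 0.8) -> torch.Tensor:
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(sparsity * w.numel())
    if k == 0:
        return w.clone()
    threshold = w.abs().flatten().kthvalue(k).values
    return torch.where(w.abs() > threshold, w, torch.zeros_like(w))


def linear_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Symmetric uniform quantization to `num_bits` bits, dequantized back
    to float so it can be dropped into an existing model."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale


w = torch.randn(256, 128)
w_compressed = linear_quantize(magnitude_prune(w, sparsity=0.8), num_bits=8)
print((w_compressed == 0).float().mean())   # zero fraction, roughly the target sparsity
```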
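For the Lazy Neuron entry, a short sketch (with assumed dimensions and untrained weights) of how activation sparsity in a Transformer feed-forward block can be measured and related to the multiply-adds a sparse kernel could skip.

```python
# Measure post-ReLU activation sparsity in a feed-forward block and relate
# it to skippable FLOPs. Dimensions and the untrained weights are assumptions.
import torch
import torch.nn as nn

d_model, d_ff = 256, 1024
ffn_in, ffn_out = nn.Linear(d_model, d_ff), nn.Linear(d_ff, d_model)

x = torch.randn(32, 128, d_model)            # (batch, tokens, d_model)
h = torch.relu(ffn_in(x))                    # post-ReLU activations
sparsity = (h == 0).float().mean().item()    # fraction of exact zeros
print(f"activation sparsity: {sparsity:.2%}")

# Multiply-adds in the second linear layer that a sparse kernel could skip
# scale with the measured sparsity.
dense_flops = x.shape[0] * x.shape[1] * d_ff * d_model * 2
print(f"skippable FLOPs (approx): {dense_flops * sparsity:,.0f}")
```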
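For the ClusTR entry, a rough sketch of content-based sparse attention: key and value tokens are grouped by a naive k-means and every query attends to the cluster centroids, reducing the cost from O(N^2) to O(N*C). The cluster count and the clustering details are assumptions, not the paper's exact procedure.

```python
# Clustered key/value attention: queries attend to centroids, not all tokens.
import torch
import torch.nn.functional as F


def kmeans_centroids(tokens: torch.Tensor, num_clusters: int, iters: int = 5):
    """Naive k-means over tokens of shape (n, d); returns (num_clusters, d)."""
    n, _ = tokens.shape
    centroids = tokens[torch.randperm(n)[:num_clusters]].clone()
    for _ in range(iters):
        assign = torch.cdist(tokens, centroids).argmin(dim=1)
        for c in range(num_clusters):
            members = tokens[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)
    return centroids


def clustered_attention(q, k, v, num_clusters: int = 16):
    """q, k, v: (n, d). Attend from all queries to clustered keys/values."""
    k_c = kmeans_centroids(k, num_clusters)                   # (C, d)
    assign = torch.cdist(k, k_c).argmin(dim=1)                # cluster of each key
    v_c = torch.stack([
        v[assign == c].mean(dim=0) if (assign == c).any() else v.mean(dim=0)
        for c in range(num_clusters)
    ])                                                        # (C, d)
    scores = (q @ k_c.t()) / (q.shape[-1] ** 0.5)             # (n, C) instead of (n, n)
    return F.softmax(scores, dim=-1) @ v_c                    # (n, d)


tokens = torch.randn(196, 64)            # e.g. 14x14 vision tokens
out = clustered_attention(tokens, tokens, tokens, num_clusters=16)
print(out.shape)                         # torch.Size([196, 64])
```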
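For the n-hot entry, a minimal sketch of plain powers-of-two (PoT) quantization: each weight is rounded to a signed power of two so that multiplications can be replaced by bit shifts on suitable hardware. The exponent range is an assumption, and the paper's n-hot bit-level sparsity scheme is not reproduced.

```python
# Round weights to signed powers of two (plain PoT quantization).
import torch


def pot_quantize(w: torch.Tensor, min_exp: int = -6, max_exp: int = 0) -> torch.Tensor:
    """Map each nonzero weight to sign(w) * 2^e with e in [min_exp, max_exp]."""
    sign = torch.sign(w)
    mag = w.abs().clamp(min=2.0 ** (min_exp - 1))      # avoid log2(0)
    exp = torch.round(torch.log2(mag)).clamp(min_exp, max_exp)
    quantized = sign * 2.0 ** exp
    return torch.where(w == 0, torch.zeros_like(w), quantized)


w = torch.randn(4, 4) * 0.5
print(pot_quantize(w))                                 # entries are +/- 2^e or 0
```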