H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for
Sequences
- URL: http://arxiv.org/abs/2107.11906v1
- Date: Sun, 25 Jul 2021 23:07:03 GMT
- Title: H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for
Sequences
- Authors: Zhenhai Zhu and Radu Soricut
- Abstract summary: We describe an efficient hierarchical method to compute attention in the Transformer architecture.
Our method is superior to alternative sub-quadratic proposals by over +6 points on average on the Long Range Arena benchmark.
It also sets a new SOTA test perplexity on the One-Billion Word dataset with 5x fewer model parameters than the previous-best Transformer-based models.
- Score: 16.59989033959959
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We describe an efficient hierarchical method to compute attention in the
Transformer architecture. The proposed attention mechanism exploits a matrix
structure similar to the Hierarchical Matrix (H-Matrix) developed by the
numerical analysis community, and has linear run time and memory complexity. We
perform extensive experiments to show that the inductive bias embodied by our
hierarchical attention is effective in capturing the hierarchical structure in
the sequences typical for natural language and vision tasks. Our method is
superior to alternative sub-quadratic proposals by over +6 points on average on
the Long Range Arena benchmark. It also sets a new SOTA test perplexity on the
One-Billion Word dataset with 5x fewer model parameters than the previous-best
Transformer-based models.
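The hierarchical attention can be pictured as exact attention within small blocks plus attention over coarsened summaries of more distant blocks. The following is a minimal NumPy sketch of that two-level idea, not the authors' actual H-Transformer-1D algorithm; the block size, the mean-pooling used for coarsening, and the way the two levels are merged are assumptions made purely for illustration.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def hierarchical_attention(q, k, v, block=64):
        """q, k, v: (seq_len, d); seq_len is assumed divisible by `block`."""
        n, d = q.shape
        nb = n // block
        qb = q.reshape(nb, block, d)
        kb = k.reshape(nb, block, d)
        vb = v.reshape(nb, block, d)

        # Fine level: exact attention inside each block (cost ~ n * block).
        local_scores = np.einsum('bid,bjd->bij', qb, kb) / np.sqrt(d)
        local_out = np.einsum('bij,bjd->bid', softmax(local_scores), vb)

        # Coarse level: every query attends to mean-pooled block summaries
        # (cost ~ n * n/block), standing in for long-range interactions.
        k_coarse = kb.mean(axis=1)                      # (nb, d)
        v_coarse = vb.mean(axis=1)                      # (nb, d)
        coarse_scores = q @ k_coarse.T / np.sqrt(d)     # (n, nb)
        coarse_out = softmax(coarse_scores) @ v_coarse  # (n, d)

        # Merge the two levels; a plain average is used only for illustration.
        return 0.5 * (local_out.reshape(n, d) + coarse_out)

Each query touches `block` local positions plus `seq_len / block` block summaries, so the cost grows roughly linearly in sequence length instead of quadratically; the H-Matrix construction in the paper uses a deeper hierarchy of such levels.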
Related papers
- Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation [53.88562288388169]
A common strategy for Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) involves adapting the model to downstream tasks by learning an adaptation matrix.
We propose a novel PEFT approach inspired by Singular Value Decomposition (SVD) for representing the adaptation matrix.
SVD decomposes a matrix into the product of a left unitary matrix, a diagonal matrix of scaling values, and a right unitary matrix.
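As a rough illustration of that factorized form, and not the paper's Householder-based construction, the sketch below parameterizes a weight update as U·diag(s)·V^T and trains only the small factors; the layer size and rank are hypothetical values chosen for the example.

    import numpy as np

    d_in, d_out, r = 768, 768, 8              # hypothetical layer size and adaptation rank
    W = np.random.randn(d_out, d_in) * 0.02   # frozen pre-trained weight

    U = np.random.randn(d_out, r) * 0.01      # trainable left factor
    s = np.zeros(r)                           # trainable scaling values (diagonal)
    V = np.random.randn(d_in, r) * 0.01       # trainable right factor

    def adapted_forward(x):
        """x: (batch, d_in). Frozen weight plus an SVD-style low-rank update."""
        delta = U @ np.diag(s) @ V.T          # rank-r adaptation, cheap to store and train
        return x @ (W + delta).T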
arXiv Detail & Related papers (2024-10-30T12:08:30Z)
- Sliceformer: Make Multi-head Attention as Simple as Sorting in Discriminative Tasks [32.33355192614434]
We propose an effective and efficient surrogate of the Transformer, called Sliceformer.
Our Sliceformer replaces the classic MHA mechanism with an extremely simple "slicing-sorting" operation.
Our Sliceformer achieves comparable or better performance with lower memory cost and faster speed than the Transformer and its variants.
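The sketch below illustrates that slicing-sorting idea in its simplest form: linearly project the input into channel slices, then sort each slice along the sequence dimension instead of computing attention weights. It is only an illustration and may not match the paper's exact operator.

    import numpy as np

    def slice_sort(x, W):
        """x: (seq_len, d_model); W: (d_model, d_model) projection ('slicing')."""
        z = x @ W                  # slice: project into channels
        return np.sort(z, axis=0)  # sort each channel along the sequence axis

    seq_len, d_model = 128, 64
    x = np.random.randn(seq_len, d_model)
    W = np.random.randn(d_model, d_model) / np.sqrt(d_model)
    out = slice_sort(x, W)         # (seq_len, d_model); O(n log n) per channel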
arXiv Detail & Related papers (2023-10-26T14:43:07Z)
- An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
- Deep Unrolling for Nonconvex Robust Principal Component Analysis [75.32013242448151]
We design algorithms for Robust Principal Component Analysis (RPCA), which consists in decomposing a matrix into the sum of a low-rank matrix and a sparse matrix.
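A minimal sketch of that decomposition, not the paper's learned (deep-unrolled) solver, alternates a hard rank truncation with element-wise soft-thresholding; the rank, threshold, and iteration count are arbitrary assumptions.

    import numpy as np

    def rpca_sketch(M, rank=2, sparse_thresh=0.1, iters=50):
        """Split M into L (low-rank) + S (sparse) by naive alternating projections."""
        L = np.zeros_like(M)
        S = np.zeros_like(M)
        for _ in range(iters):
            # Low-rank step: best rank-`rank` approximation of the residual M - S.
            U, sig, Vt = np.linalg.svd(M - S, full_matrices=False)
            L = (U[:, :rank] * sig[:rank]) @ Vt[:rank]
            # Sparse step: soft-threshold the residual M - L.
            R = M - L
            S = np.sign(R) * np.maximum(np.abs(R) - sparse_thresh, 0.0)
        return L, S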
arXiv Detail & Related papers (2023-07-12T03:48:26Z)
- Mode-wise Principal Subspace Pursuit and Matrix Spiked Covariance Model [13.082805815235975]
We introduce a novel framework called Mode-wise Principal Subspace Pursuit (MOP-UP) to extract hidden variations in both the row and column dimensions for matrix data.
The effectiveness and practical merits of the proposed framework are demonstrated through experiments on both simulated and real datasets.
arXiv Detail & Related papers (2023-07-02T13:59:47Z)
- Classification of BCI-EEG based on augmented covariance matrix [0.0]
We propose a new framework based on the augmented covariance extracted from an autoregressive model to improve motor imagery classification.
We will test our approach on several datasets and several subjects using the MOABB framework.
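As a rough illustration of the augmented-covariance idea, and not necessarily the paper's exact procedure, the sketch below stacks time-lagged copies of a multichannel signal (an autoregressive-style embedding) and takes the covariance of the stacked signal; the lag order is an assumption.

    import numpy as np

    def augmented_covariance(X, order=3):
        """X: (channels, samples). Stack `order` lagged copies, then take covariance."""
        C, T = X.shape
        lagged = np.concatenate(
            [X[:, k:T - order + k + 1] for k in range(order)], axis=0)
        lagged = lagged - lagged.mean(axis=1, keepdims=True)
        return lagged @ lagged.T / lagged.shape[1]   # (C*order, C*order)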
arXiv Detail & Related papers (2023-02-09T09:04:25Z)
- Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences [52.6022911513076]
Transformer-based models are not efficient in processing long sequences due to the quadratic space and time complexity of the self-attention modules.
Linformer and Informer were proposed to reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection, respectively.
Based on the theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve the accuracy of matrix approximation to self-attention.
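The sketch below shows the row-selection flavor of such approximations in its simplest form: each query attends to a random subset of key/value rows. It illustrates the general idea only and is not Skeinformer's actual algorithm; the sample size and uniform sampling are assumptions.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def subsampled_attention(q, k, v, m=64, rng=None):
        """q, k, v: (seq_len, d); attends to m sampled rows instead of all seq_len."""
        rng = np.random.default_rng(0) if rng is None else rng
        n, d = k.shape
        idx = rng.choice(n, size=min(m, n), replace=False)
        scores = q @ k[idx].T / np.sqrt(d)   # (n, m) instead of (n, n)
        return softmax(scores) @ v[idx]      # (n, d)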
arXiv Detail & Related papers (2021-12-10T06:58:05Z)
- ORCHARD: A Benchmark For Measuring Systematic Generalization of Multi-Hierarchical Reasoning [8.004425059996963]
We show that Transformer and LSTM models surprisingly fail in systematic generalization.
We also show that with increased references between hierarchies, Transformer performs no better than random.
arXiv Detail & Related papers (2021-11-28T03:11:37Z)
- Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model [58.17021225930069]
We explain the rationale of the Vision Transformer by analogy with the proven, practical Evolutionary Algorithm (EA).
We propose a more efficient EAT model, and design task-related heads to deal with different tasks more flexibly.
Our approach achieves state-of-the-art results on the ImageNet classification task compared with recent vision transformer works.
arXiv Detail & Related papers (2021-05-31T16:20:03Z)
- Multi-View Spectral Clustering with High-Order Optimal Neighborhood Laplacian Matrix [57.11971786407279]
Multi-view spectral clustering can effectively reveal the intrinsic cluster structure among data.
This paper proposes a multi-view spectral clustering algorithm that learns a high-order optimal neighborhood Laplacian matrix.
Our proposed algorithm generates the optimal Laplacian matrix by searching the neighborhood of a linear combination of both first-order and high-order base Laplacian matrices.
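As a loose illustration, not the paper's optimization procedure, base Laplacians can be built from an affinity matrix and its powers (higher-order neighborhoods) and then linearly combined; the fixed, equal weights below stand in for the combination that the paper actually learns.

    import numpy as np

    def high_order_laplacians(A, max_order=2):
        """A: (n, n) affinity matrix for one view. Laplacians of A, A^2, ..., A^max_order."""
        Ls, P = [], np.eye(A.shape[0])
        for _ in range(max_order):
            P = P @ A                                 # higher-order affinity A^k
            Ls.append(np.diag(P.sum(axis=1)) - P)     # its graph Laplacian
        return Ls

    def combined_laplacian(A, max_order=2, weights=None):
        Ls = high_order_laplacians(A, max_order)
        weights = np.ones(len(Ls)) / len(Ls) if weights is None else weights
        return sum(w * L for w, L in zip(weights, Ls))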
arXiv Detail & Related papers (2020-08-31T12:28:40Z)
- Hyperparameter optimization with REINFORCE and Transformers [2.1404235519012076]
Reinforcement Learning has yielded promising results for Neural Architecture Search (NAS).
We demonstrate how its performance can be improved by using a simplified Transformer block to model the policy network.
arXiv Detail & Related papers (2020-06-01T13:35:48Z)