H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for
Sequences
- URL: http://arxiv.org/abs/2107.11906v1
- Date: Sun, 25 Jul 2021 23:07:03 GMT
- Title: H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for
Sequences
- Authors: Zhenhai Zhu and Radu Soricut
- Abstract summary: We describe an efficient hierarchical method to compute attention in the Transformer architecture.
Our method is superior to alternative sub-quadratic proposals by over +6 points on average on the Long Range Arena benchmark.
It also sets a new SOTA test perplexity on the One-Billion Word dataset with 5x fewer model parameters than the previous-best Transformer-based models.
- Score: 16.59989033959959
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We describe an efficient hierarchical method to compute attention in the
Transformer architecture. The proposed attention mechanism exploits a matrix
structure similar to the Hierarchical Matrix (H-Matrix) developed by the
numerical analysis community, and has linear run time and memory complexity. We
perform extensive experiments to show that the inductive bias embodied by our
hierarchical attention is effective in capturing the hierarchical structure in
the sequences typical for natural language and vision tasks. Our method is
superior to alternative sub-quadratic proposals by over +6 points on average on
the Long Range Arena benchmark. It also sets a new SOTA test perplexity on the
One-Billion Word dataset with 5x fewer model parameters than the previous-best
Transformer-based models.
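The hierarchical attention can be pictured as exact attention within small blocks plus attention over coarsened summaries of more distant blocks. The following is a minimal NumPy sketch of that two-level idea, not the authors' actual H-Transformer-1D algorithm; the block size, the mean-pooling used for coarsening, and the way the two levels are merged are assumptions made purely for illustration.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def hierarchical_attention(q, k, v, block=64):
        """q, k, v: (seq_len, d); seq_len is assumed divisible by `block`."""
        n, d = q.shape
        nb = n // block
        qb = q.reshape(nb, block, d)
        kb = k.reshape(nb, block, d)
        vb = v.reshape(nb, block, d)

        # Fine level: exact attention inside each block (cost ~ n * block).
        local_scores = np.einsum('bid,bjd->bij', qb, kb) / np.sqrt(d)
        local_out = np.einsum('bij,bjd->bid', softmax(local_scores), vb)

        # Coarse level: every query attends to mean-pooled block summaries
        # (cost ~ n * n/block), standing in for long-range interactions.
        k_coarse = kb.mean(axis=1)                      # (nb, d)
        v_coarse = vb.mean(axis=1)                      # (nb, d)
        coarse_scores = q @ k_coarse.T / np.sqrt(d)     # (n, nb)
        coarse_out = softmax(coarse_scores) @ v_coarse  # (n, d)

        # Merge the two levels; a plain average is used only for illustration.
        return 0.5 * (local_out.reshape(n, d) + coarse_out)

Each query touches `block` local positions plus `seq_len / block` block summaries, so the cost grows roughly linearly in sequence length instead of quadratically; the H-Matrix construction in the paper uses a deeper hierarchy of such levels.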
Related papers
- Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation [53.88562288388169]
A common strategy for Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) involves adapting the model to downstream tasks by learning an adaptation matrix.
We propose a novel PEFT approach inspired by Singular Value Decomposition (SVD) for representing the adaptation matrix.
SVD decomposes a matrix into the product of a left unitary matrix, a diagonal matrix of scaling values, and a right unitary matrix.
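As a rough illustration of that factorized form, and not the paper's Householder-based construction, the sketch below parameterizes a weight update as U·diag(s)·V^T and trains only the small factors; the layer size and rank are hypothetical values chosen for the example.

    import numpy as np

    d_in, d_out, r = 768, 768, 8              # hypothetical layer size and adaptation rank
    W = np.random.randn(d_out, d_in) * 0.02   # frozen pre-trained weight

    U = np.random.randn(d_out, r) * 0.01      # trainable left factor
    s = np.zeros(r)                           # trainable scaling values (diagonal)
    V = np.random.randn(d_in, r) * 0.01       # trainable right factor

    def adapted_forward(x):
        """x: (batch, d_in). Frozen weight plus an SVD-style low-rank update."""
        delta = U @ np.diag(s) @ V.T          # rank-r adaptation, cheap to store and train
        return x @ (W + delta).T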
arXiv Detail & Related papers (2024-10-30T12:08:30Z)
- Sliceformer: Make Multi-head Attention as Simple as Sorting in Discriminative Tasks [32.33355192614434]
We propose an effective and efficient surrogate of the Transformer, called Sliceformer.
Our Sliceformer replaces the classic MHA mechanism with an extremely simple "slicing-sorting" operation.
Our Sliceformer achieves comparable or better performance with lower memory cost and faster speed than the Transformer and its variants.
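The sketch below illustrates that slicing-sorting idea in its simplest form: linearly project the input into channel slices, then sort each slice along the sequence dimension instead of computing attention weights. It is only an illustration and may not match the paper's exact operator.

    import numpy as np

    def slice_sort(x, W):
        """x: (seq_len, d_model); W: (d_model, d_model) projection ('slicing')."""
        z = x @ W                  # slice: project into channels
        return np.sort(z, axis=0)  # sort each channel along the sequence axis

    seq_len, d_model = 128, 64
    x = np.random.randn(seq_len, d_model)
    W = np.random.randn(d_model, d_model) / np.sqrt(d_model)
    out = slice_sort(x, W)         # (seq_len, d_model); O(n log n) per channel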
arXiv Detail & Related papers (2023-10-26T14:43:07Z)
- An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
- Deep Unrolling for Nonconvex Robust Principal Component Analysis [75.32013242448151]
We design algorithms for Robust Principal Component Analysis (RPCA), which consists in decomposing a matrix into the sum of a low-rank matrix and a sparse matrix.
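A minimal sketch of that decomposition, not the paper's learned (deep-unrolled) solver, alternates a hard rank truncation with element-wise soft-thresholding; the rank, threshold, and iteration count are arbitrary assumptions.

    import numpy as np

    def rpca_sketch(M, rank=2, sparse_thresh=0.1, iters=50):
        """Split M into L (low-rank) + S (sparse) by naive alternating projections."""
        L = np.zeros_like(M)
        S = np.zeros_like(M)
        for _ in range(iters):
            # Low-rank step: best rank-`rank` approximation of the residual M - S.
            U, sig, Vt = np.linalg.svd(M - S, full_matrices=False)
            L = (U[:, :rank] * sig[:rank]) @ Vt[:rank]
            # Sparse step: soft-threshold the residual M - L.
            R = M - L
            S = np.sign(R) * np.maximum(np.abs(R) - sparse_thresh, 0.0)
        return L, S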
arXiv Detail & Related papers (2023-07-12T03:48:26Z)
- Mode-wise Principal Subspace Pursuit and Matrix Spiked Covariance Model [13.082805815235975]
We introduce a novel framework called Mode-wise Principal Subspace Pursuit (MOP-UP) to extract hidden variations in both the row and column dimensions for matrix data.
The effectiveness and practical merits of the proposed framework are demonstrated through experiments on both simulated and real datasets.
arXiv Detail & Related papers (2023-07-02T13:59:47Z)
- Classification of BCI-EEG based on augmented covariance matrix [0.0]
We propose a new framework based on the augmented covariance extracted from an autoregressive model to improve motor imagery classification.
We will test our approach on several datasets and several subjects using the MOABB framework.
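As a rough illustration of the augmented-covariance idea, and not necessarily the paper's exact procedure, the sketch below stacks time-lagged copies of a multichannel signal (an autoregressive-style embedding) and takes the covariance of the stacked signal; the lag order is an assumption.

    import numpy as np

    def augmented_covariance(X, order=3):
        """X: (channels, samples). Stack `order` lagged copies, then take covariance."""
        C, T = X.shape
        lagged = np.concatenate(
            [X[:, k:T - order + k + 1] for k in range(order)], axis=0)
        lagged = lagged - lagged.mean(axis=1, keepdims=True)
        return lagged @ lagged.T / lagged.shape[1]   # (C*order, C*order)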
arXiv Detail & Related papers (2023-02-09T09:04:25Z)
- Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences [52.6022911513076]
Transformer-based models are not efficient in processing long sequences due to the quadratic space and time complexity of the self-attention modules.
Linformer and Informer were proposed to reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection, respectively.
Based on the theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve the accuracy of matrix approximation to self-attention.
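The sketch below shows the row-selection flavor of such approximations in its simplest form: each query attends to a random subset of key/value rows. It illustrates the general idea only and is not Skeinformer's actual algorithm; the sample size and uniform sampling are assumptions.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def subsampled_attention(q, k, v, m=64, rng=None):
        """q, k, v: (seq_len, d); attends to m sampled rows instead of all seq_len."""
        rng = np.random.default_rng(0) if rng is None else rng
        n, d = k.shape
        idx = rng.choice(n, size=min(m, n), replace=False)
        scores = q @ k[idx].T / np.sqrt(d)   # (n, m) instead of (n, n)
        return softmax(scores) @ v[idx]      # (n, d)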
arXiv Detail & Related papers (2021-12-10T06:58:05Z)
- ORCHARD: A Benchmark For Measuring Systematic Generalization of Multi-Hierarchical Reasoning [8.004425059996963]
We show that Transformer and LSTM models surprisingly fail in systematic generalization.
We also show that with increased references between hierarchies, Transformer performs no better than random.
arXiv Detail & Related papers (2021-11-28T03:11:37Z)
- Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model [58.17021225930069]
We explain the rationale of the Vision Transformer by analogy with the proven, practical Evolutionary Algorithm (EA).
We propose a more efficient EAT model, and design task-related heads to deal with different tasks more flexibly.
Our approach achieves state-of-the-art results on the ImageNet classification task compared with recent vision transformer works.
arXiv Detail & Related papers (2021-05-31T16:20:03Z)
- Multi-View Spectral Clustering with High-Order Optimal Neighborhood Laplacian Matrix [57.11971786407279]
Multi-view spectral clustering can effectively reveal the intrinsic cluster structure among data.
This paper proposes a multi-view spectral clustering algorithm that learns a high-order optimal neighborhood Laplacian matrix.
Our proposed algorithm generates the optimal Laplacian matrix by searching the neighborhood of a linear combination of both first-order and high-order base Laplacian matrices.
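As a loose illustration, not the paper's optimization procedure, base Laplacians can be built from an affinity matrix and its powers (higher-order neighborhoods) and then linearly combined; the fixed, equal weights below stand in for the combination that the paper actually learns.

    import numpy as np

    def high_order_laplacians(A, max_order=2):
        """A: (n, n) affinity matrix for one view. Laplacians of A, A^2, ..., A^max_order."""
        Ls, P = [], np.eye(A.shape[0])
        for _ in range(max_order):
            P = P @ A                                 # higher-order affinity A^k
            Ls.append(np.diag(P.sum(axis=1)) - P)     # its graph Laplacian
        return Ls

    def combined_laplacian(A, max_order=2, weights=None):
        Ls = high_order_laplacians(A, max_order)
        weights = np.ones(len(Ls)) / len(Ls) if weights is None else weights
        return sum(w * L for w, L in zip(weights, Ls))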
arXiv Detail & Related papers (2020-08-31T12:28:40Z)
- Hyperparameter optimization with REINFORCE and Transformers [2.1404235519012076]
Reinforcement Learning has yielded promising results for Neural Architecture Search (NAS).
We demonstrate how its performance can be improved by using a simplified Transformer block to model the policy network.
arXiv Detail & Related papers (2020-06-01T13:35:48Z)