Higher Order Transformers: Efficient Attention Mechanism for Tensor Structured Data
- URL: http://arxiv.org/abs/2412.02919v1
- Date: Wed, 04 Dec 2024 00:10:47 GMT
- Title: Higher Order Transformers: Efficient Attention Mechanism for Tensor Structured Data
- Authors: Soroush Omranpour, Guillaume Rabusseau, Reihaneh Rabbany
- Abstract summary: Higher-Order Transformers (HOT) are designed to process data with more than two axes, i.e. higher-order tensors. To address the computational challenges associated with high-order tensor attention, we introduce a novel Kronecker factorized attention mechanism. We validate the effectiveness of HOT on two high-dimensional tasks: multivariate time series forecasting and 3D medical image classification.
- Score: 10.327160288730125
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformers are now ubiquitous for sequence modeling tasks, but their extension to multi-dimensional data remains a challenge due to the quadratic cost of the attention mechanism. In this paper, we propose Higher-Order Transformers (HOT), a novel architecture designed to efficiently process data with more than two axes, i.e. higher-order tensors. To address the computational challenges associated with high-order tensor attention, we introduce a novel Kronecker factorized attention mechanism that reduces the attention cost to quadratic in each axis' dimension, rather than quadratic in the total size of the input tensor. To further enhance efficiency, HOT leverages kernelized attention, reducing the complexity to linear. This strategy maintains the model's expressiveness while enabling scalable attention computation. We validate the effectiveness of HOT on two high-dimensional tasks: multivariate time series forecasting and 3D medical image classification. Experimental results demonstrate that HOT achieves competitive performance while significantly improving computational efficiency, showcasing its potential for tackling a wide range of complex, multi-dimensional data.
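To make the abstract's two efficiency ideas concrete, here is a rough sketch (not the authors' implementation) for a third-order input of shape (time, variables, features): softmax attention is applied along one axis at a time, and composing the per-axis steps amounts to applying the Kronecker product of the two small attention matrices, so the cost is quadratic in each axis length rather than in their product. A second variant replaces softmax with an elu+1 feature map to show how kernelization makes each per-axis step linear. The shapes, identity Q/K/V projections, feature map, and axis composition are illustrative assumptions.

```python
# Hypothetical sketch of axis-factorized attention for a 3rd-order input
# X of shape (T, V, D): T time steps, V variables, D features per cell.
# Shapes, identity projections, and the elu+1 feature map are assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, axis):
    """Softmax attention along a single axis: quadratic in that axis only."""
    x = np.moveaxis(x, axis, -2)                    # (..., N_axis, D)
    q = k = v = x                                   # identity projections for brevity
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    out = softmax(scores) @ v
    return np.moveaxis(out, -2, axis)

def linear_axis_attention(x, axis):
    """Kernelized variant phi(Q)(phi(K)^T V): linear in the axis length."""
    x = np.moveaxis(x, axis, -2)
    phi = lambda t: np.where(t > 0, t + 1.0, np.exp(t))   # elu(t) + 1 > 0
    q, k, v = phi(x), phi(x), x
    kv = k.swapaxes(-1, -2) @ v                            # (..., D, D)
    norm = q @ k.sum(axis=-2, keepdims=True).swapaxes(-1, -2)
    out = (q @ kv) / (norm + 1e-6)
    return np.moveaxis(out, -2, axis)

X = np.random.randn(96, 7, 64)                      # (time, variables, features)
# Attending over time, then over variables, applies the Kronecker product of
# the two small attention matrices: cost O(T^2 + V^2) per cell, not O((T*V)^2).
Y = axis_attention(axis_attention(X, 0), 1)
Y_lin = linear_axis_attention(linear_axis_attention(X, 0), 1)
print(Y.shape, Y_lin.shape)                          # (96, 7, 64) (96, 7, 64)
```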
Related papers
- Efficient Dynamic Attention 3D Convolution for Hyperspectral Image Classification [12.168520751389622]
This paper proposes a dynamic attention convolution design based on an improved 3D-DenseNet model.
The design employs multiple parallel convolutional kernels instead of a single kernel and assigns dynamic attention weights to these parallel convolutions.
The proposed method demonstrates superior performance in both inference speed and accuracy, outperforming mainstream hyperspectral image classification methods on the IN, UP, and KSC datasets.
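As a rough illustration of the dynamic-kernel idea described above (not this paper's 3D-DenseNet design), the sketch below keeps several candidate 3D kernels in parallel, computes input-conditioned attention weights over them with a softmax gate, and convolves with the resulting mixture; the kernel count, kernel size, and the pooling-based gate are assumptions.

```python
# Illustrative dynamic (input-conditioned) convolution: candidate kernels are
# mixed by attention weights derived from the input, then applied once.
import numpy as np
from scipy.ndimage import convolve

rng = np.random.default_rng(0)
K = 4                                        # number of parallel candidate kernels (assumed)
kernels = rng.standard_normal((K, 3, 3, 3))  # K candidate 3x3x3 kernels
gate_w = rng.standard_normal(K)              # toy gating parameters

def dynamic_conv3d(volume):
    pooled = volume.mean()                             # global average pooling (scalar here)
    logits = gate_w * pooled
    attn = np.exp(logits - logits.max())
    attn = attn / attn.sum()                           # softmax attention over kernels
    mixed = np.tensordot(attn, kernels, axes=1)        # (3, 3, 3) input-dependent kernel
    return convolve(volume, mixed, mode='nearest')

x = rng.standard_normal((16, 16, 16))                  # a single-channel 3D patch
y = dynamic_conv3d(x)
print(y.shape)                                          # (16, 16, 16)
```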
arXiv Detail & Related papers (2025-03-30T15:12:23Z)
- iFlame: Interleaving Full and Linear Attention for Efficient Mesh Generation [49.8026360054331]
iFlame is a novel transformer-based network architecture for mesh generation.
We propose an interleaving autoregressive mesh generation framework that combines the efficiency of linear attention with the expressive power of full attention mechanisms.
Our results indicate that the proposed interleaving framework effectively balances computational efficiency and generative performance.
arXiv Detail & Related papers (2025-03-20T19:10:37Z)
- Towards Efficient Large Scale Spatial-Temporal Time Series Forecasting via Improved Inverted Transformers [33.176126599147466]
EiFormer is an improved inverted transformer architecture that maintains the adaptive capabilities of iTransformer while reducing computational complexity to linear scale.
Our key innovation lies in restructuring the attention mechanism to eliminate redundant computations without sacrificing model expressiveness.
Our approach enables practical deployment of transformer-based forecasting in industrial applications where handling time series at scale is essential.
arXiv Detail & Related papers (2025-03-13T20:14:08Z)
- BHViT: Binarized Hybrid Vision Transformer [53.38894971164072]
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNNs).
We propose BHViT, a binarization-friendly hybrid ViT architecture and its full binarization model with the guidance of three important observations.
Our proposed algorithm achieves SOTA performance among binary ViT methods.
arXiv Detail & Related papers (2025-03-04T08:35:01Z)
- CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up [64.38715211969516]
We introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token.
Experiments indicate that, by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity.
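A minimal sketch of the general local-window idea, restricting each query to keys within a fixed radius; the 1D token layout, window radius, and dense-mask implementation are assumptions (the paper's window is defined around image tokens of a diffusion transformer, and an efficient implementation would avoid materializing the full score matrix).

```python
# Illustrative local-window attention: each query attends only to keys within
# a fixed radius; window size and the dense mask are assumptions for clarity.
import numpy as np

def local_window_attention(q, k, v, radius=2):
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                        # (N, N)
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) > radius  # True = outside the window
    scores = np.where(mask, -np.inf, scores)             # block out-of-window pairs
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

tokens = np.random.randn(16, 8)
out = local_window_attention(tokens, tokens, tokens, radius=2)
print(out.shape)   # (16, 8)
```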
arXiv Detail & Related papers (2024-12-20T17:57:09Z)
- Mixture of Hidden-Dimensions Transformer [50.40325486463241]
We study hidden dimension sparsity and observe that trained Transformers utilize only a small fraction of token dimensions.
We propose MoHD (Mixture of Hidden Dimensions), a sparse conditional activation architecture.
It achieves 1.7% higher performance with 50% fewer activation parameters and 3.7% higher performance with a 3x parameter expansion at constant activation cost.
arXiv Detail & Related papers (2024-12-07T13:15:22Z)
- ELASTIC: Efficient Linear Attention for Sequential Interest Compression [5.689306819772134]
State-of-the-art sequential recommendation models heavily rely on transformer's attention mechanism.
We propose ELASTIC, an Efficient Linear Attention for SequenTial Interest Compression.
We conduct extensive experiments on various public datasets and compare it with several strong sequential recommenders.
arXiv Detail & Related papers (2024-08-18T06:41:46Z)
- FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
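Several entries in this list (iFlame, EiFormer, ELASTIC, and this one) rest on the same observation: with a positive feature map, the attention product can be re-associated as phi(Q)(phi(K)^T V), so cost and memory scale with sequence length N rather than N^2. A tiny numpy check of that reordering, with row normalization omitted and the elu+1 feature map assumed for illustration:

```python
# Re-associating the attention product avoids the N x N score matrix.
# Feature map and sizes are illustrative; row normalization is omitted.
import numpy as np

N, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, N, d))
phi = lambda t: np.where(t > 0, t + 1.0, np.exp(t))   # positive feature map

quadratic = (phi(Q) @ phi(K).T) @ V     # builds an N x N matrix: O(N^2 d)
linear    = phi(Q) @ (phi(K).T @ V)     # only d x d intermediates: O(N d^2)

print(np.allclose(quadratic, linear))   # True: same output, very different cost
```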
arXiv Detail & Related papers (2023-08-01T10:37:12Z)
- FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving the classification capacity for the multivariate time series classification task.
It exhibits three merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strengths of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z)
- UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We adopt transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Adaptive Multi-Resolution Attention with Linear Complexity [18.64163036371161]
We propose a novel structure named Adaptive Multi-Resolution Attention (AdaMRA for short).
We leverage a multi-resolution multi-head attention mechanism, enabling attention heads to capture long-range contextual information in a coarse-to-fine fashion.
To facilitate AdaMRA utilization by the scientific community, the code implementation will be made publicly available.
arXiv Detail & Related papers (2021-08-10T23:17:16Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
- Kronecker Attention Networks [69.22257624495899]
We develop Kronecker attention operators (KAOs) that operate on high-order tensor data directly.
Results show that our methods reduce the amount of required computational resources by a factor of hundreds.
arXiv Detail & Related papers (2020-07-16T16:26:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.