Higher Order Transformers: Efficient Attention Mechanism for Tensor Structured Data
- URL: http://arxiv.org/abs/2412.02919v1
- Date: Wed, 04 Dec 2024 00:10:47 GMT
- Title: Higher Order Transformers: Efficient Attention Mechanism for Tensor Structured Data
- Authors: Soroush Omranpour, Guillaume Rabusseau, Reihaneh Rabbany
- Abstract summary: Higher-Order Transformers (HOT) are designed to process data with more than two axes, i.e. higher-order tensors.
To address the computational challenges associated with high-order tensor attention, we introduce a novel Kronecker factorized attention mechanism.
We validate the effectiveness of HOT on two high-dimensional tasks: multivariate time series forecasting and 3D medical image classification.
- Score: 10.327160288730125
- License:
- Abstract: Transformers are now ubiquitous for sequence modeling tasks, but their extension to multi-dimensional data remains a challenge due to the quadratic cost of the attention mechanism. In this paper, we propose Higher-Order Transformers (HOT), a novel architecture designed to efficiently process data with more than two axes, i.e. higher-order tensors. To address the computational challenges associated with high-order tensor attention, we introduce a novel Kronecker factorized attention mechanism that reduces the attention cost to quadratic in each axis' dimension, rather than quadratic in the total size of the input tensor. To further enhance efficiency, HOT leverages kernelized attention, reducing the complexity to linear. This strategy maintains the model's expressiveness while enabling scalable attention computation. We validate the effectiveness of HOT on two high-dimensional tasks, including multivariate time series forecasting, and 3D medical image classification. Experimental results demonstrate that HOT achieves competitive performance while significantly improving computational efficiency, showcasing its potential for tackling a wide range of complex, multi-dimensional data.
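To make the factorization concrete, here is a minimal PyTorch sketch of axis-wise attention on an order-3 input. It is not the authors' implementation: the single-head setup, the per-axis projection matrices, and the choice to sum the two axis outputs are illustrative assumptions, and the paper's exact Kronecker factorization (and its kernelized, linear-time variant) may combine the per-axis factors differently. The sketch only illustrates the cost argument: each call attends along one axis of length N_i at O(N_i^2) cost, instead of O((N_1 N_2)^2) for full attention on the flattened tensor.

```python
import torch
import torch.nn.functional as F

def axis_attention(x, wq, wk, wv, axis):
    """Single-head scaled dot-product attention along one axis of an
    order-3 input x of shape (B, N1, N2, D); `axis` is 1 or 2. The other
    spatial axis is folded into the batch, so positions mix only along `axis`."""
    if axis == 1:
        x = x.permute(0, 2, 1, 3)             # (B, N2, N1, D): attend over N1
    B, M, N, D = x.shape                      # attention length is N
    xf = x.reshape(B * M, N, D)
    q, k, v = xf @ wq, xf @ wk, xf @ wv       # (B*M, N, D) each
    scores = q @ k.transpose(-2, -1) / D ** 0.5
    out = (F.softmax(scores, dim=-1) @ v).reshape(B, M, N, D)
    if axis == 1:
        out = out.permute(0, 2, 1, 3)         # back to (B, N1, N2, D)
    return out

def factorized_attention(x, params):
    """Combine the per-axis attentions by summation (an illustrative choice)."""
    return sum(axis_attention(x, *params[a], axis=a + 1) for a in range(2))

# Toy usage: a (batch=2, N1=16, N2=24, D=32) input tensor.
B, N1, N2, D = 2, 16, 24, 32
x = torch.randn(B, N1, N2, D)
params = [[torch.randn(D, D) / D ** 0.5 for _ in range(3)] for _ in range(2)]
print(factorized_attention(x, params).shape)  # torch.Size([2, 16, 24, 32])
```

A full transformer layer would wrap multi-head projections, normalization, and a feed-forward block around this core, which does not change the factorized attention cost.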
Related papers
- DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training [85.04885553561164]
Diffusion Transformers (DiTs) have shown remarkable performance in modeling and generating high-quality videos.
This paper introduces DSV, a novel framework designed to accelerate and scale the training of video DiTs.
arXiv Detail & Related papers (2025-02-11T14:39:59Z)
- CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up [64.38715211969516]
We introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token.
Experiments indicate that, by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity.
arXiv Detail & Related papers (2024-12-20T17:57:09Z)
- Mixture of Hidden-Dimensions Transformer [50.40325486463241]
We study hidden dimension sparsity and observe that trained Transformers utilize only a small fraction of token dimensions.
We propose MoHD (Mixture of Hidden Dimensions), a sparse conditional activation architecture.
It achieves 1.7% higher performance with 50% fewer activation parameters and 3.7% higher performance with a 3x parameter expansion at constant activation cost.
arXiv Detail & Related papers (2024-12-07T13:15:22Z)
- MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices [24.1144641404561]
We propose a scheme for exact attention inference acceleration on memory-constrained edge accelerators.
We show up to 2.75x speedup and 54% reduction in energy consumption as compared to the state-of-the-art attention fusion method (FLAT) in the edge computing scenario.
arXiv Detail & Related papers (2024-11-20T19:44:26Z)
- ELASTIC: Efficient Linear Attention for Sequential Interest Compression [5.689306819772134]
State-of-the-art sequential recommendation models rely heavily on the transformer's attention mechanism.
We propose ELASTIC, an Efficient Linear Attention for SequenTial Interest Compression.
We conduct extensive experiments on various public datasets and compare it with several strong sequential recommenders.
arXiv Detail & Related papers (2024-08-18T06:41:46Z)
- FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity (a generic kernelized-attention sketch appears after this list).
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z)
- Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences [16.066338004414092]
Diffuser is a new efficient Transformer for sequence-to-sequence modeling.
It incorporates all token interactions within one attention layer while maintaining low computation and memory costs.
We show its ability to approximate full-attention by analyzing the graph expander property from the spectral perspective.
arXiv Detail & Related papers (2022-10-21T08:13:34Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
In this work, we adopt transformers and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art mean accuracy on shape classification and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Adaptive Multi-Resolution Attention with Linear Complexity [18.64163036371161]
We propose a novel structure named Adaptive Multi-Resolution Attention (AdaMRA for short).
We leverage a multi-resolution multi-head attention mechanism, enabling attention heads to capture long-range contextual information in a coarse-to-fine fashion.
To facilitate AdaMRA utilization by the scientific community, the code implementation will be made publicly available.
arXiv Detail & Related papers (2021-08-10T23:17:16Z)
- Kronecker Attention Networks [69.22257624495899]
We develop Kronecker attention operators (KAOs) that operate on high-order tensor data directly.
Results show that our methods reduce the amount of required computational resources by a factor of hundreds.
arXiv Detail & Related papers (2020-07-16T16:26:02Z)
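Several entries above (e.g. FLatten Transformer and ELASTIC), as well as HOT's kernelization step, build on the generic kernelized-attention trick: replacing softmax(Q K^T) V with phi(Q) (phi(K)^T V), so the sequence dimension is contracted once and the cost becomes linear in sequence length. The sketch below shows only that generic trick; the feature map (elu(x) + 1 here, a common choice) and all names are illustrative and do not reproduce any particular paper's method.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """Kernelized attention for q, k, v of shape (batch, seq_len, dim).
    Contracting phi(K)^T with V first keeps the largest intermediate at
    (dim x dim), so cost grows linearly, not quadratically, in seq_len."""
    phi_q = F.elu(q) + 1                          # non-negative feature map
    phi_k = F.elu(k) + 1
    kv = torch.einsum('bnd,bne->bde', phi_k, v)   # (batch, dim, dim)
    z = 1.0 / (torch.einsum('bnd,bd->bn', phi_q, phi_k.sum(dim=1)) + eps)
    return torch.einsum('bnd,bde,bn->bne', phi_q, kv, z)

# Toy usage on a long sequence.
q = k = v = torch.randn(2, 1024, 64)
print(linear_attention(q, k, v).shape)            # torch.Size([2, 1024, 64])
```

For the same inputs, standard attention would materialize a (seq_len x seq_len) score matrix, whereas the largest intermediate here is (dim x dim).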
This list is automatically generated from the titles and abstracts of the papers in this site.