Mixture of Hidden-Dimensions Transformer
- URL: http://arxiv.org/abs/2412.05644v3
- Date: Mon, 16 Dec 2024 12:12:19 GMT
- Title: Mixture of Hidden-Dimensions Transformer
- Authors: Yilong Chen, Junyuan Shang, Zhengyu Zhang, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
- Abstract summary: We study hidden dimension sparsity and observe that trained Transformers utilize only a small fraction of token dimensions.
We propose MoHD (Mixture of Hidden Dimensions), a sparse conditional activation architecture.
It achieves 1.7% higher performance with 50% fewer activation parameters and 3.7% higher performance with a 3x parameter expansion at constant activation cost.
- Score: 50.40325486463241
- License:
- Abstract: Transformer models encounter challenges in scaling hidden dimensions efficiently, as uniformly increasing them inflates computational and memory costs while failing to emphasize the most relevant features for each token. For further understanding, we study hidden dimension sparsity and observe that trained Transformers utilize only a small fraction of token dimensions, revealing an "activation flow" pattern. Notably, there are shared sub-dimensions with sustained activation across multiple consecutive tokens and specialized sub-dimensions uniquely activated for each token. To better model token-relevant sub-dimensions, we propose MoHD (Mixture of Hidden Dimensions), a sparse conditional activation architecture. Particularly, MoHD employs shared sub-dimensions for common token features and a routing mechanism to dynamically activate specialized sub-dimensions. To mitigate potential information loss from sparsity, we design activation scaling and group fusion mechanisms to preserve activation flow. In this way, MoHD expands hidden dimensions with negligible increases in computation or parameters, enabling efficient training and inference while maintaining performance. Evaluations across 10 NLP tasks show that MoHD surpasses Vanilla Transformers in parameter efficiency and task performance. It achieves 1.7% higher performance with 50% fewer activation parameters and 3.7% higher performance with a 3x parameter expansion at constant activation cost. MoHD offers a new perspective on model scaling, showcasing the potential of hidden dimension sparsity to boost efficiency.
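The mechanism described in the abstract (always-on shared sub-dimensions plus a router that activates a few specialized sub-dimension groups per token, with scaling to compensate for sparsity) can be illustrated with a short sketch. The following PyTorch layer is a minimal, illustrative reconstruction from the abstract, not the authors' implementation: the dimensions, the top-k routing rule, and the simple renormalization standing in for the paper's activation-scaling and group-fusion mechanisms are all assumptions.

```python
# Minimal sketch of a MoHD-style sparsely activated hidden-dimension block.
# Illustrative reconstruction from the abstract, not the authors' code:
# group sizes, routing, and scaling/fusion details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoHDFeedForward(nn.Module):
    """A shared slice of the expanded hidden dimension is always active;
    a router picks top-k specialized groups per token."""

    def __init__(self, d_model=512, d_shared=256, n_groups=8, d_group=128, top_k=2):
        super().__init__()
        self.d_shared, self.n_groups = d_shared, n_groups
        self.d_group, self.top_k = d_group, top_k
        d_hidden = d_shared + n_groups * d_group        # expanded hidden dimension
        self.up = nn.Linear(d_model, d_hidden)          # projects into all sub-dimensions
        self.down = nn.Linear(d_hidden, d_model)
        self.router = nn.Linear(d_model, n_groups)      # scores specialized groups per token

    def forward(self, x):                               # x: (batch, seq, d_model)
        h = F.gelu(self.up(x))
        shared, spec = h[..., :self.d_shared], h[..., self.d_shared:]
        spec = spec.view(*x.shape[:-1], self.n_groups, self.d_group)

        # Route each token to its top-k specialized groups and zero out the rest.
        gate = self.router(x).softmax(dim=-1)           # (batch, seq, n_groups)
        top_val, top_idx = gate.topk(self.top_k, dim=-1)
        mask = torch.zeros_like(gate).scatter(-1, top_idx, top_val)

        # Activation scaling: renormalize the kept gates so the retained groups carry
        # the full gate mass (a stand-in for the paper's scaling / group-fusion step).
        mask = mask / mask.sum(dim=-1, keepdim=True).clamp_min(1e-9)
        spec = spec * mask.unsqueeze(-1)                # broadcast over d_group

        h = torch.cat([shared, spec.flatten(-2)], dim=-1)  # reassemble expanded hidden vector
        return self.down(h)


layer = MoHDFeedForward()
print(layer(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```

The point the abstract emphasizes is visible here: the expanded hidden width (d_shared + n_groups * d_group) is only partially activated per token, so per-token computation stays close to that of a much narrower dense layer.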
Related papers
- Higher Order Transformers: Efficient Attention Mechanism for Tensor Structured Data [10.327160288730125]
Higher-Order Transformers (HOT) are designed to process data with more than two axes, i.e. higher-order tensors.
To address the computational challenges associated with high-order tensor attention, we introduce a novel Kronecker factorized attention mechanism.
We validate the effectiveness of HOT on two high-dimensional tasks, including multivariate time series forecasting, and 3D medical image classification.
arXiv Detail & Related papers (2024-12-04T00:10:47Z) - CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting a decoupled learning way and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z) - HASN: Hybrid Attention Separable Network for Efficient Image Super-resolution [5.110892180215454]
Lightweight methods for single image super-resolution have achieved impressive performance under limited hardware resources.
We find that using residual connections after each block increases the model's storage and computational cost.
We use depthwise separable convolutions, fully connected layers, and activation functions as the basic feature extraction modules.
arXiv Detail & Related papers (2024-10-13T14:00:21Z) - CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [73.80247057590519]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer's powerful global context capability.
We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications.
Our model achieves 83.0%/84.1% top-1 accuracy with only 12M/21M parameters on ImageNet-1K.
arXiv Detail & Related papers (2024-08-07T11:33:46Z) - Adapter-X: A Novel General Parameter-Efficient Fine-Tuning Framework for Vision [52.80792724919329]
We introduce a novel framework named Adapter-X to improve fine-tuning in 2D image and 3D point cloud modalities.
It is the first to outperform full fine-tuning in both 2D image and 3D point cloud modalities with significantly fewer parameters, i.e., only 0.20% and 1.88% of original trainable parameters for 2D and 3D classification tasks.
arXiv Detail & Related papers (2024-06-05T08:26:44Z) - Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion [4.716845031095804]
Transformer models can face practical limitations due to their high computational requirements.
Such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by converting parts of the network into equivalent Mixture-of-Experts (MoE) layers.
We demonstrate that the efficiency of the conversion can be significantly enhanced by a proper regularization of the activation sparsity of the base model.
arXiv Detail & Related papers (2023-10-06T16:34:51Z) - SparCA: Sparse Compressed Agglomeration for Feature Extraction and Dimensionality Reduction [0.0]
We propose sparse compressed agglomeration (SparCA) as a novel dimensionality reduction procedure.
SparCA is applicable to a wide range of data types, produces highly interpretable features, and shows compelling performance on downstream supervised learning tasks.
arXiv Detail & Related papers (2023-01-26T13:59:15Z) - The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies the curious phenomenon that, in machine learning models with Transformer architectures, the activation maps are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers (a minimal sketch of this idea follows the related-papers list below).
arXiv Detail & Related papers (2022-10-12T15:25:19Z) - Augmentations: An Insight into their Effectiveness on Convolution Neural Networks [0.0]
The ability to boost a model's robustness depends on two factors, namely the model architecture and the type of augmentations.
This paper evaluates the effect of parameters using 3x3 and depth-wise separable convolutions on different augmentation techniques.
arXiv Detail & Related papers (2022-05-09T06:36:40Z) - Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA), which allows ViTs to model attention at hybrid scales per attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z)
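As referenced in the Lazy Neuron entry above, activation sparsity translates directly into skipped computation in an FFN's down-projection. Below is a minimal NumPy sketch of that idea under the assumption of exactly-zero ReLU activations; the sizes, function names, and random data are illustrative and not taken from any of the listed papers (a trained Transformer would typically show far fewer active dimensions than this random example).

```python
# Minimal NumPy sketch: skip FFN down-projection rows whose hidden activation is zero.
# Illustrative only; not a reference implementation from any of the papers above.
import numpy as np

def ffn_dense(x, w_up, w_down):
    """Standard two-layer FFN on a single token vector."""
    h = np.maximum(x @ w_up, 0.0)            # ReLU hidden activations, shape (d_hidden,)
    return h @ w_down                        # dense down-projection, shape (d_model,)

def ffn_sparse(x, w_up, w_down):
    """Identical output, but only rows of w_down for nonzero activations are multiplied."""
    h = np.maximum(x @ w_up, 0.0)
    active = np.nonzero(h)[0]                # indices of active hidden dimensions
    return h[active] @ w_down[active]        # FLOPs scale with len(active), not d_hidden

rng = np.random.default_rng(0)
d_model, d_hidden = 64, 256
x = rng.standard_normal(d_model)
w_up = rng.standard_normal((d_model, d_hidden))
w_down = rng.standard_normal((d_hidden, d_model))

print(np.allclose(ffn_dense(x, w_up, w_down), ffn_sparse(x, w_up, w_down)))  # True
active = np.count_nonzero(np.maximum(x @ w_up, 0.0))
print(f"active hidden dimensions: {active}/{d_hidden}")  # ~50% here; trained models are far sparser
```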