Does Self-Attention Need Separate Weights in Transformers?
- URL: http://arxiv.org/abs/2412.00359v1
- Date: Sat, 30 Nov 2024 04:46:20 GMT
- Title: Does Self-Attention Need Separate Weights in Transformers?
- Authors: Md Kowsher, Nusrat Jahan Prottasha, Chun-Nam Yu,
- Abstract summary: This work introduces a shared-weight self-attention-based BERT model that learns a single weight matrix for the Key, Value, and Query representations instead of three separate ones.
Experimental results indicate that our shared self-attention method reduces the parameter count of the attention block by 66.53%.
On the GLUE benchmark, the shared-weight self-attention BERT model improves accuracy by 0.38%, 5.81%, and 1.06% over the standard, symmetric, and pairwise attention-based BERT models, respectively.
- Score: 0.884834042985207
- License:
- Abstract: The success of self-attention lies in its ability to capture long-range dependencies and enhance context understanding, but it is limited by its computational complexity and challenges in handling sequential data with inherent directionality. This work introduces a shared weight self-attention-based BERT model that only learns one weight matrix for (Key, Value, and Query) representations instead of three individual matrices for each of them. Our shared weight attention reduces the training parameter size by more than half and training time by around one-tenth. Furthermore, we demonstrate higher prediction accuracy on small tasks of GLUE over the BERT baseline and in particular a generalization power on noisy and out-of-domain data. Experimental results indicate that our shared self-attention method achieves a parameter size reduction of 66.53% in the attention block. In the GLUE dataset, the shared weight self-attention-based BERT model demonstrates accuracy improvements of 0.38%, 5.81%, and 1.06% over the standard, symmetric, and pairwise attention-based BERT models, respectively. The model and source code are available at Anonymous.
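Since the released code is listed only as Anonymous, the sketch below is a hypothetical PyTorch illustration of the central idea: a self-attention block in which the Query, Key, and Value projections are tied to a single learned matrix. The class name, dimensions, and the inclusion of an output projection are assumptions for illustration; the paper's exact formulation (e.g., how directionality or biases are handled) may differ.

```python
import math
import torch
import torch.nn as nn

class SharedQKVSelfAttention(nn.Module):
    """Single-head self-attention where Q, K, and V reuse one projection.

    A standard block learns three matrices W_q, W_k, W_v; here a single W
    is learned and reused, cutting the Q/K/V projection parameters by
    roughly two-thirds.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.shared_proj = nn.Linear(d_model, d_model, bias=False)  # one W for Q, K, V
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        shared = self.shared_proj(x)          # the same projection serves Q, K, and V
        q, k, v = shared, shared, shared
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.out_proj(attn @ v)

# Rough parameter comparison for the Q/K/V projections alone (d_model = 768):
d = 768
separate_qkv = 3 * d * d     # W_q, W_k, W_v in standard attention
shared_qkv = d * d           # single shared W
print(f"projection parameter reduction: {1 - shared_qkv / separate_qkv:.2%}")  # ~66.7%
```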
Related papers
- Head-wise Shareable Attention for Large Language Models [56.92068213969036]
Large Language Models (LLMs) suffer from a huge number of parameters, which restricts their deployment on edge devices.
Weight sharing is one promising solution that encourages weight reuse, effectively reducing memory usage with little performance drop.
We present a perspective on head-wise shareable attention for large language models.
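As a rough illustration of where head-wise sharing saves parameters, here is a hypothetical sketch in which groups of heads reuse one query-projection slice; the paper's actual method decides which heads to share rather than tying fixed groups, so this is only a generic stand-in.

```python
import torch
import torch.nn as nn

class GroupSharedQueryHeads(nn.Module):
    """Query projection in which groups of heads share one weight slice.

    With n_heads heads split into n_groups groups, only n_groups head-sized
    projections are stored; keys and values could be tied the same way.
    """

    def __init__(self, d_model: int, n_heads: int, n_groups: int):
        super().__init__()
        assert n_heads % n_groups == 0
        self.n_heads, self.n_groups = n_heads, n_groups
        self.head_dim = d_model // n_heads
        # One projection per *group*, reused by every head in that group.
        self.q_group = nn.Linear(d_model, self.head_dim * n_groups, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_group(x).view(b, t, self.n_groups, self.head_dim)
        # Repeat each group so downstream code still sees n_heads query heads.
        return q.repeat_interleave(self.n_heads // self.n_groups, dim=2)

x = torch.randn(2, 8, 768)
print(GroupSharedQueryHeads(768, 12, 4)(x).shape)  # torch.Size([2, 8, 12, 64])
```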
arXiv Detail & Related papers (2024-02-19T04:19:36Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
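A minimal sketch of this kind of clustered attention, assuming a simple Lloyd-style assignment of keys to centroids (the paper's clustering procedure may differ): queries attend over cluster centroids and cluster-aggregated values instead of over every token.

```python
import math
import torch

def clustered_attention(q, k, v, n_clusters: int = 16, iters: int = 3):
    """Attend queries over cluster centroids of the keys/values.

    q, k, v: (batch, seq_len, dim). Cost drops from O(n^2) to
    O(n * n_clusters) at the price of a coarser key/value representation.
    """
    b, n, d = k.shape
    # Initialise centroids from evenly spaced tokens, then refine (Lloyd-style).
    centroids = k[:, torch.linspace(0, n - 1, n_clusters).long(), :].clone()
    for _ in range(iters):
        assign = torch.cdist(k, centroids).argmin(dim=-1)            # (b, n)
        onehot = torch.nn.functional.one_hot(assign, n_clusters).float()
        counts = onehot.sum(dim=1).clamp(min=1).unsqueeze(-1)        # (b, c, 1)
        centroids = (onehot.transpose(1, 2) @ k) / counts            # mean of members
    v_clustered = (onehot.transpose(1, 2) @ v) / counts              # aggregate values too
    attn = torch.softmax(q @ centroids.transpose(-2, -1) / math.sqrt(d), dim=-1)
    return attn @ v_clustered                                        # (b, n, d)

q = k = v = torch.randn(2, 128, 64)
print(clustered_attention(q, k, v).shape)  # torch.Size([2, 128, 64])
```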
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- Parameter-Efficient Sparsity for Large Language Models Fine-Tuning [63.321205487234074]
We propose a Parameter-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparse-aware training.
Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) demonstrate PST performs on par or better than previous sparsity methods.
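For orientation only, here is a hedged sketch of sparse-aware fine-tuning with a magnitude-based mask; PST's actual importance criterion and its way of keeping the score matrices parameter-efficient are not reproduced here.

```python
import torch
import torch.nn as nn

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Keep the largest-magnitude fraction of weights; zero the rest."""
    k = int(weight.numel() * (1.0 - sparsity))
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).float()

class SparseLinear(nn.Module):
    """Linear layer fine-tuned under a fixed magnitude-based sparsity mask.

    Pruned entries receive zero gradient through the mask multiplication,
    so only the kept weights are effectively updated.
    """

    def __init__(self, base: nn.Linear, sparsity: float = 0.9):
        super().__init__()
        self.weight = nn.Parameter(base.weight.detach().clone())
        self.register_buffer("mask", magnitude_mask(self.weight.data, sparsity))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.weight * self.mask).t()

layer = SparseLinear(nn.Linear(768, 768, bias=False), sparsity=0.9)
print(f"kept weights: {int(layer.mask.sum())} / {layer.mask.numel()}")
```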
arXiv Detail & Related papers (2022-05-23T02:43:45Z)
- Improved Regularization and Robustness for Fine-tuning in Neural Networks [5.626364462708321]
A widely used algorithm for transfer learning is fine-tuning, where a pre-trained model is fine-tuned on a target task with a small amount of labeled data.
We propose regularized self-labeling, an approach that combines regularization and self-labeling.
Our approach improves baseline methods by 1.76% (on average) for seven image classification tasks and 0.75% for a few-shot classification task.
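A hedged sketch of the general recipe, assuming a simple L2 pull toward the pre-trained weights and confidence-based relabeling; the paper's exact regularizer and label-correction schedule are not reproduced here, and the function and parameter names are invented for illustration.

```python
import torch
import torch.nn.functional as F

def regularized_self_labeling_loss(model, pretrained_params, x, y,
                                   reg_strength=1e-3, confidence=0.9):
    """Cross-entropy with (i) a pull toward pre-trained weights and
    (ii) self-labeling of examples the model is confident about."""
    logits = model(x)
    probs = F.softmax(logits, dim=-1)
    max_prob, pseudo = probs.max(dim=-1)
    # Trust the given label unless the model is very confident it is wrong.
    targets = torch.where((max_prob > confidence) & (pseudo != y), pseudo, y)
    task_loss = F.cross_entropy(logits, targets)
    reg = sum(((p - p0) ** 2).sum()
              for p, p0 in zip(model.parameters(), pretrained_params))
    return task_loss + reg_strength * reg

# Usage with any classifier `model`; snapshot the pre-trained weights first:
# pretrained = [p.detach().clone() for p in model.parameters()]
# loss = regularized_self_labeling_loss(model, pretrained, x_batch, y_batch)
```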
arXiv Detail & Related papers (2021-11-08T15:39:44Z)
- BERT might be Overkill: A Tiny but Effective Biomedical Entity Linker based on Residual Convolutional Neural Networks [41.528797439272175]
We propose an efficient convolutional neural network with residual connections for biomedical entity linking.
Our model achieves comparable or even better linking accuracy than the state-of-the-art BERT-based models.
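For flavor, a hypothetical residual 1-D convolutional encoder of the kind such a linker might use to embed mention and candidate-entity strings before scoring them by similarity; the architecture below is illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class ResidualConvEncoder(nn.Module):
    """Small residual 1-D CNN that turns a token-embedding sequence into a
    fixed-size vector; mention and candidate encodings can then be compared
    with a dot product for entity linking."""

    def __init__(self, dim: int = 128, n_blocks: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv1d(dim, dim, kernel_size=3, padding=1),
                          nn.ReLU(),
                          nn.Conv1d(dim, dim, kernel_size=3, padding=1))
            for _ in range(n_blocks))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -> Conv1d expects (batch, dim, seq_len)
        h = x.transpose(1, 2)
        for block in self.blocks:
            h = h + block(h)          # residual connection
        return h.mean(dim=-1)         # mean-pool over the sequence

enc = ResidualConvEncoder()
mention, entity = torch.randn(4, 16, 128), torch.randn(4, 12, 128)
scores = (enc(mention) * enc(entity)).sum(dim=-1)   # linking scores
print(scores.shape)  # torch.Size([4])
```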
arXiv Detail & Related papers (2021-09-06T04:25:47Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
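The per-sample weighting can be sketched as follows, using a temperature-scaled softmax over the mini-batch losses; this is in the spirit of ABSGD's attentional weighting but not necessarily its exact rule, and the function and parameter names are invented for illustration.

```python
import torch

def absgd_style_step(model, optimizer, loss_fn, x, y, temperature=1.0):
    """One momentum-SGD step with attention-like per-sample weights.

    Samples with larger loss (e.g., from an under-represented class) get a
    larger weight; down-weighting large-loss (possibly noisy) samples would
    instead use the opposite sign of scaling.
    """
    logits = model(x)
    per_sample = loss_fn(logits, y)                      # (batch,) losses, reduction='none'
    weights = torch.softmax(per_sample.detach() / temperature, dim=0)
    loss = (weights * per_sample).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# ce = torch.nn.CrossEntropyLoss(reduction="none")
# absgd_style_step(model, optimizer, ce, x_batch, y_batch, temperature=2.0)
```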
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
- ConvBERT: Improving BERT with Span-based Dynamic Convolution [144.25748617961082]
BERT heavily relies on the global self-attention block and thus suffers a large memory footprint and high computation cost.
We propose a novel span-based dynamic convolution to replace these self-attention heads to directly model local dependencies.
The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning.
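To make the idea concrete, below is a lightweight dynamic-convolution sketch in which each position predicts a small kernel that is applied over a local window; ConvBERT's span-based variant derives the kernel from a local span rather than a single token, which this simplified sketch does not reproduce.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConvHead(nn.Module):
    """Per-position depthwise convolution with input-dependent kernels.

    Each position predicts a softmax-normalised kernel over a local window,
    so local dependencies are modelled without a full n x n attention map.
    """

    def __init__(self, dim: int, kernel_size: int = 5):
        super().__init__()
        self.kernel_size = kernel_size
        self.kernel_proj = nn.Linear(dim, kernel_size)   # token -> its own kernel

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        kernels = torch.softmax(self.kernel_proj(x), dim=-1)                 # (b, t, k)
        pad = self.kernel_size // 2
        # unfold gathers, for every position, its local window of size k.
        windows = F.pad(x, (0, 0, pad, pad)).unfold(1, self.kernel_size, 1)  # (b, t, d, k)
        return torch.einsum("btdk,btk->btd", windows, kernels)

x = torch.randn(2, 16, 64)
print(DynamicConvHead(64)(x).shape)  # torch.Size([2, 16, 64])
```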
arXiv Detail & Related papers (2020-08-06T07:43:19Z)
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model sizes.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
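A condensed sketch of the attention-transfer part of such a recipe, assuming teacher and student expose last-layer attention distributions with the same number of heads; MiniLM also transfers value relations, which is omitted here.

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attn: torch.Tensor,
                                teacher_attn: torch.Tensor) -> torch.Tensor:
    """KL divergence between last-layer self-attention distributions.

    Both tensors: (batch, n_heads, seq_len, seq_len), rows already
    softmax-normalised. The student mimics where the teacher attends,
    so its hidden size and depth are free to be much smaller.
    """
    return F.kl_div(student_attn.clamp_min(1e-9).log(),
                    teacher_attn,
                    reduction="batchmean")

# Usage sketch with random stand-in attention maps:
student_attn = torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)
teacher_attn = torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)
print(attention_distillation_loss(student_attn, teacher_attn).item())
```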