GroupBERT: Enhanced Transformer Architecture with Efficient Grouped
Structures
- URL: http://arxiv.org/abs/2106.05822v1
- Date: Thu, 10 Jun 2021 15:41:53 GMT
- Title: GroupBERT: Enhanced Transformer Architecture with Efficient Grouped
Structures
- Authors: Ivan Chelombiev, Daniel Justus, Douglas Orr, Anastasia Dietrich,
Frithjof Gressmann, Alexandros Koliousis, Carlo Luschi
- Abstract summary: We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
- Score: 57.46093180685175
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Attention based language models have become a critical component in
state-of-the-art natural language processing systems. However, these models
have significant computational requirements, due to long training times, dense
operations and large parameter count. In this work we demonstrate a set of
modifications to the structure of a Transformer layer, producing a more
efficient architecture. First, we add a convolutional module to complement the
self-attention module, decoupling the learning of local and global
interactions. Secondly, we rely on grouped transformations to reduce the
computational cost of dense feed-forward layers and convolutions, while
preserving the expressivity of the model. We apply the resulting architecture
to language representation learning and demonstrate its superior performance
compared to BERT models of different scales. We further highlight its improved
efficiency, both in terms of floating-point operations (FLOPs) and
time-to-train.
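The abstract describes two structural changes: a convolutional module that sits alongside self-attention to handle local interactions, and grouped transformations in the dense projections to cut FLOPs. Below is a minimal PyTorch sketch of such a layer layout; the module names, dimensions, kernel size, group counts and block ordering are illustrative assumptions, not the authors' exact GroupBERT configuration.

```python
# Minimal sketch of a Transformer layer with (1) a grouped feed-forward
# block and (2) a depthwise-convolution module complementing attention.
# All hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn


class GroupedFeedForward(nn.Module):
    """Feed-forward block whose projections are split into groups
    (grouped 1x1 convolutions), reducing FLOPs versus dense layers."""

    def __init__(self, d_model=768, d_ff=3072, groups=4):
        super().__init__()
        self.up = nn.Conv1d(d_model, d_ff, kernel_size=1, groups=groups)
        self.act = nn.GELU()
        self.down = nn.Conv1d(d_ff, d_model, kernel_size=1, groups=groups)

    def forward(self, x):                       # x: (batch, seq, d_model)
        y = x.transpose(1, 2)                   # -> (batch, d_model, seq)
        y = self.down(self.act(self.up(y)))
        return y.transpose(1, 2)


class LocalConvModule(nn.Module):
    """Depthwise (grouped) temporal convolution modelling local token
    interactions, complementing the global self-attention module."""

    def __init__(self, d_model=768, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)

    def forward(self, x):
        return self.conv(x.transpose(1, 2)).transpose(1, 2)


class GroupedTransformerLayer(nn.Module):
    """Pre-norm layer: attention (global), convolution (local), then a
    grouped feed-forward block, each wrapped in a residual connection."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.conv = LocalConvModule(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.ffn = GroupedFeedForward(d_model)

    def forward(self, x):
        a = self.norm1(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + self.conv(self.norm2(x))
        x = x + self.ffn(self.norm3(x))
        return x


if __name__ == "__main__":
    layer = GroupedTransformerLayer()
    tokens = torch.randn(2, 128, 768)           # (batch, seq, d_model)
    print(layer(tokens).shape)                  # torch.Size([2, 128, 768])
```

A grouped (block-diagonal) projection reduces the FLOPs of each linear map by roughly the group count while keeping input and output dimensionality unchanged, which is the efficiency trade-off the abstract refers to.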
Related papers
- Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers [56.264673865476986]
This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models.
SLA improves the model's ability to capture dependencies between high-level abstract features and low-level details.
Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer.
(A hedged sketch of this pattern appears after this list.)
arXiv Detail & Related papers (2024-06-17T07:24:38Z)
- Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model [58.9100867327305]
Large and sparse feed-forward layers (S-FFN) have proven effective in scaling up Transformer model size for pretraining large language models.
We analyzed two major design choices of S-FFN: the memory block (a.k.a. expert) size and the memory block selection method.
We found a simpler selection method -- Avg-K -- that selects blocks through their mean aggregated hidden states, achieving lower perplexity in language model pretraining (see the sketch after this list).
arXiv Detail & Related papers (2023-05-23T12:28:37Z)
- Pre-Training a Graph Recurrent Network for Language Representation [34.4554387894105]
We consider a graph recurrent network for language model pre-training, which builds a graph structure for each sequence with local token-level communications.
We find that our model can generate more diverse outputs with less contextualized feature redundancy than existing attention-based models.
arXiv Detail & Related papers (2022-09-08T14:12:15Z)
- Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages [120.74406230847904]
The first method, TP-Transformer, augments the traditional Transformer architecture with an additional component to represent structure.
The second method imbues structure at the data level by segmenting the data with morphological tokenization.
We find that each of these two approaches allows the network to achieve better performance, but this improvement is dependent on the size of the dataset.
arXiv Detail & Related papers (2022-08-11T22:42:24Z)
- Transformer Grammars: Augmenting Transformer Language Models with Syntactic Inductive Biases at Scale [31.293175512404172]
We introduce Transformer Grammars -- a class of Transformer language models that combine the expressive power, scalability, and strong performance of Transformers with syntactic inductive biases.
We find that Transformer Grammars outperform various strong baselines on multiple syntax-sensitive language modeling evaluation metrics.
arXiv Detail & Related papers (2022-03-01T17:22:31Z)
- Examining Scaling and Transfer of Language Model Architectures for Machine Translation [51.69212730675345]
Language models (LMs) process sequences in a single stack of layers, and encoder-decoder models (EncDec) utilize separate layer stacks for input and output processing.
In machine translation, EncDec has long been the favoured approach, but few studies have investigated the performance of LMs.
arXiv Detail & Related papers (2022-02-01T16:20:15Z)
- Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers [4.899818550820576]
We construct a Legendre Memory Unit based model that introduces a general prior for sequence processing.
We show that our new architecture attains the same accuracy as transformers with 10x fewer tokens.
arXiv Detail & Related papers (2021-10-05T23:20:37Z)
- Retrofitting Structure-aware Transformer Language Model for End Tasks [34.74181162627023]
We consider retrofitting a structure-aware Transformer language model to facilitate end tasks.
A middle-layer structural learning strategy is leveraged for structure integration.
Experimental results show that the retrofitted structure-aware Transformer language model achieves improved perplexity.
arXiv Detail & Related papers (2020-09-16T01:07:07Z)
- Tree-structured Attention with Hierarchical Accumulation [103.47584968330325]
"Hierarchical Accumulation" encodes parse tree structures into self-attention at constant time complexity.
Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German translation task.
arXiv Detail & Related papers (2020-02-19T08:17:00Z)
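As flagged in the Skip-Layer Attention entry above, the following is a minimal, hedged sketch of the described pattern: queries from a given layer attend over keys and values drawn from both the current layer and the output of one preceding layer. The class name, concatenation scheme and hyperparameters are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch: queries attend to keys/values from the current layer
# and from one preceding layer's output (concatenated along sequence).
import torch
import torch.nn as nn


class SkipLayerAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x, prev_layer_out):
        # Keys/values combine the current representation with the output
        # of a preceding layer, so each query sees both sources.
        kv = torch.cat([x, prev_layer_out], dim=1)   # (batch, 2*seq, d_model)
        attn_out, _ = self.attn(x, kv, kv, need_weights=False)
        return x + attn_out                          # residual connection


if __name__ == "__main__":
    layer = SkipLayerAttention()
    x = torch.randn(2, 64, 512)        # current layer input
    prev = torch.randn(2, 64, 512)     # output of one preceding layer
    print(layer(x, prev).shape)        # torch.Size([2, 64, 512])
```

Concatenating the two sources doubles the key/value length for that layer, which is the cost of letting queries bridge abstract (current) and detailed (earlier) features.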
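The S-FFN entry above mentions an Avg-K selection method that picks memory blocks via their mean aggregated hidden states. The sketch below shows one plausible reading under stated assumptions: FFN hidden units are partitioned into blocks, each block is scored by the mean of its pre-activation hidden states, and only the top-k blocks are kept. Block counts, scoring details and the dense masking fallback are illustrative, not the paper's exact formulation.

```python
# Hedged sketch of a block-sparse FFN with Avg-K style block selection.
import torch
import torch.nn as nn


class AvgKSparseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_blocks=8, top_k=2):
        super().__init__()
        assert d_ff % n_blocks == 0
        self.block_size = d_ff // n_blocks
        self.n_blocks, self.top_k = n_blocks, top_k
        self.keys = nn.Linear(d_model, d_ff)     # "memory keys" (up-projection)
        self.values = nn.Linear(d_ff, d_model)   # "memory values" (down-projection)
        self.act = nn.GELU()

    def forward(self, x):                        # x: (batch, seq, d_model)
        h = self.keys(x)                                      # (B, S, d_ff)
        blocks = h.view(*h.shape[:-1], self.n_blocks, self.block_size)
        scores = blocks.mean(dim=-1)                          # (B, S, n_blocks)
        top = scores.topk(self.top_k, dim=-1).indices         # (B, S, top_k)
        mask = torch.zeros_like(scores).scatter_(-1, top, 1.0)
        # Zero out hidden units of unselected blocks; a real sparse
        # implementation would skip their computation entirely.
        h = (blocks * mask.unsqueeze(-1)).reshape_as(h)
        return self.values(self.act(h))


if __name__ == "__main__":
    ffn = AvgKSparseFFN()
    x = torch.randn(2, 16, 512)
    print(ffn(x).shape)                          # torch.Size([2, 16, 512])
```

Because the per-block mean of the pre-activation hidden states equals the dot product of the input with the mean of that block's key vectors, this router needs no extra learned gating parameters.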