How does Architecture Influence the Base Capabilities of Pre-trained
Language Models? A Case Study Based on FFN-Wider Transformer Models
- URL: http://arxiv.org/abs/2403.02436v1
- Date: Mon, 4 Mar 2024 19:33:39 GMT
- Title: How does Architecture Influence the Base Capabilities of Pre-trained
Language Models? A Case Study Based on FFN-Wider Transformer Models
- Authors: Xin Lu, Yanyan Zhao, Bing Qin
- Abstract summary: This work attempts to explain and reverse the decline in base capabilities caused by the architecture of FFN-Wider Transformers.
Through analysis, we found that the contribution ratio of Multi-Head Attention (a combination function) to pre-trained language modeling is a key factor affecting base capabilities.
- Score: 34.24324719229975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models have been shown to possess strong base
capabilities: they not only excel at in-distribution language modeling but
also show powerful abilities in out-of-distribution language modeling, transfer
learning, and few-shot learning. Unlike existing work that focuses on the
influence of scale on base capabilities, our work examines the influence of
architecture on those capabilities. Specifically, we ask: how does architecture
influence the base capabilities of pre-trained language models? In this work, we attempt to
explain and reverse the decline in base capabilities caused by the architecture
of FFN-Wider Transformers, seeking to provide some insights. Through analysis,
we found that the contribution ratio of Multi-Head Attention (a combination
function) to pre-trained language modeling is a key factor affecting base
capabilities. FFN-Wider Transformers reduce the contribution ratio of this
combination function, leading to a decline in base capabilities. We confirmed
this through experiments and proposed a Combination Enhancement Architecture (CEA) to
address the decline in base capabilities of such models. Significantly, we
extended our explanation and CEA to Mixture of Experts (MoE) architecture
Transformers, which also alleviated their decline in base capabilities to some
extent, showing that our work can offer useful guidance for architecture
analysis, architecture improvement, and architecture design.
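The abstract's core claim lends itself to a small illustration. Below is a minimal PyTorch sketch (not the authors' code) of a pre-LN Transformer block with a configurable FFN width, together with a hypothetical proxy for the paper's "contribution ratio": the share of the residual-stream update coming from the attention sublayer versus the FFN sublayer. The `FFNWiderBlock` name, the `ffn_mult` parameter, and the norm-based ratio are all illustrative assumptions, not the paper's actual measurement.

```python
import torch
import torch.nn as nn

class FFNWiderBlock(nn.Module):
    """Toy pre-LN Transformer block with a configurable FFN width.
    Hypothetical sketch, not the authors' implementation."""
    def __init__(self, d_model=64, n_heads=4, ffn_mult=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ffn_mult * d_model),
            nn.GELU(),
            nn.Linear(ffn_mult * d_model, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        ffn_out = self.ffn(self.ln2(x))
        x = x + ffn_out
        # Assumed proxy for the paper's "contribution ratio": the share of
        # the residual update contributed by attention (the combination
        # function) versus the FFN (the transformation function).
        ratio = attn_out.norm() / (attn_out.norm() + ffn_out.norm())
        return x, ratio.item()

x = torch.randn(2, 10, 64)
_, r_std = FFNWiderBlock(ffn_mult=4)(x)    # standard 4x FFN
_, r_wide = FFNWiderBlock(ffn_mult=16)(x)  # "FFN-Wider" variant
print(f"attention share, standard FFN: {r_std:.3f}")
print(f"attention share, wider FFN:    {r_wide:.3f}")
```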
Related papers
- Activator: GLU Activations as The Core Functions of a Vision Transformer [1.3812010983144802]
The Transformer architecture is currently the main driver behind many successes in a variety of tasks addressed by deep learning.
This paper investigates substituting the attention mechanism usually adopted in transformer architectures with gated linear unit (GLU) activations incorporated within a multi-layer perceptron architecture.
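The GLU idea is compact enough to sketch. The following is a generic gated-linear-unit MLP in PyTorch, given for illustration only; it is not the Activator paper's exact block.

```python
import torch
import torch.nn as nn

class GLUMLP(nn.Module):
    """Generic gated-linear-unit MLP: a value path modulated element-wise
    by a gate path. A sketch of the GLU idea, not the paper's module."""
    def __init__(self, d_model=64, d_hidden=256):
        super().__init__()
        self.value = nn.Linear(d_model, d_hidden)
        self.gate = nn.Linear(d_model, d_hidden)
        self.out = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        # GLU: linear "value" projection gated by a sigmoid-activated
        # "gate" projection.
        return self.out(self.value(x) * torch.sigmoid(self.gate(x)))

x = torch.randn(2, 10, 64)
print(GLUMLP()(x).shape)  # torch.Size([2, 10, 64])
```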
arXiv Detail & Related papers (2024-05-24T21:46:52Z) - Enhancing Representations through Heterogeneous Self-Supervised Learning [61.40674648939691]
We propose Heterogeneous Self-Supervised Learning (HSSL), which forces a base model to learn from an auxiliary head whose architecture differs from the base model's.
HSSL endows the base model with new representational characteristics without structural changes.
HSSL is compatible with various self-supervised methods and achieves superior performance on a range of downstream tasks.
arXiv Detail & Related papers (2023-10-08T10:44:05Z) - TaCA: Upgrading Your Visual Foundation Model with Task-agnostic
Compatible Adapter [21.41170708560114]
A growing number of applications based on visual foundation models are emerging.
In situations involving system upgrades, it becomes essential to re-train all downstream modules to adapt to the new foundation model.
We introduce a parameter-efficient and task-agnostic adapter, dubbed TaCA, that facilitates compatibility across distinct foundation models.
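As a rough illustration of the adapter pattern this entry describes, here is a generic residual bottleneck adapter in PyTorch. TaCA's actual architecture and training objective may differ; the module name and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic parameter-efficient adapter (residual bottleneck).
    Illustrates the pattern only: a small trainable module that maps a
    new backbone's features toward the space the old downstream
    modules expect, so they need not be re-trained."""
    def __init__(self, d_model=768, d_bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, features):
        # The residual connection keeps the adapter near identity at init.
        return features + self.up(self.act(self.down(features)))

# Frozen new backbone features -> adapter -> unchanged downstream head.
feats = torch.randn(4, 768)
print(BottleneckAdapter()(feats).shape)  # torch.Size([4, 768])
```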
arXiv Detail & Related papers (2023-06-22T03:00:24Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient
Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
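To make the sharing scheme concrete, here is a crude PyTorch stand-in: each layer's weight is split into a central factor shared across all layers plus a small layer-specific auxiliary factor. A real MPO decomposition produces a chain (tensor train) of local tensors rather than this simple low-rank split, so treat this only as a sketch; all names and sizes are assumptions.

```python
import torch
import torch.nn as nn

d_model, rank, n_layers = 256, 16, 12
central = nn.Linear(d_model, rank, bias=False)  # shared "central" factor

class FactorizedLayer(nn.Module):
    """One layer of the sketch: shared central factor, tiny own factor."""
    def __init__(self, central):
        super().__init__()
        self.central = central                   # same module in every layer
        self.aux = nn.Linear(rank, d_model)      # layer-specific, few params

    def forward(self, x):
        # Effective weight is approximately W_aux @ W_central.
        return torch.relu(self.aux(self.central(x)))

layers = nn.Sequential(*[FactorizedLayer(central) for _ in range(n_layers)])
x = torch.randn(2, d_model)
print(layers(x).shape)  # torch.Size([2, 256])

dense = n_layers * d_model * d_model
shared = d_model * rank + n_layers * (rank * d_model + d_model)
print(f"params: {dense} dense vs {shared} factorized")
```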
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - On the interplay of adversarial robustness and architecture components:
patches, convolution and attention [65.20660287833537]
We study the effect of adversarial training on the interpretability of the learnt features and robustness to unseen threat models.
An ablation from ResNet to ConvNeXt reveals key architectural changes leading to almost 10% higher $\ell_\infty$-robustness.
arXiv Detail & Related papers (2022-09-14T22:02:32Z) - BayesFormer: Transformer with Uncertainty Estimation [31.206243748162553]
We introduce BayesFormer, a Transformer model with dropouts designed by Bayesian theory.
We show improvements across the board: language modeling and classification, long-sequence understanding, machine translation, and acquisition functions for active learning.
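A standard way to realize dropout-based uncertainty estimation is Monte-Carlo dropout: keep dropout active at inference and average several stochastic forward passes. The sketch below shows that generic recipe in PyTorch; it is not BayesFormer's exact formulation.

```python
import torch
import torch.nn as nn

# Monte-Carlo dropout: run the model several times with dropout enabled
# and read predictive uncertainty off the spread of the outputs.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.1),
                      nn.Linear(64, 2))
model.train()  # deliberately leave dropout active at inference time

x = torch.randn(8, 16)
with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(32)])  # (32, 8, 2)
mean, std = samples.mean(0), samples.std(0)
print(mean.shape, std.shape)  # predictive mean and uncertainty per input
```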
arXiv Detail & Related papers (2022-06-02T01:54:58Z) - Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained
Language Models [68.9288651177564]
We present a novel MoE architecture based on matrix product operators (MPO) from quantum many-body physics.
With the decomposed MPO structure, we can reduce the parameters of the original MoE architecture.
Experiments with GPT2 on three well-known downstream natural language datasets show improved performance and efficiency when increasing model capacity.
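For readers unfamiliar with MoE layers, here is a minimal PyTorch mixture-of-experts layer with dense softmax routing. It illustrates the baseline structure only; the paper's contribution is the MPO factorization of the experts, which this sketch does not implement.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Minimal mixture-of-experts layer with dense softmax routing.
    Illustrative only; not the paper's MPO-based architecture."""
    def __init__(self, d_model=64, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x):                        # x: (tokens, d_model)
        gates = self.router(x).softmax(-1)       # (tokens, n_experts)
        out = torch.stack([e(x) for e in self.experts], dim=1)
        # Gate-weighted sum of expert outputs (dense for simplicity;
        # production MoE dispatches each token to its top-k experts only).
        return (gates.unsqueeze(-1) * out).sum(1)

x = torch.randn(10, 64)
print(TinyMoE()(x).shape)  # torch.Size([10, 64])
```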
arXiv Detail & Related papers (2022-03-02T13:44:49Z) - GroupBERT: Enhanced Transformer Architecture with Efficient Grouped
Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
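The attention-plus-convolution pairing is easy to sketch. Below is an illustrative PyTorch block that adds a grouped convolution branch alongside self-attention to decouple local from global interactions; the exact layer ordering and dimensions in GroupBERT may differ.

```python
import torch
import torch.nn as nn

class ConvAttentionBlock(nn.Module):
    """Sketch of GroupBERT's core idea: self-attention for global
    interactions plus a lightweight grouped convolution for local ones.
    Layer details are illustrative, not the paper's exact layout."""
    def __init__(self, d_model=64, n_heads=4, kernel=3, groups=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel,
                              padding=kernel // 2, groups=groups)
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]           # global mixing
        h = self.ln2(x).transpose(1, 2)         # Conv1d expects (B, C, T)
        x = x + self.conv(h).transpose(1, 2)    # local mixing
        return x

x = torch.randn(2, 16, 64)
print(ConvAttentionBlock()(x).shape)  # torch.Size([2, 16, 64])
```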
arXiv Detail & Related papers (2021-06-10T15:41:53Z)