Not all layers are equally as important: Every Layer Counts BERT
- URL: http://arxiv.org/abs/2311.02265v2
- Date: Tue, 7 Nov 2023 21:36:11 GMT
- Title: Not all layers are equally as important: Every Layer Counts BERT
- Authors: Lucas Georges Gabriel Charpentier and David Samuel
- Abstract summary: This paper introduces a novel modification of the transformer architecture, tailored for the data-efficient pretraining of language models.
Our approach allows each transformer layer to select which outputs of previous layers to process.
- Score: 5.121744234312891
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a novel modification of the transformer architecture,
tailored for the data-efficient pretraining of language models. This aspect is
evaluated by participating in the BabyLM challenge, where our solution won both
the strict and strict-small tracks. Our approach allows each transformer layer
to select which outputs of previous layers to process. The empirical results
verify the potential of this simple modification and show that not all layers
are equally as important.
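The layer-selection mechanism described above can be pictured as each transformer layer consuming a learned weighted mixture of all previous layers' outputs. The following is a minimal numpy sketch of that idea only, not the authors' implementation; the softmax weighting and the toy tanh "layers" are assumptions for illustration:

```python
import numpy as np

def layer_input(prev_outputs, alphas):
    """Mix the outputs of all previous layers with learned scalar weights.

    prev_outputs: list of arrays, one per previous layer (incl. embeddings)
    alphas: raw (learnable) weights, one per previous layer
    """
    weights = np.exp(alphas - np.max(alphas))
    weights /= weights.sum()  # softmax over previous layers
    return sum(w * h for w, h in zip(weights, prev_outputs))

# Toy forward pass: each "layer" is a fixed random map, for illustration only.
rng = np.random.default_rng(0)
hidden = [rng.normal(size=(4, 8))]                    # embedding output
alphas = [rng.normal(size=i + 1) for i in range(3)]   # one weight per previous layer
mats = [rng.normal(size=(8, 8)) / 8 for _ in range(3)]
for i, W in enumerate(mats):
    x = layer_input(hidden, alphas[i])  # layer i selects/mixes earlier outputs
    hidden.append(np.tanh(x @ W))       # the layer's own transformation
```

Because the mixing weights are learned per layer, training can effectively down-weight unimportant layers, which is one way to read the paper's title.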
Related papers
- Transformer Layers as Painters [16.43731831488477]
We show that lower and final layers of pretrained transformers differ from middle layers, but that middle layers have a surprising amount of uniformity.
We also show that some classes of problems are robust to skipping layers, running the layers in an order different from how they were trained, or running the layers in parallel.
Our observations suggest that even frozen pretrained models may gracefully trade accuracy for latency by skipping layers or running layers in parallel.
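The accuracy-for-latency trade suggested above amounts to simply omitting selected layers from the forward pass. A toy numpy illustration (the random layer functions here are stand-ins, not a pretrained transformer):

```python
import numpy as np

def forward(x, layers, skip=()):
    """Run a stack of layer functions, optionally skipping some by index."""
    for i, layer in enumerate(layers):
        if i in skip:
            continue  # skipped layer: hidden state passes through unchanged
        x = layer(x)
    return x

rng = np.random.default_rng(1)
# Each lambda captures its own random weight matrix via the default argument.
layers = [lambda h, W=rng.normal(size=(8, 8)) / 8: np.tanh(h @ W) for _ in range(6)]
x = rng.normal(size=(2, 8))
full = forward(x, layers)                # all six layers
fast = forward(x, layers, skip={2, 3})   # skip two middle layers for latency
```

The two outputs differ, and how much accuracy that difference costs is exactly what the paper measures empirically.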
arXiv Detail & Related papers (2024-07-12T14:31:05Z) - Entropy Guided Extrapolative Decoding to Improve Factuality in Large Language Models [55.45444773200529]
Large language models (LLMs) exhibit impressive natural language capabilities but suffer from hallucination.
Recent work has focused on decoding techniques to improve factuality during inference.
arXiv Detail & Related papers (2024-04-14T19:45:35Z) - Jump to Conclusions: Short-Cutting Transformers With Linear Transformations [60.37563766047492]
Transformer-based language models create hidden representations of their inputs at every layer, but only use final-layer representations for prediction.
This obscures the internal decision-making process of the model and the utility of its intermediate representations.
We suggest a simple method for casting intermediate representations into final-layer form, using linear transformations.
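One way to picture such a linear cast is to fit, by least squares, a matrix that maps hidden states from a middle layer to final-layer states. The sketch below uses synthetic data and ordinary least squares as assumptions for illustration; it is not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(2)
H_mid = rng.normal(size=(100, 16))   # hidden states at some middle layer
A_true = rng.normal(size=(16, 16))   # synthetic ground-truth relation
H_final = H_mid @ A_true             # stand-in for final-layer states

# Fit the linear "short-cut" A so that H_mid @ A approximates H_final.
A, *_ = np.linalg.lstsq(H_mid, H_final, rcond=None)
pred = H_mid @ A                     # "jumped" final representations
```

With such a map in hand, predictions can be read off a middle layer without running the remaining layers.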
arXiv Detail & Related papers (2023-03-16T16:10:16Z) - On Robust Learning from Noisy Labels: A Permutation Layer Approach [53.798757734297986]
This paper introduces a permutation layer learning approach termed PermLL to dynamically calibrate the training process of a deep neural network (DNN).
We provide two variants of PermLL in this paper: one applies the permutation layer to the model's prediction, while the other applies it directly to the given noisy label.
We validate PermLL experimentally and show that it achieves state-of-the-art performance on both real and synthetic datasets.
arXiv Detail & Related papers (2022-11-29T03:01:48Z) - Unifying Global-Local Representations in Salient Object Detection with Transformer [55.23033277636774]
We introduce a new attention-based encoder, vision transformer, into salient object detection.
With the global view in very shallow layers, the transformer encoder preserves more local representations.
Our method significantly outperforms other FCN-based and transformer-based methods on five benchmarks.
arXiv Detail & Related papers (2021-08-05T17:51:32Z) - ProgressiveSpinalNet architecture for FC layers [0.0]
In deep learning models, the fully connected (FC) layer plays the most important role in classifying the input based on features learned by previous layers.
This paper aims to reduce these large numbers of parameters significantly with improved performance.
The motivation is inspired by SpinalNet and other biologically motivated architectures.
arXiv Detail & Related papers (2021-03-21T11:54:50Z) - IOT: Instance-wise Layer Reordering for Transformer Structures [173.39918590438245]
We break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure.
Our method can also be applied to other architectures beyond Transformer.
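Instance-wise layer reordering can be sketched as applying the same set of layer functions in a per-instance order. A toy numpy illustration (the random layers and the chosen orders are hypothetical, not the paper's learned ordering):

```python
import numpy as np

def forward_in_order(x, layers, order):
    """Apply the same set of layers in an instance-specific order."""
    for i in order:
        x = layers[i](x)
    return x

rng = np.random.default_rng(3)
# Each lambda captures its own random weight matrix via the default argument.
layers = [lambda h, W=rng.normal(size=(8, 8)) / 8: np.tanh(h @ W) for _ in range(3)]
x = rng.normal(size=(8,))
y_default = forward_in_order(x, layers, (0, 1, 2))    # the usual fixed order
y_reordered = forward_in_order(x, layers, (2, 0, 1))  # a per-instance order
```

Since layer composition does not commute in general, different orders yield different outputs, which is what makes a learned per-instance ordering meaningful.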
arXiv Detail & Related papers (2021-03-05T03:44:42Z) - Of Non-Linearity and Commutativity in BERT [8.295319152986316]
We study the interactions between layers in BERT and show that, while the layers exhibit some hierarchical structure, they extract features in a fuzzy manner.
Our results suggest that BERT has an inductive bias towards layer commutativity, which we find is mainly due to the skip connections.
arXiv Detail & Related papers (2021-01-12T15:29:38Z) - BERT's output layer recognizes all hidden layers? Some Intriguing Phenomena and a simple way to boost BERT [53.63288887672302]
Bidirectional Encoder Representations from Transformers (BERT) has achieved tremendous success in many natural language processing (NLP) tasks.
We find, surprisingly, that the output layer of BERT can reconstruct the input sentence when directly given each hidden layer of BERT as input.
We propose a simple method to boost the performance of BERT.
arXiv Detail & Related papers (2020-01-25T13:35:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.