Undivided Attention: Are Intermediate Layers Necessary for BERT?
- URL: http://arxiv.org/abs/2012.11881v2
- Date: Tue, 4 Apr 2023 21:10:29 GMT
- Title: Undivided Attention: Are Intermediate Layers Necessary for BERT?
- Authors: Sharath Nittur Sridhar, Anthony Sarah
- Abstract summary: We investigate the importance of intermediate layers on the overall network performance of downstream tasks.
We show that reducing the number of intermediate layers and modifying the architecture for BERT-BASE results in minimal loss in fine-tuning accuracy for downstream tasks.
- Score: 2.8935588665357077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent times, BERT-based models have been extremely successful in solving
a variety of natural language processing (NLP) tasks such as reading
comprehension, natural language inference, sentiment analysis, etc. All
BERT-based architectures have a self-attention block followed by a block of
intermediate layers as the basic building component. However, a strong
justification for the inclusion of these intermediate layers remains missing in
the literature. In this work we investigate the importance of intermediate
layers on the overall network performance of downstream tasks. We show that
reducing the number of intermediate layers and modifying the architecture for
BERT-BASE results in minimal loss in fine-tuning accuracy for downstream tasks
while decreasing the number of parameters and training time of the model.
Additionally, we use centered kernel alignment and probing linear classifiers
to gain insight into our architectural modifications and justify that removal
of intermediate layers has little impact on the fine-tuned accuracy.
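For readers unfamiliar with the analysis tools named above, the following is a minimal numpy sketch of linear centered kernel alignment (CKA) between the activations of two layers on the same batch; the function name, shapes, and random example data are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between two activation matrices.

    X: (n_samples, d1) activations from one layer.
    Y: (n_samples, d2) activations from another layer.
    Returns a scalar in [0, 1]; 1 means the representations agree
    up to an orthogonal transform and isotropic scaling.
    """
    # Center each feature dimension across the batch.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)

    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

# Toy usage with assumed sizes: 128 token vectors, hidden size 768.
rng = np.random.default_rng(0)
layer_a = rng.normal(size=(128, 768))
layer_b = rng.normal(size=(128, 768))
print(linear_cka(layer_a, layer_a))  # 1.0: identical representations
print(linear_cka(layer_a, layer_b))  # < 1.0 for unrelated random activations
```

Comparing layer activations with a measure like this is one way to argue, as the abstract does, that removing intermediate layers changes the learned representations only slightly.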
Related papers
- SDF-TopoNet: A Two-Stage Framework for Tubular Structure Segmentation via SDF Pre-training and Topology-Aware Fine-Tuning [2.3436632098950456]
A key challenge is ensuring topological correctness while maintaining computational efficiency.
We propose SDF-TopoNet, an improved topology-aware segmentation framework.
We show that SDF-TopoNet outperforms existing methods in both topological accuracy and quantitative segmentation metrics.
arXiv Detail & Related papers (2025-03-14T23:54:38Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Neural Architecture Search for Sentence Classification with BERT [4.862490782515929]
We perform an AutoML search to find architectures that outperform the current single-layer classifier at only a small compute cost.
We validate our classification architecture on a variety of NLP benchmarks from the GLUE dataset.
arXiv Detail & Related papers (2024-03-27T13:25:43Z) - Topic-driven Distant Supervision Framework for Macro-level Discourse
Parsing [72.14449502499535]
The task of analyzing the internal rhetorical structure of texts is a challenging problem in natural language processing.
Despite the recent advances in neural models, the lack of large-scale, high-quality corpora for training remains a major obstacle.
Recent studies have attempted to overcome this limitation by using distant supervision.
arXiv Detail & Related papers (2023-05-23T07:13:51Z) - Masked Image Modeling with Local Multi-Scale Reconstruction [54.91442074100597]
Masked Image Modeling (MIM) achieves outstanding success in self-supervised representation learning.
Existing MIM models conduct the reconstruction task only at the top layer of the encoder.
We design local multi-scale reconstruction, where the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals, respectively.
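As a rough illustration of the idea (not the paper's decoder or loss design), the sketch below sums reconstruction losses from several depths, matching a fine-scale prediction against the raw target and coarse-scale predictions against average-pooled targets; the scale factors, shapes, and L2 loss are assumptions.

```python
import numpy as np

def avg_pool(img, s):
    """Average-pool an (H, W, C) image by an integer factor s."""
    h, w, c = img.shape
    img = img[: h - h % s, : w - w % s]
    return img.reshape(h // s, s, w // s, s, c).mean(axis=(1, 3))

def local_multiscale_loss(predictions, image):
    """Sum of per-layer reconstruction losses at different scales.

    predictions: dict mapping a downsampling factor (1 = fine, larger =
    coarser) to that layer's decoded reconstruction of the masked image.
    Lower layers would predict the fine scales, upper layers the coarse ones.
    """
    total = 0.0
    for scale, pred in predictions.items():
        target = avg_pool(image, scale) if scale > 1 else image
        total += np.mean((pred - target) ** 2)   # simple L2 reconstruction
    return total

# Toy usage: a 32x32x3 "image" with fine (x1) and coarse (x4) predictions.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
preds = {1: rng.random((32, 32, 3)), 4: rng.random((8, 8, 3))}
print(local_multiscale_loss(preds, image))
```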
arXiv Detail & Related papers (2023-03-09T13:42:04Z) - Pretraining Without Attention [114.99187017618408]
This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs).
BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation.
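For intuition only, the sketch below runs a generic discrete linear state-space recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t over a token sequence in both directions; it is a toy stand-in under assumed shapes, not the BiGS routing layer itself.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a discrete linear state-space model over a sequence.
    x: (T, d_in), A: (d_state, d_state), B: (d_state, d_in), C: (d_out, d_state).
    Returns y: (T, d_out) with h_t = A h_{t-1} + B x_t and y_t = C h_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)

def bidirectional_ssm(x, A, B, C):
    """Concatenate a forward and a backward scan, mimicking how a
    bidirectional (BERT-style) model sees context on both sides."""
    fwd = ssm_scan(x, A, B, C)
    bwd = ssm_scan(x[::-1], A, B, C)[::-1]
    return np.concatenate([fwd, bwd], axis=-1)

# Toy usage with assumed sizes: 8 tokens, 16-dim embeddings, 4-dim state.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
A = 0.9 * np.eye(4)                 # stable state transition
B = rng.normal(size=(4, 16)) * 0.1
C = rng.normal(size=(8, 4)) * 0.1
print(bidirectional_ssm(x, A, B, C).shape)  # (8, 16)
```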
arXiv Detail & Related papers (2022-12-20T18:50:08Z) - Don't Judge a Language Model by Its Last Layer: Contrastive Learning
with Layer-Wise Attention Pooling [6.501126898523172]
Recent pre-trained language models (PLMs) have achieved great success on many natural language processing tasks by learning linguistic features and contextualized sentence representations.
This paper introduces the attention-based pooling strategy, which enables the model to preserve layer-wise signals captured in each layer and learn digested linguistic features for downstream tasks.
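A minimal sketch of one way such layer-wise pooling can look: score each layer's pooled vector with a learned query and take a softmax-weighted sum, so the sentence embedding mixes all layers instead of keeping only the last one. The query parameterization and shapes are assumptions, not the paper's exact module.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layerwise_attention_pool(layer_states, query):
    """Pool sentence representations across encoder layers.

    layer_states: (L, d) -- one pooled vector (e.g., [CLS]) per layer.
    query: (d,)          -- a learned query vector scoring each layer.
    Returns a (d,) sentence embedding built from all layers.
    """
    scores = layer_states @ query        # (L,) relevance of each layer
    weights = softmax(scores)            # attention weights over layers
    return weights @ layer_states        # weighted sum of layer vectors

# Toy usage with assumed sizes: 12 layers, hidden size 768.
rng = np.random.default_rng(0)
states = rng.normal(size=(12, 768))
q = rng.normal(size=768)
print(layerwise_attention_pool(states, q).shape)  # (768,)
```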
arXiv Detail & Related papers (2022-09-13T13:09:49Z) - The Topological BERT: Transforming Attention into Topology for Natural
Language Processing [0.0]
This paper introduces a text classifier using topological data analysis.
We use BERT's attention maps transformed into attention graphs as the only input to that classifier.
The model can solve tasks such as distinguishing spam from ham messages, recognizing whether a sentence is grammatically correct, or evaluating a movie review as negative or positive.
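To make the "attention maps as graphs" step concrete, here is a hedged sketch that symmetrizes one head's attention matrix and thresholds it into a weighted graph; the threshold value and edge-list representation are assumptions, and the topological-data-analysis classifier that would consume the graph is not shown.

```python
import numpy as np

def attention_to_graph(attn, threshold=0.05):
    """Turn one attention head's map into an undirected weighted graph.

    attn: (T, T) row-stochastic attention matrix for a sentence of T tokens.
    Returns (adjacency, edges): a symmetric (T, T) weight matrix and a list
    of (i, j, weight) edges above the threshold.  The 0.05 cutoff is an
    illustrative assumption, not a value taken from the paper.
    """
    sym = 0.5 * (attn + attn.T)          # symmetrize the directed weights
    np.fill_diagonal(sym, 0.0)           # drop self-loops
    adjacency = np.where(sym >= threshold, sym, 0.0)
    edges = [(i, j, adjacency[i, j])
             for i in range(len(adjacency))
             for j in range(i + 1, len(adjacency))
             if adjacency[i, j] > 0]
    return adjacency, edges

# Toy usage: a random row-stochastic "attention map" over 6 tokens.
rng = np.random.default_rng(0)
raw = rng.random((6, 6))
attn = raw / raw.sum(axis=1, keepdims=True)
_, edges = attention_to_graph(attn, threshold=0.15)
print(len(edges), "edges kept")
```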
arXiv Detail & Related papers (2022-06-30T11:25:31Z) - TrimBERT: Tailoring BERT for Trade-offs [6.068076825261616]
We show that reducing the number of intermediate layers in BERT-Base results in minimal loss in fine-tuning accuracy on downstream tasks.
We further mitigate two key bottlenecks by replacing all softmax operations in the self-attention layers with a computationally simpler alternative.
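The summary does not say which alternative replaces softmax, so the sketch below only illustrates the general idea of swapping the attention normalizer: a ReLU-plus-row-normalization function stands in as one plausible (hypothetical) replacement, not TrimBERT's actual choice.

```python
import numpy as np

def attention_weights(q, k, normalizer):
    """Scaled dot-product attention scores with a pluggable normalizer.
    q, k: (T, d_head).  `normalizer` maps raw (T, T) scores to weights."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    return normalizer(q @ k.T * scale)

def softmax_rows(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def relu_norm_rows(s, eps=1e-6):
    """One softmax-free alternative: clamp negatives, then normalize each
    row to (approximately) sum to 1.  Shown only to illustrate swapping the
    normalizer; not claimed to be TrimBERT's exact choice."""
    r = np.maximum(s, 0.0)
    return r / (r.sum(axis=-1, keepdims=True) + eps)

# Both normalizers yield (near) row-stochastic attention weights.
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 64))
k = rng.normal(size=(5, 64))
print(attention_weights(q, k, softmax_rows).sum(axis=-1))    # all ~1.0
print(attention_weights(q, k, relu_norm_rows).sum(axis=-1))  # ~1.0 (0 if a row is all-negative)
```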
arXiv Detail & Related papers (2022-02-24T23:06:29Z) - A Simple Baseline for Semi-supervised Semantic Segmentation with Strong
Data Augmentation [74.8791451327354]
We propose a simple yet effective semi-supervised learning framework for semantic segmentation.
A set of simple design and training techniques can collectively improve the performance of semi-supervised semantic segmentation significantly.
Our method achieves state-of-the-art results in the semi-supervised settings on the Cityscapes and Pascal VOC datasets.
arXiv Detail & Related papers (2021-04-15T06:01:39Z) - Prior Guided Feature Enrichment Network for Few-Shot Segmentation [64.91560451900125]
State-of-the-art semantic segmentation methods require sufficient labeled data to achieve good results.
Few-shot segmentation is proposed to tackle this problem by learning a model that quickly adapts to new classes with a few labeled support samples.
These frameworks still face the challenge of reduced generalization to unseen classes due to inappropriate use of high-level semantic information.
arXiv Detail & Related papers (2020-08-04T10:41:32Z) - Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems [83.98774574197613]
We take one of the simplest inference methods, truncated max-product belief propagation, and add what is necessary to make it a proper component of a deep learning model.
This BP-Layer can be used as the final or an intermediate block in convolutional neural networks (CNNs).
The model is applicable to a range of dense prediction problems, is well-trainable and provides parameter-efficient and robust solutions in stereo, optical flow and semantic segmentation.
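For a flavor of the computation inside such a layer, here is a simplified 1D chain version of max-product belief propagation written in min-sum (negative-log) form with a truncated-linear pairwise cost; the real BP-Layer operates on images and is trained end-to-end, so treat this only as a sketch under those simplifications.

```python
import numpy as np

def chain_min_sum_bp(unary, lam=1.0, trunc=2.0):
    """Min-sum (log-domain max-product) BP on a chain of T variables with
    K labels.  unary: (T, K) costs.  The pairwise cost between labels i and
    j is lam * min(|i - j|, trunc), i.e. a truncated-linear smoothness term.
    Returns per-node beliefs (lower = better) and the argmin labeling."""
    T, K = unary.shape
    labels = np.arange(K)
    pairwise = lam * np.minimum(np.abs(labels[:, None] - labels[None, :]), trunc)

    fwd = np.zeros((T, K))   # fwd[t]: message arriving at node t from the left
    bwd = np.zeros((T, K))   # bwd[t]: message arriving at node t from the right
    for t in range(1, T):
        fwd[t] = np.min(fwd[t - 1][:, None] + unary[t - 1][:, None] + pairwise, axis=0)
    for t in range(T - 2, -1, -1):
        bwd[t] = np.min(bwd[t + 1][:, None] + unary[t + 1][:, None] + pairwise.T, axis=0)

    beliefs = unary + fwd + bwd
    return beliefs, beliefs.argmin(axis=1)

# Toy usage: 5 nodes, 3 labels, noisy unaries that mildly favor label 1.
rng = np.random.default_rng(0)
unary = rng.random((5, 3))
unary[:, 1] -= 0.5
print(chain_min_sum_bp(unary)[1])
```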
arXiv Detail & Related papers (2020-03-13T13:11:35Z)