Improving BERT with Hybrid Pooling Network and Drop Mask
- URL: http://arxiv.org/abs/2307.07258v1
- Date: Fri, 14 Jul 2023 10:20:08 GMT
- Title: Improving BERT with Hybrid Pooling Network and Drop Mask
- Authors: Qian Chen, Wen Wang, Qinglin Zhang, Chong Deng, Ma Yukun, Siqi Zheng
- Abstract summary: BERT captures a rich hierarchy of linguistic information at different layers.
However, vanilla BERT uses the same self-attention mechanism in every layer to model these different contextual features.
We propose a HybridBERT model which combines self-attention and pooling networks to encode different contextual features in each layer.
- Score: 7.132769083122907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based pre-trained language models, such as BERT, achieve great
success in various natural language understanding tasks. Prior research found
that BERT captures a rich hierarchy of linguistic information at different
layers. However, vanilla BERT uses the same self-attention mechanism in every
layer to model these different contextual features. In this paper, we
propose a HybridBERT model which combines self-attention and pooling networks
to encode different contextual features in each layer. Additionally, we propose
a simple DropMask method to address the mismatch between pre-training and
fine-tuning caused by excessive use of special mask tokens during Masked
Language Modeling pre-training. Experiments show that HybridBERT outperforms
BERT in pre-training with lower loss, faster training speed (8% relative),
lower memory cost (13% relative), and also in transfer learning, with a 1.5%
relative improvement in accuracy on downstream tasks. Additionally, DropMask improves
accuracies of BERT on downstream tasks across various masking rates.
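As a rough illustration of the hybrid idea, here is a minimal PyTorch sketch of an encoder layer whose self-attention branch is complemented by a pooling branch. The class name, the windowed average-pooling choice, and the additive mixing are assumptions for illustration, not the paper's actual HybridBERT layer design.
```python
# Hypothetical sketch (not the paper's exact design): a hybrid encoder layer in
# which a self-attention branch and a windowed average-pooling branch encode the
# same hidden states and their outputs are mixed additively.
import torch
import torch.nn as nn


class HybridEncoderLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, pool_window=3, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                               batch_first=True)
        # Pooling branch: local average pooling over neighbouring tokens.
        self.pool = nn.AvgPool1d(pool_window, stride=1, padding=pool_window // 2)
        self.pool_proj = nn.Linear(d_model, d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        pool_out = self.pool_proj(self.pool(x.transpose(1, 2)).transpose(1, 2))
        x = self.norm1(x + self.drop(attn_out + pool_out))   # mix the two branches
        return self.norm2(x + self.drop(self.ffn(x)))


layer = HybridEncoderLayer()
print(layer(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```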
Related papers
- Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi [0.0]
We introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data.
Our two-tiered pre-training approach offers efficient alternatives for multilingual and code-mixed language understanding.
arXiv Detail & Related papers (2023-09-19T02:59:41Z)
- Weighted Sampling for Masked Language Modeling [12.25238763907731]
We propose two simple and effective Weighted Sampling strategies for masking tokens based on the token frequency and training loss.
We apply these two strategies to BERT and obtain Weighted-Sampled BERT (WSBERT).
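As a rough illustration of the frequency-based half of that idea, the hypothetical sketch below samples mask positions with probability inversely proportional to token frequency; the function name and inverse-frequency weighting are assumptions, not WSBERT's exact formulation.
```python
# Hypothetical sketch of frequency-weighted mask selection for MLM: rarer tokens
# receive a higher probability of being chosen as [MASK] targets. The weighting
# formula and function name are assumptions, not WSBERT's exact strategy.
import torch


def frequency_weighted_mask(input_ids, token_freq, mask_rate=0.15, smoothing=1e-3):
    # token_freq: corpus-level relative frequency per vocabulary id.
    freq = token_freq[input_ids]                       # (batch, seq_len)
    weights = 1.0 / (freq + smoothing)                 # rare token -> large weight
    probs = weights / weights.sum(dim=-1, keepdim=True)
    n_mask = max(1, int(mask_rate * input_ids.size(1)))
    # Sample mask positions without replacement, proportional to the weights.
    positions = torch.multinomial(probs, n_mask, replacement=False)
    mask = torch.zeros_like(input_ids, dtype=torch.bool)
    rows = torch.arange(input_ids.size(0)).unsqueeze(1)
    mask[rows, positions] = True
    return mask


ids = torch.randint(0, 1000, (2, 32))
freqs = torch.rand(1000)
print(frequency_weighted_mask(ids, freqs).sum(dim=1))  # tensor([4, 4])
```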
arXiv Detail & Related papers (2023-02-28T01:07:39Z)
- NarrowBERT: Accelerating Masked Language Model Pretraining and Inference [50.59811343945605]
We propose NarrowBERT, a modified transformer encoder that increases the throughput for masked language model pretraining by more than $2\times$.
NarrowBERT sparsifies the transformer model such that the self-attention queries and feedforward layers only operate on the masked tokens of each sentence during pretraining.
We show that NarrowBERT increases the throughput at inference time by as much as $3.5\times$ with minimal (or no) performance degradation on sentence encoding tasks like MNLI.
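A hypothetical sketch of that sparsification, assuming a PyTorch feed-forward sublayer: only the masked positions are gathered, transformed, and scattered back, so the FFN cost scales with the masking rate rather than the sequence length. The gather/scatter scheme and names are illustrative assumptions; NarrowBERT also narrows the self-attention queries.
```python
# Hypothetical sketch of the sparsification idea: run the costly feed-forward
# sublayer only on masked positions. Not NarrowBERT's exact implementation.
import torch
import torch.nn as nn


def narrow_ffn(hidden, mlm_mask, ffn):
    # hidden: (batch, seq_len, d_model); mlm_mask: (batch, seq_len) bool,
    # True where a token was masked and therefore needs a prediction.
    out = hidden.clone()
    out[mlm_mask] = ffn(hidden[mlm_mask])  # FFN touches only masked positions
    return out


ffn = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
hidden = torch.randn(2, 128, 768)
mlm_mask = torch.rand(2, 128) < 0.15
print(narrow_ffn(hidden, mlm_mask, ffn).shape)  # torch.Size([2, 128, 768])
```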
arXiv Detail & Related papers (2023-01-11T23:45:50Z)
- Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations, we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
- DecBERT: Enhancing the Language Understanding of BERT with Causal Attention Masks [33.558503823505056]
In this work, we focus on improving the position encoding ability of BERT with causal attention masks.
We propose a new pre-trained language model, DecBERT, and evaluate it on the GLUE benchmark.
Experimental results show that (1) the causal attention mask is effective for BERT on language understanding tasks; (2) our DecBERT model without position embeddings achieves comparable performance on the GLUE benchmark; and (3) our modification accelerates the pre-training process, and DecBERT achieves better overall performance than the baseline systems.
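For illustration, a minimal sketch of the masking mechanism itself (standard PyTorch attention; not DecBERT's full architecture or its exact arrangement of causal and bidirectional layers):
```python
# Illustration only: build a causal attention mask and apply it to a standard
# multi-head attention layer, so each token attends only to itself and earlier
# positions.
import torch
import torch.nn as nn

seq_len, d_model, n_heads = 16, 768, 12
# True marks positions that may NOT be attended to (i.e., future tokens).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
x = torch.randn(2, seq_len, d_model)
out, _ = attn(x, x, x, attn_mask=causal_mask)
print(out.shape)  # torch.Size([2, 16, 768])
```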
arXiv Detail & Related papers (2022-04-19T06:12:48Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- Incorporating BERT into Parallel Sequence Decoding with Adapters [82.65608966202396]
We propose to take two different BERT models as the encoder and decoder respectively, and fine-tune them by introducing simple and lightweight adapter modules.
We obtain a flexible and efficient model which is able to jointly leverage the information contained in the source-side and target-side BERT models.
Our framework is based on a parallel sequence decoding algorithm named Mask-Predict, considering the bidirectional and conditionally independent nature of BERT.
arXiv Detail & Related papers (2020-10-13T03:25:15Z)
- ConvBERT: Improving BERT with Span-based Dynamic Convolution [144.25748617961082]
BERT heavily relies on the global self-attention block and thus suffers from a large memory footprint and high computation cost.
We propose a novel span-based dynamic convolution to replace these self-attention heads and directly model local dependencies.
The new convolution heads, together with the remaining self-attention heads, form a mixed attention block that is more efficient at both global and local context learning.
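A simplified, hypothetical sketch of such a mixed block follows; the dimension split, head count, and plain depthwise convolution are assumptions standing in for ConvBERT's span-based dynamic convolution with input-dependent kernels.
```python
# Simplified, hypothetical sketch of a mixed attention block: the hidden
# dimension is split between a reduced self-attention branch (global context)
# and a convolution branch (local context).
import torch
import torch.nn as nn


class MixedAttentionBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=6, conv_kernel=9):
        super().__init__()
        half = d_model // 2
        self.attn_proj = nn.Linear(d_model, half)
        self.conv_proj = nn.Linear(d_model, half)
        self.self_attn = nn.MultiheadAttention(half, n_heads, batch_first=True)
        self.local_conv = nn.Conv1d(half, half, conv_kernel,
                                    padding=conv_kernel // 2, groups=half)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        a = self.attn_proj(x)
        attn_out, _ = self.self_attn(a, a, a)               # global dependencies
        c = self.conv_proj(x).transpose(1, 2)               # (batch, half, seq)
        conv_out = self.local_conv(c).transpose(1, 2)       # local dependencies
        return self.out(torch.cat([attn_out, conv_out], dim=-1))


block = MixedAttentionBlock()
print(block(torch.randn(2, 32, 768)).shape)  # torch.Size([2, 32, 768])
```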
arXiv Detail & Related papers (2020-08-06T07:43:19Z)
- BERT's output layer recognizes all hidden layers? Some Intriguing Phenomena and a simple way to boost BERT [53.63288887672302]
Bidirectional Encoder Representations from Transformers (BERT) have achieved tremendous success in many natural language processing (NLP) tasks.
We find that, surprisingly, the output layer of BERT can reconstruct the input sentence by directly taking each layer of BERT as input.
We propose a quite simple method to boost the performance of BERT.
arXiv Detail & Related papers (2020-01-25T13:35:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.