DecBERT: Enhancing the Language Understanding of BERT with Causal
Attention Masks
- URL: http://arxiv.org/abs/2204.08688v1
- Date: Tue, 19 Apr 2022 06:12:48 GMT
- Title: DecBERT: Enhancing the Language Understanding of BERT with Causal
Attention Masks
- Authors: Ziyang Luo, Yadong Xi, Jing Ma, Zhiwei Yang, Xiaoxi Mao, Changjie Fan,
Rongsheng Zhang
- Abstract summary: In this work, we focus on improving the position encoding ability of BERT with the causal attention masks.
We propose a new pre-trained language model DecBERT and evaluate it on the GLUE benchmark.
Experimental results show that (1) the causal attention mask is effective for BERT on the language understanding tasks; (2) our DecBERT model without position embeddings achieves comparable performance on the GLUE benchmark; and (3) our modification accelerates the pre-training process and DecBERT achieves better overall performance than the baseline systems.
- Score: 33.558503823505056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since 2017, Transformer-based models have played critical roles in various
downstream Natural Language Processing tasks. However, a common limitation of
the attention mechanism used in the Transformer Encoder is that it cannot
automatically capture word-order information, so explicit position embeddings
generally have to be fed into the target model. In contrast, the Transformer
Decoder with causal attention masks is naturally sensitive to
the word order. In this work, we focus on improving the position encoding
ability of BERT with the causal attention masks. Furthermore, we propose a new
pre-trained language model DecBERT and evaluate it on the GLUE benchmark.
Experimental results show that (1) the causal attention mask is effective for
BERT on the language understanding tasks; (2) our DecBERT model without
position embeddings achieves comparable performance on the GLUE benchmark; and
(3) our modification accelerates the pre-training process and DecBERT w/ PE
achieves better overall performance than the baseline systems when pre-training
with the same amount of computational resources.
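To make the contrast between the two attention variants concrete, here is a minimal PyTorch sketch of single-head scaled dot-product self-attention with an optional causal mask (an illustration only, not the authors' code; all names, shapes and the toy usage are assumptions). With the mask in place, position i can only attend to positions j <= i, so the hidden states depend on word order even when no position embeddings are used.

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v, causal=False):
    # x: (batch, seq_len, d_model); w_*: (d_model, d_model) projection matrices.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
    if causal:
        seq_len = x.size(1)
        # Lower-triangular mask: position i may only attend to positions j <= i.
        mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: without the causal mask (and without position embeddings) the layer is
# permutation-equivariant; with causal=True each position only sees its prefix, which
# is the order sensitivity DecBERT borrows from the decoder.
d = 16
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
x = torch.randn(1, 5, d)
print(self_attention(x, w_q, w_k, w_v, causal=False).shape)  # torch.Size([1, 5, 16])
print(self_attention(x, w_q, w_k, w_v, causal=True).shape)   # torch.Size([1, 5, 16])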
Related papers
- StableMask: Refining Causal Masking in Decoder-only Transformer [22.75632485195928]
The decoder-only Transformer architecture with causal masking and relative position encoding (RPE) has become the de facto choice in language modeling.
However, it requires all attention scores to be non-zero and sum up to 1, even if the current embedding has sufficient self-contained information.
We propose StableMask: a parameter-free method to address both limitations by refining the causal mask.
arXiv Detail & Related papers (2024-02-07T12:01:02Z)
- BPDec: Unveiling the Potential of Masked Language Modeling Decoder in BERT pretraining [0.5919433278490629]
BERT (Bidirectional Encoder Representations from Transformers) has revolutionized the field of natural language processing through its exceptional performance on numerous tasks.
DeBERTa introduced an enhanced decoder adapted to BERT's encoder for pretraining, which proved to be highly effective.
We argue that the design and research around enhanced masked language modeling decoders have been underappreciated.
arXiv Detail & Related papers (2024-01-29T03:25:11Z)
- Improving BERT with Hybrid Pooling Network and Drop Mask [7.132769083122907]
BERT captures a rich hierarchy of linguistic information at different layers.
Vanilla BERT uses the same self-attention mechanism in each layer to model different contextual features.
We propose a HybridBERT model which combines self-attention and pooling networks to encode different contextual features in each layer.
arXiv Detail & Related papers (2023-07-14T10:20:08Z)
- NarrowBERT: Accelerating Masked Language Model Pretraining and Inference [50.59811343945605]
We propose NarrowBERT, a modified transformer encoder that increases the throughput for masked language model pretraining by more than $2\times$.
NarrowBERT sparsifies the transformer model such that the self-attention queries and feedforward layers only operate on the masked tokens of each sentence during pretraining.
We show that NarrowBERT increases the throughput at inference time by as much as $3.5\times$ with minimal (or no) performance degradation on sentence encoding tasks like MNLI.
arXiv Detail & Related papers (2023-01-11T23:45:50Z)
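A rough sketch of the sparsification idea summarized above (an assumption-laden illustration, not NarrowBERT's actual implementation): queries are computed only for the masked positions, while keys and values still cover the whole sentence, so only the masked tokens receive contextualized updates.

import torch
import torch.nn.functional as F

def narrow_attention(x, masked_idx, w_q, w_k, w_v):
    # x: (batch, seq_len, d); masked_idx: (batch, n_masked) positions of [MASK] tokens.
    # Queries are built only for the masked positions; keys/values use the whole sequence.
    q = torch.gather(x, 1, masked_idx.unsqueeze(-1).expand(-1, -1, x.size(-1))) @ w_q
    k, v = x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5   # (batch, n_masked, seq_len)
    return F.softmax(scores, dim=-1) @ v                   # updates only for the masked tokens

# Hypothetical usage: 2 masked positions in a length-8 sentence.
d = 16
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
x = torch.randn(4, 8, d)
masked_idx = torch.tensor([[1, 5]] * 4)
print(narrow_attention(x, masked_idx, w_q, w_k, w_v).shape)  # torch.Size([4, 2, 16])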
- Word Order Matters when you Increase Masking [70.29624135819884]
We study the effect of removing position encodings on the pre-training objective itself, to test whether models can reconstruct position information from co-occurrences alone.
We find that the necessity of position information increases with the amount of masking, and that masked language models without position encodings are not able to reconstruct this information on the task.
arXiv Detail & Related papers (2022-11-08T18:14:04Z)
- Position Prediction as an Effective Pretraining Strategy [20.925906203643883]
We propose a novel but surprisingly simple alternative to content reconstruction -- predicting locations from content, without providing positional information for it.
Our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods.
arXiv Detail & Related papers (2022-07-15T17:10:48Z)
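The objective sketched above can be illustrated as follows (a hypothetical toy setup, not the paper's code): an encoder is run without any position embeddings, and a classification head must recover each token's original position from content alone.

import torch
import torch.nn as nn

# Hypothetical setup: a tiny position-free encoder plus a head that predicts
# each token's index; training this head is the pretraining objective.
vocab, d, max_len = 1000, 64, 32
embed = nn.Embedding(vocab, d)                      # token embeddings only, no position embeddings
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
pos_head = nn.Linear(d, max_len)                    # classify which position each token came from

tokens = torch.randint(0, vocab, (8, max_len))      # a batch of token ids
hidden = encoder(embed(tokens))                     # (8, max_len, d), no positional information supplied
logits = pos_head(hidden)                           # (8, max_len, max_len)
targets = torch.arange(max_len).expand(8, -1)       # the true position of every token
loss = nn.functional.cross_entropy(logits.reshape(-1, max_len), targets.reshape(-1))
loss.backward()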
- XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding [73.24847320536813]
This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders.
Our framework is inspired by the success of cross-modal encoders in visual-language tasks, while we alter the learning objective to cater to the language-heavy characteristics of NLU.
arXiv Detail & Related papers (2022-04-15T03:44:00Z)
- Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformer-based models.
We prove that eliminating the MASK token and computing the loss over the whole output are essential choices for improving performance.
arXiv Detail & Related papers (2021-04-20T00:09:37Z)
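A hedged illustration of that finding (an assumed setup, not the paper's code): corrupt the input with random token replacements instead of a [MASK] symbol, and compute the cross-entropy against the original sequence at every output position rather than only at the corrupted ones.

import torch
import torch.nn.functional as F

def full_output_mlm_loss(logits, original_ids):
    # logits: (batch, seq_len, vocab) from a model fed a corrupted sequence.
    # The loss covers every output position, not just the corrupted ones.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), original_ids.reshape(-1))

# Hypothetical corruption without a [MASK] token: replace 15% of tokens with random ids.
vocab, batch, seq_len = 1000, 4, 16
original = torch.randint(0, vocab, (batch, seq_len))
corrupt = torch.rand(batch, seq_len) < 0.15
inputs = torch.where(corrupt, torch.randint_like(original, vocab), original)
# The logits would come from the encoder run on `inputs`; random values stand in here.
logits = torch.randn(batch, seq_len, vocab)
print(full_output_mlm_loss(logits, original))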
- ConvBERT: Improving BERT with Span-based Dynamic Convolution [144.25748617961082]
BERT heavily relies on the global self-attention block and thus suffers a large memory footprint and high computation cost.
We propose a novel span-based dynamic convolution to replace some of the self-attention heads and directly model local dependencies.
The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning.
arXiv Detail & Related papers (2020-08-06T07:43:19Z)
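A simplified sketch of a dynamic convolution head in the spirit of the summary above (an illustration under stated assumptions: the per-position kernels here are generated from a single token, whereas ConvBERT's span-based variant derives them from a local span):

import torch
import torch.nn.functional as F

def dynamic_conv_head(x, w_kernel, kernel_size=5):
    # x: (batch, seq_len, d); w_kernel: (d, kernel_size) generates a per-position kernel
    # from each token; the output mixes each token with its local window using that kernel.
    kernels = F.softmax(x @ w_kernel, dim=-1)                       # (b, t, k) input-dependent weights
    pad = kernel_size // 2
    windows = F.pad(x, (0, 0, pad, pad)).unfold(1, kernel_size, 1)  # (b, t, d, k) local neighbourhoods
    return torch.einsum("btk,btdk->btd", kernels, windows)          # convolve with the dynamic kernel

# Hypothetical usage alongside ordinary attention heads (the "mixed attention" idea).
d = 16
x = torch.randn(2, 10, d)
w_kernel = torch.randn(d, 5)
print(dynamic_conv_head(x, w_kernel).shape)   # torch.Size([2, 10, 16])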
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
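As an illustration of what a fixed, non-learnable attentive pattern can look like (one plausible example, not necessarily one of the paper's patterns), the head below always attends to the previous token and only the value projection remains learnable.

import torch

def previous_token_head(x, w_v):
    # A fixed attentive pattern: every position attends entirely to the previous token
    # (the first position attends to itself). Only w_v is a learnable parameter.
    seq_len = x.size(1)
    attn = torch.zeros(seq_len, seq_len)
    attn[0, 0] = 1.0
    attn[torch.arange(1, seq_len), torch.arange(0, seq_len - 1)] = 1.0  # row i -> column i-1
    return attn @ (x @ w_v)   # no query/key projections; the pattern itself is not learned

d = 16
x = torch.randn(2, 6, d)
w_v = torch.randn(d, d)
print(previous_token_head(x, w_v).shape)   # torch.Size([2, 6, 16])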
This list is automatically generated from the titles and abstracts of the papers on this site.