Should You Mask 15% in Masked Language Modeling?
- URL: http://arxiv.org/abs/2202.08005v1
- Date: Wed, 16 Feb 2022 11:42:34 GMT
- Title: Should You Mask 15% in Masked Language Modeling?
- Authors: Alexander Wettig, Tianyu Gao, Zexuan Zhong, Danqi Chen
- Abstract summary: Masked language models conventionally use a masking rate of 15%.
We find that masking up to 40% of input tokens can outperform the 15% baseline.
- Score: 86.91486000124156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked language models conventionally use a masking rate of 15% due to the
belief that more masking would provide insufficient context to learn good
representations, and less masking would make training too expensive.
Surprisingly, we find that masking up to 40% of input tokens can outperform the
15% baseline, and even masking 80% can preserve most of the performance, as
measured by fine-tuning on downstream tasks. Increasing the masking rates has
two distinct effects, which we investigate through careful ablations: (1) A
larger proportion of input tokens are corrupted, reducing the context size and
creating a harder task, and (2) models perform more predictions, which benefits
training. We observe that larger models in particular favor higher masking
rates, as they have more capacity to perform the harder task. We also connect
our findings to sophisticated masking schemes such as span masking and PMI
masking, as well as BERT's curious 80-10-10 corruption strategy, and find that
simple uniform masking with [MASK] replacements can be competitive at higher
masking rates. Our results contribute to a better understanding of masked
language modeling and point to new avenues for efficient pre-training.
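
To make the two knobs discussed in the abstract concrete, the sketch below implements uniform masking at a configurable rate together with BERT's 80-10-10 corruption rule. This is a minimal illustration under stated assumptions, not the authors' code: the token ids, vocabulary size, and special-token ids are hypothetical placeholders, and calling it with mask_prob=0.4 and replace_probs=(1.0, 0.0, 0.0) corresponds to the simple uniform masking with [MASK]-only replacement at a 40% rate that the abstract reports as competitive.

```python
import random

# Hypothetical placeholder ids for illustration only (not tied to a specific tokenizer).
MASK_ID = 103        # id of the [MASK] token in a BERT-style vocabulary
VOCAB_SIZE = 30522   # vocabulary size used when sampling random replacement tokens
IGNORE_INDEX = -100  # label value that the MLM loss ignores

def mask_tokens(token_ids, mask_prob=0.15, replace_probs=(0.8, 0.1, 0.1), rng=random):
    """Corrupt a token sequence for masked language modeling.

    mask_prob     -- fraction of positions selected for prediction; 15% is the
                     conventional default, and the paper finds higher rates can help.
    replace_probs -- (p_mask, p_random, p_keep) applied to each selected position;
                     the default is BERT's 80-10-10 rule, while (1.0, 0.0, 0.0)
                     gives simple [MASK]-only replacement.
    Returns (corrupted_ids, labels); labels hold the original token at selected
    positions and IGNORE_INDEX elsewhere, so only selected positions enter the loss.
    """
    p_mask, p_random, p_keep = replace_probs
    assert abs(p_mask + p_random + p_keep - 1.0) < 1e-6

    corrupted = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                  # position not selected for prediction
        labels[i] = tok               # model must recover the original token here
        r = rng.random()
        if r < p_mask:
            corrupted[i] = MASK_ID                    # replace with [MASK]
        elif r < p_mask + p_random:
            corrupted[i] = rng.randrange(VOCAB_SIZE)  # replace with a random token
        # else: keep the original token unchanged
    return corrupted, labels

if __name__ == "__main__":
    ids = list(range(1000, 1020))  # a dummy 20-token sequence
    # 40% uniform masking with [MASK]-only replacement, the simple setting the paper reports as competitive.
    corrupted, labels = mask_tokens(ids, mask_prob=0.4, replace_probs=(1.0, 0.0, 0.0))
    print(corrupted)
    print(labels)
```

With the 80-10-10 default, 80% of the selected positions become [MASK], 10% become random tokens, and 10% stay unchanged; raising mask_prob both corrupts more of the context and yields more predictions per sequence, the two effects the abstract disentangles.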
Related papers
- MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models [91.4190318047519]
This work introduces MaskLLM, a learnable pruning method that establishes Semi-structured (or "N:M") Sparsity in Large Language Models.
MaskLLM explicitly models N:M patterns as a learnable distribution through Gumbel Softmax sampling.
arXiv Detail & Related papers (2024-09-26T02:37:41Z)
- ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate our strategy's superiority in downstream tasks compared to random masking.
arXiv Detail & Related papers (2024-07-17T22:04:00Z)
- MP-Former: Mask-Piloted Transformer for Image Segmentation [16.620469868310288]
Mask2Former suffers from inconsistent mask predictions between decoder layers.
We propose a mask-piloted training approach, which feeds noised ground-truth masks in masked-attention and trains the model to reconstruct the original ones.
arXiv Detail & Related papers (2023-03-13T17:57:59Z)
- Uniform Masking Prevails in Vision-Language Pretraining [26.513450527203453]
Masked Language Modeling (MLM) has proven to be an essential component of Vision-Language (VL) pretraining.
This paper shows that increasing the masking rate leads to gains in Image-Text Matching (ITM) tasks.
arXiv Detail & Related papers (2022-12-10T04:02:19Z)
- Towards Improved Input Masking for Convolutional Neural Networks [66.99060157800403]
We propose a new masking method for CNNs we call layer masking.
We show that our method can eliminate or minimize the influence of the mask's shape or color on the model's output.
We also demonstrate how the shape of the mask may leak information about the class, thus affecting estimates of model reliance on class-relevant features.
arXiv Detail & Related papers (2022-11-26T19:31:49Z)
- InforMask: Unsupervised Informative Masking for Language Model Pretraining [13.177839395411858]
We propose a new unsupervised masking strategy for training masked language models.
InforMask exploits Pointwise Mutual Information (PMI) to select the most informative tokens to mask; a rough sketch of PMI-based token scoring appears after this list.
arXiv Detail & Related papers (2022-10-21T07:10:56Z)
- Application of Yolo on Mask Detection Task [1.941730292017383]
Strict mask-wearing policies have been met with not only public sensation but also practical difficulty.
Existing technology to help automate mask checking uses deep learning models on real-time surveillance camera footage.
Our research proposes a new approach to mask detection by replacing Mask-R-CNN with a more efficient model, "YOLO".
arXiv Detail & Related papers (2021-02-10T12:34:47Z)
- Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model [57.77981008219654]
The Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training.
We propose a fully-explored masking strategy, where a text sequence is divided into a certain number of non-overlapping segments.
arXiv Detail & Related papers (2020-10-12T21:28:14Z)
- Masking as an Efficient Alternative to Finetuning for Pretrained Language Models [49.64561153284428]
We learn selective binary masks for pretrained weights in lieu of modifying them through finetuning.
In intrinsic evaluations, we show that representations computed by masked language models encode information necessary for solving downstream tasks.
arXiv Detail & Related papers (2020-04-26T15:03:47Z)
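
The abstract mentions PMI masking, and the InforMask entry above selects mask positions by Pointwise Mutual Information. The sketch below is a rough, hypothetical illustration of that general idea only: it estimates PMI for adjacent token pairs from corpus counts and prefers positions covered by high-PMI pairs. It is not the InforMask or PMI-Masking procedure; the toy corpus, whitespace tokenization, and budget logic are simplifying assumptions.

```python
import math
from collections import Counter

def bigram_pmi(corpus_tokens):
    """Estimate PMI(x, y) = log[p(x, y) / (p(x) * p(y))] for adjacent token pairs."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    return {
        (x, y): math.log((c / n_bi) / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        for (x, y), c in bigrams.items()
    }

def informative_positions(sentence_tokens, pmi, budget):
    """Pick up to `budget` positions, preferring those covered by high-PMI adjacent pairs."""
    scored_pairs = sorted(
        ((pmi.get(pair, float("-inf")), i)
         for i, pair in enumerate(zip(sentence_tokens, sentence_tokens[1:]))),
        reverse=True,
    )
    positions = set()
    for _, i in scored_pairs:
        if len(positions) >= budget:
            break
        positions.update({i, i + 1})  # cover both tokens of the high-PMI pair
    return sorted(positions)[:budget]

if __name__ == "__main__":
    corpus = "new york is a big city and new york never sleeps".split()
    pmi = bigram_pmi(corpus)
    sentence = "i love new york in the winter".split()
    print(informative_positions(sentence, pmi, budget=2))  # -> [2, 3], i.e. "new york"
```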