Learning Better Masking for Better Language Model Pre-training
- URL: http://arxiv.org/abs/2208.10806v3
- Date: Thu, 25 May 2023 09:05:53 GMT
- Title: Learning Better Masking for Better Language Model Pre-training
- Authors: Dongjie Yang, Zhuosheng Zhang, Hai Zhao
- Abstract summary: Masked Language Modeling has been widely used as the denoising objective in pre-training language models (PrLMs).
PrLMs commonly adopt a Random-Token Masking strategy in which a fixed masking ratio is applied and different contents are masked with equal probability throughout the entire training.
We propose two scheduled masking approaches that adaptively tune the masking ratio and masked content in different training stages.
- Score: 80.31112722910787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked Language Modeling (MLM) has been widely used as the denoising
objective in pre-training language models (PrLMs). Existing PrLMs commonly
adopt a Random-Token Masking strategy in which a fixed masking ratio is applied
and different contents are masked with equal probability throughout the entire
training. However, the model is affected by its pre-training status in complex
ways, and that status changes as training goes on. In this paper, we show that
such time-invariant MLM settings for masking ratio and masked content are
unlikely to deliver an optimal outcome, which motivates us to explore the
influence of time-variant MLM settings. We propose two scheduled masking
approaches that adaptively tune the masking ratio and masked content at
different training stages, improving pre-training efficiency and effectiveness,
as verified on downstream tasks. Our work is a pioneering study of time-variant
masking strategies for ratio and content, and it offers a better understanding
of how masking ratio and masked content influence MLM pre-training.
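As a concrete illustration of the scheduled masking idea, the sketch below decays the masking ratio over training and re-weights which tokens are masked as training progresses. The linear 30%-to-15% schedule and the token-length proxy for content difficulty are illustrative assumptions for this sketch, not the paper's actual schedules.

```python
import random

def masking_ratio(step, total_steps, start=0.30, end=0.15):
    """Time-variant masking ratio: linear decay from `start` to `end`.
    (Illustrative schedule; the endpoints and shape are assumptions.)"""
    t = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * t

def content_scores(tokens, step, total_steps):
    """Time-variant content weighting. As a stand-in for a content-aware
    criterion, longer tokens are treated as 'harder' and receive more
    weight as training progresses."""
    t = min(step / max(total_steps, 1), 1.0)
    return [1.0 + t * min(len(tok) / 10.0, 1.0) for tok in tokens]

def scheduled_mask(tokens, step, total_steps, mask_token="[MASK]", rng=random):
    """Mask one token sequence according to the current training step."""
    ratio = masking_ratio(step, total_steps)
    scores = content_scores(tokens, step, total_steps)
    n_mask = max(1, round(ratio * len(tokens)))
    # Weighted sampling without replacement (Efraimidis-Spirakis keys):
    # higher-scored tokens are more likely to be chosen for masking.
    keys = {i: rng.random() ** (1.0 / s) for i, s in enumerate(scores)}
    picked = set(sorted(keys, key=keys.get, reverse=True)[:n_mask])
    return [mask_token if i in picked else tok for i, tok in enumerate(tokens)]

# The same sentence is masked more heavily early in training than at the end.
sent = "scheduled masking adapts the ratio and content over time".split()
print(scheduled_mask(sent, step=0, total_steps=100_000))
print(scheduled_mask(sent, step=100_000, total_steps=100_000))
```

In a real pre-training pipeline this selection would run inside the data collator, and the content criterion would come from the method's own scoring rather than token length.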
Related papers
- ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate that our strategy outperforms random masking on downstream tasks.
arXiv Detail & Related papers (2024-07-17T22:04:00Z)
- CL-MAE: Curriculum-Learned Masked Autoencoders [49.24994655813455]
We propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task.
We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE.
arXiv Detail & Related papers (2023-08-31T09:13:30Z)
- Masked and Permuted Implicit Context Learning for Scene Text Recognition [8.742571493814326]
Scene Text Recognition (STR) is difficult because of variations in text styles, shapes, and backgrounds.
We propose a masked and permuted implicit context learning network for STR within a single decoder.
arXiv Detail & Related papers (2023-05-25T15:31:02Z)
- Difference-Masking: Choosing What to Mask in Continued Pretraining [56.76782116221438]
We introduce Difference-Masking, a masking strategy that automatically chooses what to mask during continued pretraining.
We find that Difference-Masking outperforms baselines on continued pretraining settings across four diverse language-only and multimodal video tasks.
arXiv Detail & Related papers (2023-05-23T23:31:02Z)
- Uniform Masking Prevails in Vision-Language Pretraining [26.513450527203453]
Masked Language Modeling (MLM) has proven to be an essential component of Vision-Language (VL) pretraining.
This paper shows that increasing the masking rate leads to gains in Image-Text Matching (ITM) tasks.
arXiv Detail & Related papers (2022-12-10T04:02:19Z)
- Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model [57.77981008219654]
The Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training.
We propose a fully-explored masking strategy, where a text sequence is divided into a certain number of non-overlapping segments.
arXiv Detail & Related papers (2020-10-12T21:28:14Z)
- PMI-Masking: Principled masking of correlated spans [46.36098771676867]
Masking tokens uniformly at random constitutes a common flaw in the pretraining of Masked Language Models (MLMs).
We propose PMI-Masking, a principled masking strategy based on the concept of Pointwise Mutual Information (PMI); a toy PMI scoring sketch appears after this list.
We show experimentally that PMI-Masking reaches the performance of prior masking approaches in half the training time, and consistently improves performance at the end of training.
arXiv Detail & Related papers (2020-10-05T07:19:52Z)
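For the PMI-Masking entry above, here is a minimal, self-contained sketch of the underlying statistic: adjacent token pairs are scored by pointwise mutual information, PMI(x, y) = log p(x, y) / (p(x) p(y)), and high-scoring pairs become candidates for being masked jointly as a span. The toy corpus and bigram-only scoring are simplifying assumptions, not the original method's full span construction.

```python
import math
from collections import Counter

def bigram_pmi(corpus_sentences):
    """Score adjacent token pairs by pointwise mutual information:
    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ).
    High-PMI bigrams (e.g. collocations) are candidates for being masked
    as a single span instead of as independent tokens."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    n_uni, n_bi = sum(unigrams.values()), max(sum(bigrams.values()), 1)
    pmi = {}
    for (x, y), c in bigrams.items():
        p_xy = c / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        pmi[(x, y)] = math.log(p_xy / (p_x * p_y))
    return pmi

# Toy usage: recurring pairs like ("masked", "language") score higher than
# chance pairings, so a PMI-based strategy would mask them jointly.
corpus = [
    "masked language modeling masks tokens".split(),
    "masked language models learn context".split(),
    "random tokens are masked with equal probability".split(),
]
scores = bigram_pmi(corpus)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3])
```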
This list is automatically generated from the titles and abstracts of the papers on this site.