Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model
- URL: http://arxiv.org/abs/2010.06040v2
- Date: Wed, 14 Oct 2020 04:45:59 GMT
- Title: Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model
- Authors: Mingzhi Zheng, Dinghan Shen, Yelong Shen, Weizhu Chen, Lin Xiao
- Abstract summary: The Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training.
We propose a fully-explored masking strategy, where a text sequence is divided into a certain number of non-overlapping segments.
- Score: 57.77981008219654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Masked Language Model (MLM) framework has been widely adopted
for self-supervised language pre-training. In this paper, we argue that the
randomly sampled masks used in MLM lead to undesirably large gradient variance.
We quantify this variance theoretically by relating the gradient covariance to
the Hamming distance between two different masks of the same text sequence. To
reduce the variance introduced by mask sampling, we propose a fully-explored
masking strategy: a text sequence is divided into a fixed number of
non-overlapping segments, and the tokens within one segment are masked for each
training instance. We prove that the gradients derived from this masking scheme
have smaller variance and therefore enable more efficient self-supervised
training. We conduct extensive experiments on both continual pre-training and
general pre-training from scratch. Empirical results confirm that the new
masking strategy consistently outperforms standard random masking. Detailed
efficiency analysis and ablation studies further validate the advantages of the
fully-explored masking strategy under the MLM framework.
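To make the masking scheme concrete, below is a minimal sketch (not code from the paper) of how fully-explored masking could be implemented: the token positions of a sequence are partitioned into a chosen number of non-overlapping segments, and each training instance masks exactly the positions of one segment, so every position is masked exactly once across the segments. The function names, the random (rather than contiguous) partition, and the single [MASK]-id replacement are illustrative assumptions.

```python
import random
from typing import List, Tuple

def fully_explored_masks(seq_len: int, num_segments: int,
                         rng: random.Random) -> List[List[int]]:
    """Partition token positions into `num_segments` non-overlapping segments.

    Each returned position list is the mask set for one training instance,
    so every position is masked exactly once across the segments.
    """
    positions = list(range(seq_len))
    rng.shuffle(positions)  # random partition; contiguous chunks are another option (assumption)
    return [sorted(positions[i::num_segments]) for i in range(num_segments)]

def apply_mask(token_ids: List[int], mask_positions: List[int],
               mask_token_id: int) -> Tuple[List[int], List[int]]:
    """Replace the chosen positions with the [MASK] id; label other positions -100."""
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 is ignored by typical MLM losses
    for p in mask_positions:
        labels[p] = token_ids[p]
        inputs[p] = mask_token_id
    return inputs, labels

# Usage: four masked views of one toy sequence, jointly covering every token once.
rng = random.Random(0)
token_ids = [101, 2023, 2003, 1037, 7099, 6251, 102]  # toy ids
for segment in fully_explored_masks(len(token_ids), num_segments=4, rng=rng):
    masked_inputs, labels = apply_mask(token_ids, segment, mask_token_id=103)
    print(masked_inputs, labels)
```

In practice one would also exclude special tokens such as [CLS] and [SEP] from the candidate positions and may keep BERT-style 80/10/10 token replacement; the abstract does not specify these details, so they are omitted from the sketch.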
Related papers
- Bridge the Points: Graph-based Few-shot Segment Anything Semantically [79.1519244940518]
Recent advancements in pre-training techniques have enhanced the capabilities of vision foundation models.
Recent studies extend SAM to Few-shot Semantic Segmentation (FSS).
We propose a simple yet effective approach based on graph analysis.
arXiv Detail & Related papers (2024-10-09T15:02:28Z)
- Difference-Masking: Choosing What to Mask in Continued Pretraining [56.76782116221438]
We introduce Difference-Masking, a masking strategy that automatically chooses what to mask during continued pretraining.
We find that Difference-Masking outperforms baselines on continued pretraining settings across four diverse language-only and multimodal video tasks.
arXiv Detail & Related papers (2023-05-23T23:31:02Z)
- Uniform Masking Prevails in Vision-Language Pretraining [26.513450527203453]
Masked Language Modeling (MLM) has proven to be an essential component of Vision-Language (VL) pretraining.
This paper shows that increasing the masking rate leads to gains in Image-Text Matching (ITM) tasks.
arXiv Detail & Related papers (2022-12-10T04:02:19Z)
- Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z)
- Learning Better Masking for Better Language Model Pre-training [80.31112722910787]
Masked Language Modeling has been widely used as a denoising objective in pre-training language models (PrLMs).
PrLMs commonly adopt a Random-Token Masking strategy in which a fixed masking ratio is applied and different contents are masked with equal probability throughout training.
We propose two scheduled masking approaches that adaptively tune the masking ratio and masked content in different training stages.
arXiv Detail & Related papers (2022-08-23T08:27:52Z)
- On the Inductive Bias of Masked Language Modeling: From Statistical to Syntactic Dependencies [8.370942516424817]
Masking and predicting tokens in an unsupervised fashion can give rise to linguistic structures and downstream performance gains.
Recent theories have suggested that pretrained language models acquire useful inductive biases through masks that implicitly act as cloze reductions.
We show that the success of the random masking strategy used in practice cannot be explained by such cloze-like masks alone.
arXiv Detail & Related papers (2021-04-12T17:55:27Z)
- PMI-Masking: Principled masking of correlated spans [46.36098771676867]
Masking tokens uniformly at random constitutes a common flaw in the pretraining of Masked Language Models (MLMs)
We propose PMI-Masking, a principled masking strategy based on the concept of Pointwise Mutual Information (PMI)
We show experimentally that PMI-Masking reaches the performance of prior masking approaches in half the training time, and consistently improves performance at the end of training.
arXiv Detail & Related papers (2020-10-05T07:19:52Z)
- Variance-reduced Language Pretraining via a Mask Proposal Network [5.819397109258169]
Self-supervised learning, a.k.a. pretraining, is important in natural language processing.
In this paper, we tackle the problem from the view of gradient variance reduction.
To improve efficiency, we introduce a MAsk Proposal Network (MAPNet), which approximates the optimal mask proposal distribution.
arXiv Detail & Related papers (2020-08-12T14:12:32Z)
- Semi-Autoregressive Training Improves Mask-Predict Decoding [119.8412758943192]
We introduce a new training method for conditional masked language models, SMART, which mimics the semi-autoregressive behavior of mask-predict.
Models trained with SMART produce higher-quality translations when using mask-predict decoding, effectively closing the remaining performance gap with fully autoregressive models.
arXiv Detail & Related papers (2020-01-23T19:56:35Z)