PMI-Masking: Principled masking of correlated spans
- URL: http://arxiv.org/abs/2010.01825v1
- Date: Mon, 5 Oct 2020 07:19:52 GMT
- Title: PMI-Masking: Principled masking of correlated spans
- Authors: Yoav Levine, Barak Lenz, Opher Lieber, Omri Abend, Kevin Leyton-Brown,
Moshe Tennenholtz, Yoav Shoham
- Abstract summary: Masking tokens uniformly at random constitutes a common flaw in the pretraining of Masked Language Models (MLMs).
We propose PMI-Masking, a principled masking strategy based on the concept of Pointwise Mutual Information (PMI).
We show experimentally that PMI-Masking reaches the performance of prior masking approaches in half the training time, and consistently improves performance at the end of training.
- Score: 46.36098771676867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masking tokens uniformly at random constitutes a common flaw in the
pretraining of Masked Language Models (MLMs) such as BERT. We show that such
uniform masking allows an MLM to minimize its training objective by latching
onto shallow local signals, leading to pretraining inefficiency and suboptimal
downstream performance. To address this flaw, we propose PMI-Masking, a
principled masking strategy based on the concept of Pointwise Mutual
Information (PMI), which jointly masks a token n-gram if it exhibits high
collocation over the corpus. PMI-Masking motivates, unifies, and improves upon
prior more heuristic approaches that attempt to address the drawback of random
uniform token masking, such as whole-word masking, entity/phrase masking, and
random-span masking. Specifically, we show experimentally that PMI-Masking
reaches the performance of prior masking approaches in half the training time,
and consistently improves performance at the end of training.
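To make the masking criterion concrete, the sketch below builds a masking vocabulary from corpus counts using the textbook PMI score, log p(w1..wn) - sum_i log p(wi), and then greedily masks whole high-PMI spans up to a token budget. This is a minimal illustration under assumed inputs (a tokenized corpus given as a Python list, a top_k cutoff, a 15% masking budget); the paper's actual ranking measure for n-grams and its training pipeline are more refined than this naive formulation.

```python
from collections import Counter
from itertools import islice
from math import log

def ngrams(tokens, n):
    """Yield all contiguous n-grams of a token list as tuples."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

def build_pmi_vocab(corpus_tokens, n=2, top_k=1000, min_count=5):
    """Rank n-grams by naive PMI: log p(w1..wn) - sum_i log p(wi).

    A high score means the n-gram co-occurs far more often than its
    unigram frequencies alone would predict, i.e. it is a collocation.
    """
    total = len(corpus_tokens)
    unigram_counts = Counter(corpus_tokens)
    ngram_counts = Counter(ngrams(corpus_tokens, n))
    total_ngrams = max(total - n + 1, 1)

    scores = {}
    for gram, count in ngram_counts.items():
        if count < min_count:          # ignore rare, unreliable n-grams
            continue
        joint = log(count / total_ngrams)
        indep = sum(log(unigram_counts[w] / total) for w in gram)
        scores[gram] = joint - indep
    return set(sorted(scores, key=scores.get, reverse=True)[:top_k])

def choose_mask_positions(tokens, pmi_vocab, n=2, mask_budget=0.15):
    """Greedily mask whole high-PMI spans first, up to ~15% of tokens.

    (A full implementation would spend any leftover budget on random
    single-token masking; that fallback is omitted here.)
    """
    budget = int(len(tokens) * mask_budget)
    masked = set()
    for i, gram in enumerate(ngrams(tokens, n)):
        if len(masked) >= budget:
            break
        if gram in pmi_vocab and masked.isdisjoint(range(i, i + n)):
            masked.update(range(i, i + n))   # mask the span jointly
    return sorted(masked)
```

Masking a high-PMI span as a single unit removes the shortcut the abstract describes: when only one token of a strong collocation is hidden, its unmasked neighbors give the answer away almost for free.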
Related papers
- Emerging Property of Masked Token for Effective Pre-training [15.846621577804791]
Masked Image Modeling (MIM) has been instrumental in driving recent breakthroughs in computer vision.
MIM's overall efficiency is occasionally hampered by the lengthy duration of the pre-training phase.
We propose a novel approach termed masked token optimization (MTO), specifically designed to improve model efficiency through weight recalibration and the enhancement of the key property of masked tokens.
arXiv Detail & Related papers (2024-04-12T08:46:53Z)
- Salience-Based Adaptive Masking: Revisiting Token Dynamics for Enhanced Pre-training [33.39585710223628]
Saliency-Based Adaptive Masking improves pre-training performance of MIM approaches by prioritizing token salience.
We show that our method significantly improves over the state-of-the-art in mask-based pre-training on the ImageNet-1K dataset.
arXiv Detail & Related papers (2024-04-12T08:38:51Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT)
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Difference-Masking: Choosing What to Mask in Continued Pretraining [56.76782116221438]
We introduce Difference-Masking, a masking strategy that automatically chooses what to mask during continued pretraining.
We find that Difference-Masking outperforms baselines on continued pretraining settings across four diverse language-only and multimodal video tasks.
arXiv Detail & Related papers (2023-05-23T23:31:02Z)
- Uniform Masking Prevails in Vision-Language Pretraining [26.513450527203453]
Masked Language Modeling (MLM) has proven to be an essential component of Vision-Language (VL) pretraining.
This paper shows that increasing the masking rate leads to gains in Image-Text Matching (ITM) tasks.
arXiv Detail & Related papers (2022-12-10T04:02:19Z)
- Learning Better Masking for Better Language Model Pre-training [80.31112722910787]
Masked Language Modeling has been widely used as a denoising objective in pre-training language models (PrLMs).
PrLMs commonly adopt a Random-Token Masking strategy in which a fixed masking ratio is applied and all contents are masked with equal probability throughout training.
We propose two scheduled masking approaches that adaptively tune the masking ratio and masked content in different training stages.
arXiv Detail & Related papers (2022-08-23T08:27:52Z)
- Self-Supervised Visual Representations Learning by Contrastive Mask Prediction [129.25459808288025]
We propose a novel contrastive mask prediction (CMP) task for visual representation learning.
MaskCo contrasts region-level features instead of view-level features, which makes it possible to identify the positive sample without any assumptions.
We evaluate MaskCo on training datasets beyond ImageNet and compare its performance with MoCo V2.
arXiv Detail & Related papers (2021-08-18T02:50:33Z)
- Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model [57.77981008219654]
The Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training.
We propose a fully-explored masking strategy, where a text sequence is divided into a certain number of non-overlapping segments.
arXiv Detail & Related papers (2020-10-12T21:28:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.