Position Masking for Language Models
- URL: http://arxiv.org/abs/2006.05676v1
- Date: Tue, 2 Jun 2020 23:40:41 GMT
- Title: Position Masking for Language Models
- Authors: Andy Wagner, Tiyasa Mitra, Mrinal Iyer, Godfrey Da Costa, Marc
Tremblay
- Abstract summary: Masked language modeling (MLM) pre-training models such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose to expand upon this idea by masking the positions of some tokens along with the masked input token ids.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked language modeling (MLM) pre-training models such as BERT corrupt the
input by replacing some tokens with [MASK] and then train a model to
reconstruct the original tokens. This is an effective technique which has led
to good results on all NLP benchmarks. We propose to expand upon this idea by
masking the positions of some tokens along with the masked input token ids. We
follow the same standard approach as BERT, masking a percentage of the token
positions and then predicting their original values using an additional fully
connected classifier stage. This approach has shown a good performance gain
(0.3% improvement) on SQuAD, along with an additional improvement in
convergence times. For the Graphcore IPU, the convergence of BERT Base with
position masking requires only 50% of the tokens from the original BERT paper.
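As a rough illustration of the idea in the abstract, the following is a minimal PyTorch sketch (not the authors' implementation) of position masking layered on standard MLM: a percentage of slots have both their token id and their position id masked, and an additional fully connected classifier predicts the original position index alongside the usual token prediction. The helper names, the placeholder position id, the simplified all-[MASK] corruption, and the equal loss weighting are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class PositionMaskingHead(nn.Module):
    """Additional fully connected classifier that predicts the original
    position index (treated here as a class in 0..max_len-1) of each
    position-masked slot, mirroring the token-level MLM head."""

    def __init__(self, hidden_size: int, max_len: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, max_len)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden) -> (batch, seq_len, max_len)
        return self.classifier(hidden_states)


def mask_tokens_and_positions(input_ids, position_ids, mask_token_id,
                              mask_position_id=0, prob=0.15):
    """Mask a percentage of slots: replace the token id with [MASK] and the
    position id with a placeholder, keeping both originals as labels.
    (Simplification: every selected slot becomes [MASK]; BERT's 80/10/10
    replacement split is omitted.)"""
    selected = torch.rand_like(input_ids, dtype=torch.float) < prob
    token_labels = input_ids.masked_fill(~selected, -100)       # -100 = ignore
    position_labels = position_ids.masked_fill(~selected, -100)
    corrupted_ids = input_ids.masked_fill(selected, mask_token_id)
    corrupted_pos = position_ids.masked_fill(selected, mask_position_id)
    return corrupted_ids, corrupted_pos, token_labels, position_labels


def position_masking_loss(token_logits, position_logits,
                          token_labels, position_labels):
    """Standard MLM cross-entropy plus the position-classification
    cross-entropy from the extra head (equal weighting assumed)."""
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    token_loss = ce(token_logits.flatten(0, 1), token_labels.flatten())
    position_loss = ce(position_logits.flatten(0, 1), position_labels.flatten())
    return token_loss + position_loss
```

In a full pipeline, `token_logits` would come from the usual MLM head and `position_logits` from `PositionMaskingHead` applied to the same encoder output.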
Related papers
- SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [93.94454894142413]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP)
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z)
- Emerging Property of Masked Token for Effective Pre-training [15.846621577804791]
Masked Image Modeling (MIM) has been instrumental in driving recent breakthroughs in computer vision.
MIM's overall efficiency is occasionally hampered by the lengthy duration of the pre-training phase.
We propose a novel approach termed masked token optimization (MTO), specifically designed to improve model efficiency through weight recalibration and the enhancement of the key property of masked tokens.
arXiv Detail & Related papers (2024-04-12T08:46:53Z)
- Object Recognition as Next Token Prediction [99.40793702627396]
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels.
arXiv Detail & Related papers (2023-12-04T18:58:40Z)
- Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
Their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z)
- Learning with Unmasked Tokens Drives Stronger Vision Learners [39.752789949834536]
Masked image modeling (MIM) has become a leading self-supervised learning strategy.
We improve MIM by explicitly incorporating unmasked tokens into the training process.
We achieve 84.2% top-1 accuracy with ViT-B on ImageNet-1K, a 0.6%p gain.
arXiv Detail & Related papers (2023-10-20T15:42:47Z)
- Weighted Sampling for Masked Language Modeling [12.25238763907731]
We propose two simple and effective Weighted Sampling strategies for masking tokens based on the token frequency and training loss.
We apply these two strategies to BERT and obtain Weighted-Sampled BERT (WSBERT)
arXiv Detail & Related papers (2023-02-28T01:07:39Z)
- Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs)
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
arXiv Detail & Related papers (2022-10-12T17:54:32Z)
- Token Dropping for Efficient BERT Pretraining [33.63507016806947]
We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models.
We leverage the already built-in masked language modeling (MLM) loss to identify unimportant tokens with practically no computational overhead.
This simple approach reduces the pretraining cost of BERT by 25% while achieving similar overall fine-tuning performance on standard downstream tasks.
arXiv Detail & Related papers (2022-03-24T17:50:46Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once)
The model consists of an encoder, a decoder, and a position dependent summarizer (PDS)
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
arXiv Detail & Related papers (2020-03-23T21:17:42Z)
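For contrast with the MLM objective used by the main paper, here is a minimal, hypothetical sketch of the replaced token detection task summarized in the ELECTRA entry above (not the released ELECTRA code): a small generator fills in the masked slots, and a discriminator is trained to label every token as original or replaced. The `generator` and `discriminator` callables, their output shapes, and the default loss weight are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def replaced_token_detection_step(generator, discriminator, input_ids,
                                  mask_token_id, mlm_prob=0.15,
                                  disc_weight=50.0):
    """One sketched pre-training step of replaced token detection.
    Assumes generator(ids) -> (B, T, vocab) logits and
    discriminator(ids) -> (B, T, 1) logits."""
    # 1) Mask a subset of tokens, as in standard MLM.
    masked = torch.rand_like(input_ids, dtype=torch.float) < mlm_prob
    masked_ids = input_ids.masked_fill(masked, mask_token_id)

    # 2) The generator proposes tokens for the masked slots (sampled, not argmax).
    gen_logits = generator(masked_ids)
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(masked, sampled, input_ids)

    # The generator keeps its ordinary MLM loss on the masked slots.
    mlm_labels = input_ids.masked_fill(~masked, -100)
    gen_loss = F.cross_entropy(gen_logits.flatten(0, 1), mlm_labels.flatten(),
                               ignore_index=-100)

    # 3) The discriminator predicts, per token, whether it was replaced.
    #    A slot counts as replaced only if the sample differs from the original.
    is_replaced = (corrupted != input_ids).float()
    disc_logits = discriminator(corrupted).squeeze(-1)
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

    return gen_loss + disc_weight * disc_loss
```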