Position Masking for Language Models
- URL: http://arxiv.org/abs/2006.05676v1
- Date: Tue, 2 Jun 2020 23:40:41 GMT
- Title: Position Masking for Language Models
- Authors: Andy Wagner, Tiyasa Mitra, Mrinal Iyer, Godfrey Da Costa, Marc
Tremblay
- Abstract summary: Masked language modeling (MLM) pre-training models such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose to expand upon this idea by masking the positions of some tokens along with the masked input token ids.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked language modeling (MLM) pre-training models such as BERT corrupt the
input by replacing some tokens with [MASK] and then train a model to
reconstruct the original tokens. This is an effective technique which has led
to good results on all NLP benchmarks. We propose to expand upon this idea by
masking the positions of some tokens along with the masked input token ids. We
follow the same standard approach as BERT, masking a percentage of the token
positions and then predicting their original values using an additional fully
connected classifier stage. This approach has shown a good performance gain
(0.3% improvement) on SQuAD, along with an additional improvement in
convergence times. For the Graphcore IPU, the convergence of BERT Base with
position masking requires only 50% of the tokens from the original BERT paper.
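As a rough illustration of the idea in the abstract, the following is a minimal PyTorch sketch (not the authors' implementation) of position masking layered on standard MLM: a percentage of slots have both their token id and their position id masked, and an additional fully connected classifier predicts the original position index alongside the usual token prediction. The helper names, the placeholder position id, the simplified all-[MASK] corruption, and the equal loss weighting are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class PositionMaskingHead(nn.Module):
    """Additional fully connected classifier that predicts the original
    position index (treated here as a class in 0..max_len-1) of each
    position-masked slot, mirroring the token-level MLM head."""

    def __init__(self, hidden_size: int, max_len: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, max_len)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden) -> (batch, seq_len, max_len)
        return self.classifier(hidden_states)


def mask_tokens_and_positions(input_ids, position_ids, mask_token_id,
                              mask_position_id=0, prob=0.15):
    """Mask a percentage of slots: replace the token id with [MASK] and the
    position id with a placeholder, keeping both originals as labels.
    (Simplification: every selected slot becomes [MASK]; BERT's 80/10/10
    replacement split is omitted.)"""
    selected = torch.rand_like(input_ids, dtype=torch.float) < prob
    token_labels = input_ids.masked_fill(~selected, -100)       # -100 = ignore
    position_labels = position_ids.masked_fill(~selected, -100)
    corrupted_ids = input_ids.masked_fill(selected, mask_token_id)
    corrupted_pos = position_ids.masked_fill(selected, mask_position_id)
    return corrupted_ids, corrupted_pos, token_labels, position_labels


def position_masking_loss(token_logits, position_logits,
                          token_labels, position_labels):
    """Standard MLM cross-entropy plus the position-classification
    cross-entropy from the extra head (equal weighting assumed)."""
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    token_loss = ce(token_logits.flatten(0, 1), token_labels.flatten())
    position_loss = ce(position_logits.flatten(0, 1), position_labels.flatten())
    return token_loss + position_loss
```

In a full pipeline, `token_logits` would come from the usual MLM head and `position_logits` from `PositionMaskingHead` applied to the same encoder output.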
Related papers
- SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [93.94454894142413]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP)
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z)
- Emerging Property of Masked Token for Effective Pre-training [15.846621577804791]
Masked Image Modeling (MIM) has been instrumental in driving recent breakthroughs in computer vision.
MIM's overall efficiency is occasionally hampered by the lengthy duration of the pre-training phase.
We propose a novel approach termed masked token optimization (MTO), specifically designed to improve model efficiency through weight recalibration and the enhancement of the key property of masked tokens.
arXiv Detail & Related papers (2024-04-12T08:46:53Z)
- Object Recognition as Next Token Prediction [99.40793702627396]
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels.
arXiv Detail & Related papers (2023-12-04T18:58:40Z)
- Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
Their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z)
- Learning with Unmasked Tokens Drives Stronger Vision Learners [39.752789949834536]
Masked image modeling (MIM) has become a leading self-supervised learning strategy.
We improve MIM by explicitly incorporating unmasked tokens into the training process.
We achieve 84.2% top-1 accuracy with ViT-B on ImageNet-1K, a 0.6%p gain.
arXiv Detail & Related papers (2023-10-20T15:42:47Z)
- Weighted Sampling for Masked Language Modeling [12.25238763907731]
We propose two simple and effective Weighted Sampling strategies for masking tokens based on the token frequency and training loss.
We apply these two strategies to BERT and obtain Weighted-Sampled BERT (WSBERT)
arXiv Detail & Related papers (2023-02-28T01:07:39Z)
- Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs)
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
arXiv Detail & Related papers (2022-10-12T17:54:32Z)
- Token Dropping for Efficient BERT Pretraining [33.63507016806947]
We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models.
We leverage the already built-in masked language modeling (MLM) loss to identify unimportant tokens with practically no computational overhead.
This simple approach reduces the pretraining cost of BERT by 25% while achieving similar overall fine-tuning performance on standard downstream tasks.
arXiv Detail & Related papers (2022-03-24T17:50:46Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once)
The model consists of an encoder, a decoder, and a position dependent summarizer (PDS)
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
arXiv Detail & Related papers (2020-03-23T21:17:42Z)
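For contrast with the MLM objective used by the main paper, here is a minimal, hypothetical sketch of the replaced token detection task summarized in the ELECTRA entry above (not the released ELECTRA code): a small generator fills in the masked slots, and a discriminator is trained to label every token as original or replaced. The `generator` and `discriminator` callables, their output shapes, and the default loss weight are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def replaced_token_detection_step(generator, discriminator, input_ids,
                                  mask_token_id, mlm_prob=0.15,
                                  disc_weight=50.0):
    """One sketched pre-training step of replaced token detection.
    Assumes generator(ids) -> (B, T, vocab) logits and
    discriminator(ids) -> (B, T, 1) logits."""
    # 1) Mask a subset of tokens, as in standard MLM.
    masked = torch.rand_like(input_ids, dtype=torch.float) < mlm_prob
    masked_ids = input_ids.masked_fill(masked, mask_token_id)

    # 2) The generator proposes tokens for the masked slots (sampled, not argmax).
    gen_logits = generator(masked_ids)
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(masked, sampled, input_ids)

    # The generator keeps its ordinary MLM loss on the masked slots.
    mlm_labels = input_ids.masked_fill(~masked, -100)
    gen_loss = F.cross_entropy(gen_logits.flatten(0, 1), mlm_labels.flatten(),
                               ignore_index=-100)

    # 3) The discriminator predicts, per token, whether it was replaced.
    #    A slot counts as replaced only if the sample differs from the original.
    is_replaced = (corrupted != input_ids).float()
    disc_logits = discriminator(corrupted).squeeze(-1)
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

    return gen_loss + disc_weight * disc_loss
```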