Weighted Sampling for Masked Language Modeling
- URL: http://arxiv.org/abs/2302.14225v2
- Date: Wed, 24 May 2023 04:31:24 GMT
- Title: Weighted Sampling for Masked Language Modeling
- Authors: Linhan Zhang, Qian Chen, Wen Wang, Chong Deng, Xin Cao, Kongzhang Hao,
Yuxin Jiang, Wei Wang
- Abstract summary: We propose two simple and effective Weighted Sampling strategies for masking tokens based on the token frequency and training loss.
We apply these two strategies to BERT and obtain Weighted-Sampled BERT (WSBERT).
- Score: 12.25238763907731
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked Language Modeling (MLM) is widely used to pretrain language models.
The standard random masking strategy in MLM causes the pre-trained language
models (PLMs) to be biased toward high-frequency tokens. Representation
learning of rare tokens is poor and PLMs have limited performance on downstream
tasks. To alleviate this frequency bias issue, we propose two simple and
effective Weighted Sampling strategies for masking tokens based on the token
frequency and training loss. We apply these two strategies to BERT and obtain
Weighted-Sampled BERT (WSBERT). Experiments on the Semantic Textual Similarity
benchmark (STS) show that WSBERT significantly improves sentence embeddings
over BERT. Combining WSBERT with calibration methods and prompt learning
further improves sentence embeddings. We also investigate fine-tuning WSBERT on
the GLUE benchmark and show that Weighted Sampling also improves the transfer
learning capability of the backbone PLM. We further analyze and provide
insights into how WSBERT improves token embeddings.
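To make the idea concrete, here is a minimal sketch of frequency-based weighted masking: instead of picking masked positions uniformly at random, positions are sampled with probabilities that grow as a token's corpus frequency shrinks, so rare tokens are masked (and therefore predicted) more often. The inverse-frequency weighting, the smoothing exponent, and the 15% masking budget are illustrative assumptions, not the exact formulation in the paper; the loss-based variant would swap the frequency table for a running per-token loss estimate.

```python
import torch

def weighted_masking(input_ids, token_freq, mask_token_id, mask_ratio=0.15, alpha=0.5):
    """Pick positions to mask with probability inversely related to token frequency.

    input_ids:  LongTensor [seq_len] of token ids
    token_freq: FloatTensor [vocab_size] of corpus frequencies (counts or probabilities)
    alpha:      smoothing exponent; larger alpha biases masking more toward rare tokens
    """
    freqs = token_freq[input_ids].clamp(min=1e-12)   # frequency of each token in the sequence
    weights = freqs.pow(-alpha)                       # rarer tokens get a larger sampling weight
    n_mask = max(1, int(mask_ratio * input_ids.numel()))
    # Weighted sampling without replacement over positions.
    positions = torch.multinomial(weights, n_mask, replacement=False)
    labels = torch.full_like(input_ids, -100)         # -100 = ignored by the MLM cross-entropy
    labels[positions] = input_ids[positions]
    masked = input_ids.clone()
    masked[positions] = mask_token_id
    return masked, labels

# Toy usage: vocabulary of 10 tokens, id 9 plays the role of [MASK].
token_freq = torch.tensor([1000., 800., 600., 400., 200., 100., 50., 20., 5., 0.]) + 1.0
input_ids = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2])
masked, labels = weighted_masking(input_ids, token_freq, mask_token_id=9)
print(masked, labels)
```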
Related papers
- Alleviating Over-smoothing for Unsupervised Sentence Representation [96.19497378628594]
We present a simple method named Self-Contrastive Learning (SSCL) to alleviate the over-smoothing issue.
Our proposed method is quite simple and can be easily extended to various state-of-the-art models to boost performance.
arXiv Detail & Related papers (2023-05-09T11:00:02Z)
- Mask-guided BERT for Few Shot Text Classification [12.361032727044547]
Mask-BERT is a simple and modular framework to help BERT-based architectures tackle few-shot learning.
The core idea is to selectively apply masks on text inputs and filter out irrelevant information, which guides the model to focus on discriminative tokens.
Experimental results on public-domain benchmark datasets demonstrate the effectiveness of Mask-BERT.
arXiv Detail & Related papers (2023-02-21T05:24:00Z)
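A loose illustration of the selective-masking idea in Mask-BERT: score the input tokens for task relevance and replace the low-scoring ones with [MASK], so the encoder focuses on discriminative tokens. The keyword-set salience rule below is a stand-in for illustration only; it is not the paper's mask-selection procedure.

```python
from transformers import AutoTokenizer

# Hypothetical salience set (in practice one might use attention or gradient scores).
KEYWORDS = {"terrible", "great", "refund", "broken", "love"}

def mask_uninformative(text, tokenizer, keep=KEYWORDS):
    """Replace tokens that are not in the keyword set with [MASK] before classification."""
    tokens = tokenizer.tokenize(text)
    filtered = [t if t.lstrip("#") in keep else tokenizer.mask_token for t in tokens]
    return tokenizer.convert_tokens_to_string(filtered)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(mask_uninformative("the delivery was terrible and the box arrived broken", tokenizer))
# e.g. "[MASK] [MASK] [MASK] terrible [MASK] [MASK] [MASK] [MASK] broken"
```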
- Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations, we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
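A rough sketch of the adapter-plus-fusion design described for Adapted Multimodal BERT: a small bottleneck adapter adjusts the hidden states of a (frozen) BERT layer, and a gated fusion module mixes frame-aligned audio and visual features into the textual representations at that layer. The dimensions, gating formula, and module layout are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual connection."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down, self.up, self.act = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim), nn.GELU()

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))

class LayerwiseFusion(nn.Module):
    """Gated fusion of audio/visual features into the textual hidden states of one layer."""
    def __init__(self, dim=768, audio_dim=74, visual_dim=35):
        super().__init__()
        self.proj_a = nn.Linear(audio_dim, dim)
        self.proj_v = nn.Linear(visual_dim, dim)
        self.gate = nn.Linear(3 * dim, dim)

    def forward(self, h_text, a, v):
        a, v = self.proj_a(a), self.proj_v(v)
        g = torch.sigmoid(self.gate(torch.cat([h_text, a, v], dim=-1)))
        return h_text + g * (a + v)

# Toy forward pass: batch of 2, 10 tokens, frame-aligned audio/visual features.
h = torch.randn(2, 10, 768)
audio, visual = torch.randn(2, 10, 74), torch.randn(2, 10, 35)
fused = LayerwiseFusion()(Adapter()(h), audio, visual)
print(fused.shape)  # torch.Size([2, 10, 768])
```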
- PERT: Pre-training BERT with Permuted Language Model [24.92527883997854]
PERT is an auto-encoding model (like BERT) trained with the Permuted Language Model (PerLM) objective.
We permute a proportion of the input text, and the training objective is to predict the position of the original token.
We carried out extensive experiments on both Chinese and English NLU benchmarks.
arXiv Detail & Related papers (2022-03-14T07:58:34Z)
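The PerLM objective can be made concrete with a small data-preparation sketch: select a proportion of positions, shuffle the tokens among those positions, and set the label at each selected position to the position where its original token now sits, so the model predicts positions rather than vocabulary items. The selection scheme and label convention here are assumptions for illustration.

```python
import random

def make_perlm_example(tokens, permute_ratio=0.15, seed=0):
    """Shuffle a proportion of token positions; the label at a disturbed position is the
    index where its original token ended up (a position, not a vocabulary id)."""
    rng = random.Random(seed)
    n = len(tokens)
    k = max(2, int(n * permute_ratio))          # need at least 2 positions to permute
    positions = sorted(rng.sample(range(n), k))
    shuffled = positions[:]
    while shuffled == positions:                # make sure something actually moves
        rng.shuffle(shuffled)
    permuted = list(tokens)
    labels = [-100] * n                          # -100 = position not used in the loss
    for src, dst in zip(positions, shuffled):
        permuted[dst] = tokens[src]              # the token from position src now sits at dst
        labels[src] = dst                        # target: the original token of src is found at dst
    return permuted, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
permuted, labels = make_perlm_example(tokens)
print(permuted)
print(labels)
```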
- BERT for Sentiment Analysis: Pre-trained and Fine-Tuned Alternatives [0.0]
BERT has revolutionized the NLP field by enabling transfer learning with large language models.
This article studies how to better handle the different embeddings produced by the BERT output layer, and whether language-specific models should be used instead of multilingual ones.
arXiv Detail & Related papers (2022-01-10T15:05:05Z)
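One concrete instance of "coping with the different embeddings provided by the BERT output layer" is the choice of sentence representation fed to the classifier, e.g. the [CLS] vector versus mean pooling over the attention-masked token embeddings. The snippet below only contrasts these two common pooling choices; it is not the article's experimental setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["great phone", "battery died after a week"],
                  padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state            # [batch, seq_len, 768]

cls_embedding = hidden[:, 0]                              # option 1: the [CLS] vector
mask = batch["attention_mask"].unsqueeze(-1).float()      # option 2: mean over real tokens only
mean_embedding = (hidden * mask).sum(1) / mask.sum(1)
print(cls_embedding.shape, mean_embedding.shape)          # both torch.Size([2, 768])
```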
- TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning [19.682704309037653]
Masked language models (MLMs) have revolutionized the field of Natural Language Understanding.
We propose TaCL (Token-aware Contrastive Learning), a novel continual pre-training approach that encourages BERT to learn an isotropic and discriminative distribution of token representations.
arXiv Detail & Related papers (2021-11-07T22:54:23Z)
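The summary above is abstract, so the following is only a generic token-level contrastive loss in the spirit of "isotropic and discriminative token representations": each token vector from one view of a sentence is pulled toward its counterpart from a second view and pushed away from the other tokens in the sequence. The two-view setup, temperature, and InfoNCE form are assumptions; TaCL's actual teacher-student formulation is not reproduced here.

```python
import torch
import torch.nn.functional as F

def token_contrastive_loss(h_a, h_b, temperature=0.07):
    """InfoNCE over tokens: h_a[i] should match h_b[i] and not h_b[j] for j != i.

    h_a, h_b: [seq_len, dim] token representations of the same sentence under two views.
    """
    h_a = F.normalize(h_a, dim=-1)
    h_b = F.normalize(h_b, dim=-1)
    logits = h_a @ h_b.t() / temperature      # [seq_len, seq_len] token-to-token similarities
    targets = torch.arange(h_a.size(0))       # diagonal entries are the positive pairs
    return F.cross_entropy(logits, targets)

# Toy check: two noisy views of the same random token representations.
base = torch.randn(12, 768)
loss = token_contrastive_loss(base + 0.01 * torch.randn_like(base),
                              base + 0.01 * torch.randn_like(base))
print(loss.item())
```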
- Frustratingly Simple Pretraining Alternatives to Masked Language Modeling [10.732163031244651]
Masked language modeling (MLM) is widely used in natural language processing for learning text representations.
In this paper, we explore five simple pretraining objectives based on token-level classification tasks as replacements for MLM.
arXiv Detail & Related papers (2021-09-04T08:52:37Z)
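As a flavour of what a token-level classification objective replacing MLM can look like, the sketch below puts a per-token classifier head on an encoder and trains it to detect which tokens were shuffled. The shuffled-token objective, the embedding-layer encoder stub, and the dimensions are illustrative assumptions rather than a faithful rendering of the paper's five objectives.

```python
import torch
import torch.nn as nn

class TokenClassificationObjective(nn.Module):
    """Per-token classifier head on top of an encoder; here used to detect shuffled tokens."""
    def __init__(self, encoder, dim=768, num_labels=2):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(dim, num_labels)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids, labels):
        hidden = self.encoder(input_ids)                  # [batch, seq_len, dim]
        logits = self.head(hidden)                        # [batch, seq_len, num_labels]
        return self.loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))

# Stand-in encoder (an embedding layer) so the sketch runs end to end.
objective = TokenClassificationObjective(nn.Embedding(30522, 768))
input_ids = torch.randint(0, 30522, (2, 16))
labels = torch.randint(0, 2, (2, 16))                     # 1 = this token was shuffled
print(objective(input_ids, labels).item())
```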
- TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference [54.791572981834435]
Existing pre-trained language models (PLMs) are often computationally expensive in inference.
We propose a dynamic token reduction approach to accelerate PLMs' inference, named TR-BERT.
TR-BERT formulates the token reduction process as a multi-step token selection problem and automatically learns the selection strategy via reinforcement learning.
arXiv Detail & Related papers (2021-05-25T02:28:51Z)
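TR-BERT learns its selection strategy with reinforcement learning, which does not fit a short snippet; the sketch below only shows the effect of one selection step at inference time: score the current hidden states, keep the top-k tokens, and hand the reduced sequence to the remaining layers. The linear scorer and the fixed keep ratio are simplifying assumptions.

```python
import torch
import torch.nn as nn

class TokenSelector(nn.Module):
    """One token-reduction step: score tokens and keep only the top-k for later layers."""
    def __init__(self, dim=768, keep_ratio=0.5):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)
        self.keep_ratio = keep_ratio

    def forward(self, hidden):                     # hidden: [batch, seq_len, dim]
        scores = self.scorer(hidden).squeeze(-1)   # [batch, seq_len] importance scores
        k = max(1, int(hidden.size(1) * self.keep_ratio))
        keep = scores.topk(k, dim=1).indices.sort(dim=1).values   # preserve original order
        return torch.gather(hidden, 1, keep.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))

hidden = torch.randn(2, 128, 768)                  # hidden states after some BERT layer
reduced = TokenSelector()(hidden)
print(reduced.shape)                               # torch.Size([2, 64, 768]): half the tokens remain
```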
- BERT-ATTACK: Adversarial Attack Against BERT Using BERT [77.82947768158132]
Adversarial attacks for discrete data (such as text) are more challenging than those for continuous data (such as images).
We propose BERT-Attack, a high-quality and effective method to generate adversarial samples.
Our method outperforms state-of-the-art attack strategies in both success rate and perturb percentage.
arXiv Detail & Related papers (2020-04-21T13:30:02Z)
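A heavily simplified version of the MLM-based substitution idea behind BERT-Attack: mask one word at a time, let a masked language model propose replacements, and accept the first replacement that flips the victim classifier's label. The word-by-word loop, the default pipelines, and the stopping rule are assumptions for illustration; the actual method also ranks word importance and constrains the perturbation.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
victim = pipeline("sentiment-analysis")            # a default sentiment classifier as the victim

def attack_one_word(sentence):
    """Return an adversarial sentence that flips the victim's label, or None."""
    original = victim(sentence)[0]["label"]
    words = sentence.split()
    for i in range(len(words)):
        masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
        for cand in fill_mask(masked, top_k=5):    # the MLM proposes plausible substitutes
            adv = " ".join(words[:i] + [cand["token_str"]] + words[i + 1:])
            if victim(adv)[0]["label"] != original:
                return adv                          # first substitution that flips the label
    return None

print(attack_one_word("the acting was wonderful and the story kept me hooked"))
```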
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
arXiv Detail & Related papers (2020-03-23T21:17:42Z)
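The replaced-token-detection signal can be sketched as follows: a small generator fills the masked positions, and the discriminator gets a per-token binary label saying whether the token it sees differs from the original. The toy generator and discriminator stubs below are assumptions; only the shape of the training signal follows the description above.

```python
import torch
import torch.nn as nn

vocab, dim, seq = 30522, 256, 16
generator = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))   # tiny stand-in MLM
disc = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, 1))            # per-token original/replaced

input_ids = torch.randint(0, vocab, (1, seq))
mask_positions = torch.rand(1, seq) < 0.15                 # ~15% of positions get corrupted

# The generator samples replacements at the masked positions.
with torch.no_grad():
    gen_logits = generator(input_ids)                      # [1, seq, vocab]
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
corrupted = torch.where(mask_positions, sampled, input_ids)

# Discriminator objective: one binary label per token, 1 if the token differs from the original.
labels = (corrupted != input_ids).float()
logits = disc(corrupted).squeeze(-1)                       # [1, seq]
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
print(loss.item())
```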
- Incorporating BERT into Neural Machine Translation [251.54280200353674]
We propose a new algorithm named BERT-fused model, in which we first use BERT to extract representations for an input sequence.
We conduct experiments on supervised (including sentence-level and document-level translations), semi-supervised and unsupervised machine translation, and achieve state-of-the-art results on seven benchmark datasets.
arXiv Detail & Related papers (2020-02-17T08:13:36Z)
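As summarized, the BERT-fused model gives the translation network an extra attention path into BERT's representation of the source sentence. The sketch below shows one encoder layer of that shape: self-attention over the NMT states plus cross-attention into fixed BERT outputs, with the two paths averaged. The dimensions, the averaging rule, and the layer layout are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class BertFusedEncoderLayer(nn.Module):
    """NMT encoder layer that also attends to precomputed BERT representations."""
    def __init__(self, dim=512, bert_dim=768, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.bert_attn = nn.MultiheadAttention(dim, heads, kdim=bert_dim, vdim=bert_dim,
                                               batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, bert_out):
        self_out, _ = self.self_attn(x, x, x)
        bert_fused, _ = self.bert_attn(x, bert_out, bert_out)   # attend into BERT's representation
        x = self.norm1(x + 0.5 * (self_out + bert_fused))       # average the two attention paths
        return self.norm2(x + self.ffn(x))

x = torch.randn(2, 20, 512)          # NMT encoder states for the source sentence
bert_out = torch.randn(2, 24, 768)   # BERT output for the same sentence (its own tokenization)
print(BertFusedEncoderLayer()(x, bert_out).shape)   # torch.Size([2, 20, 512])
```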
This list is automatically generated from the titles and abstracts of the papers on this site.