Variance-reduced Language Pretraining via a Mask Proposal Network
- URL: http://arxiv.org/abs/2008.05333v2
- Date: Sun, 16 Aug 2020 15:40:33 GMT
- Title: Variance-reduced Language Pretraining via a Mask Proposal Network
- Authors: Liang Chen
- Abstract summary: Self-supervised learning, a.k.a. pretraining, is important in natural language processing.
In this paper, we tackle the problem from the view of gradient variance reduction.
To improve efficiency, we introduce a MAsk Proposal Network (MAPNet), which approximates the optimal mask proposal distribution.
- Score: 5.819397109258169
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning, a.k.a. pretraining, is important in natural
language processing. Most pretraining methods first randomly mask some positions
in a sentence and then train a model to recover the tokens at the masked positions.
In this way, the model can be trained without human labeling, and massive amounts
of data can be used to train models with billions of parameters. Optimization
efficiency therefore becomes critical. In this paper, we tackle the problem from
the perspective of gradient variance reduction. In particular, we first propose a
principled gradient variance decomposition theorem, which shows that the variance
of the stochastic gradient in language pretraining can be naturally decomposed
into two terms: the variance that arises from sampling the data in a batch, and
the variance that arises from sampling the mask. The second term is the key
difference between self-supervised learning and supervised learning, and it is
what makes pretraining slower. To reduce this second term, we leverage an
importance sampling strategy, which samples masks according to a proposal
distribution instead of the uniform distribution. It can be shown that if the
proposal distribution is proportional to the gradient norm, the sampling variance
is reduced. To improve efficiency, we introduce a MAsk Proposal Network (MAPNet),
which approximates the optimal mask proposal distribution and is trained
end-to-end along with the model. According to the experimental results, our model
converges much faster and achieves higher performance than the baseline BERT model.
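The decomposition described in the abstract matches the law of total variance over the data sample x and the mask sample m; the paper's exact statement may differ in form, but one natural reading is:

```latex
% One natural reading of the decomposition (law of total variance); x is a sampled
% sentence, m a sampled mask, and g(x, m) the stochastic gradient.
\[
\operatorname{Var}_{x,m}\bigl[g(x,m)\bigr]
  = \underbrace{\operatorname{Var}_{x}\bigl[\mathbb{E}_{m}\,g(x,m)\bigr]}_{\text{data sampling}}
  + \underbrace{\mathbb{E}_{x}\bigl[\operatorname{Var}_{m}\,g(x,m)\bigr]}_{\text{mask sampling}}
\]
```

The importance-sampling claim can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: the per-position gradient contributions g_i are synthetic scalars standing in for the true per-mask gradients, and a single position is sampled per trial instead of a full mask pattern.

```python
# Minimal sketch (assumptions: scalar per-position gradients, one position per trial)
# illustrating that sampling mask positions with probability proportional to the
# gradient norm reduces estimator variance compared with uniform sampling.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-position gradient contributions (heavy-tailed, as gradients often are).
g = rng.lognormal(mean=0.0, sigma=1.5, size=128)
full_gradient = g.mean()  # the "true" gradient if every position were scored

def importance_estimates(proposal, n_trials=20_000):
    """Importance-sampled estimates of mean(g): draw a position i ~ proposal and
    weight its contribution by 1 / (N * proposal[i]), which keeps the estimator
    unbiased for any strictly positive proposal distribution."""
    N = len(g)
    idx = rng.choice(N, size=n_trials, p=proposal)
    return g[idx] / (N * proposal[idx])

uniform = np.full(len(g), 1.0 / len(g))
grad_norm = np.abs(g) / np.abs(g).sum()  # proposal proportional to the gradient norm

for name, p in [("uniform", uniform), ("grad-norm", grad_norm)]:
    est = importance_estimates(p)
    print(f"{name:>9}: mean={est.mean():.4f} (target {full_gradient:.4f}), "
          f"variance={est.var():.6f}")
```

With the reweighting, both proposals give an unbiased estimate of the full gradient, but the gradient-norm proposal drives the mask-sampling variance down (to zero in this scalar toy case). Per the abstract, MAPNet's role is to approximate this optimal proposal cheaply and learn it end-to-end with the model.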
Related papers
- DistPred: A Distribution-Free Probabilistic Inference Method for Regression and Forecasting [14.390842560217743]
We propose a novel approach called DistPred for regression and forecasting tasks.
We transform proper scoring rules that measure the discrepancy between the predicted distribution and the target distribution into a differentiable discrete form.
This allows the model to sample numerous samples in a single forward pass to estimate the potential distribution of the response variable.
arXiv Detail & Related papers (2024-06-17T10:33:00Z)
- Rejection via Learning Density Ratios [50.91522897152437]
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions.
We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance.
Our framework is tested empirically over clean and noisy datasets.
arXiv Detail & Related papers (2024-05-29T01:32:17Z)
- TransFusion: Covariate-Shift Robust Transfer Learning for High-Dimensional Regression [11.040033344386366]
We propose a two-step method with a novel fused-regularizer to improve the learning performance on a target task with limited samples.
A nonasymptotic bound is provided for the estimation error of the target model.
We extend the method to a distributed setting, allowing for a pretraining-finetuning strategy.
arXiv Detail & Related papers (2024-04-01T14:58:16Z)
- Probabilistic Contrastive Learning for Long-Tailed Visual Recognition [78.70453964041718]
Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples.
Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance.
We propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space.
arXiv Detail & Related papers (2024-03-11T13:44:49Z)
- Learning Distributions via Monte-Carlo Marginalization [9.131712404284876]
We propose Monte-Carlo Marginalization (MCMarg), a novel method to learn intractable distributions from their samples.
The proposed approach is a powerful tool to learn complex distributions and the entire process is differentiable.
arXiv Detail & Related papers (2023-08-11T19:08:06Z)
- Distribution Mismatch Correction for Improved Robustness in Deep Neural Networks [86.42889611784855]
Normalization methods increase vulnerability to noise and input corruptions.
We propose an unsupervised non-parametric distribution correction method that adapts the activation distribution of each layer.
In our experiments, we empirically show that the proposed method effectively reduces the impact of intense image corruptions.
arXiv Detail & Related papers (2021-10-05T11:36:25Z)
- Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model [57.77981008219654]
The Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training.
We propose a fully-explored masking strategy, where a text sequence is divided into a certain number of non-overlapping segments.
arXiv Detail & Related papers (2020-10-12T21:28:14Z)
- What causes the test error? Going beyond bias-variance via ANOVA [21.359033212191218]
Modern machine learning methods are often overparametrized, allowing adaptation to the data at a fine level.
Recent work aimed to understand in greater depth why overparametrization is helpful for generalization.
We propose using the analysis of variance (ANOVA) to decompose the variance in the test error in a symmetric way.
arXiv Detail & Related papers (2020-10-11T05:21:13Z)
- A One-step Approach to Covariate Shift Adaptation [82.01909503235385]
A default assumption in many machine learning scenarios is that the training and test samples are drawn from the same probability distribution.
We propose a novel one-step approach that jointly learns the predictive model and the associated weights in one optimization.
arXiv Detail & Related papers (2020-07-08T11:35:47Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.