Variance-reduced Language Pretraining via a Mask Proposal Network
- URL: http://arxiv.org/abs/2008.05333v2
- Date: Sun, 16 Aug 2020 15:40:33 GMT
- Title: Variance-reduced Language Pretraining via a Mask Proposal Network
- Authors: Liang Chen
- Abstract summary: Self-supervised learning, a.k.a. pretraining, is important in natural language processing.
In this paper, we tackle the problem from the view of gradient variance reduction.
To improve efficiency, we introduce a MAsk Proposal Network (MAPNet), which approximates the optimal mask proposal distribution.
- Score: 5.819397109258169
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning, a.k.a. pretraining, is important in natural
language processing. Most pretraining methods first randomly mask some positions
in a sentence and then train a model to recover the tokens at the masked positions.
In this way, the model can be trained without human labeling, and massive amounts
of data can be used to train models with billions of parameters. Optimization
efficiency therefore becomes critical. In this paper, we tackle the problem from
the perspective of gradient variance reduction. In particular, we first propose a
principled gradient variance decomposition theorem, which shows that the variance
of the stochastic gradient in language pretraining can be naturally decomposed
into two terms: the variance that arises from sampling the data in a batch, and
the variance that arises from sampling the mask. The second term is the key
difference between self-supervised learning and supervised learning, and it is
what makes pretraining slower. To reduce this second term, we leverage an
importance sampling strategy, which samples masks according to a proposal
distribution instead of the uniform distribution. It can be shown that if the
proposal distribution is proportional to the gradient norm, the sampling variance
is reduced. To improve efficiency, we introduce a MAsk Proposal Network (MAPNet),
which approximates the optimal mask proposal distribution and is trained
end-to-end along with the model. According to the experimental results, our model
converges much faster and achieves higher performance than the baseline BERT model.
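The decomposition described in the abstract matches the law of total variance over the data sample x and the mask sample m; the paper's exact statement may differ in form, but one natural reading is:

```latex
% One natural reading of the decomposition (law of total variance); x is a sampled
% sentence, m a sampled mask, and g(x, m) the stochastic gradient.
\[
\operatorname{Var}_{x,m}\bigl[g(x,m)\bigr]
  = \underbrace{\operatorname{Var}_{x}\bigl[\mathbb{E}_{m}\,g(x,m)\bigr]}_{\text{data sampling}}
  + \underbrace{\mathbb{E}_{x}\bigl[\operatorname{Var}_{m}\,g(x,m)\bigr]}_{\text{mask sampling}}
\]
```

The importance-sampling claim can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: the per-position gradient contributions g_i are synthetic scalars standing in for the true per-mask gradients, and a single position is sampled per trial instead of a full mask pattern.

```python
# Minimal sketch (assumptions: scalar per-position gradients, one position per trial)
# illustrating that sampling mask positions with probability proportional to the
# gradient norm reduces estimator variance compared with uniform sampling.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-position gradient contributions (heavy-tailed, as gradients often are).
g = rng.lognormal(mean=0.0, sigma=1.5, size=128)
full_gradient = g.mean()  # the "true" gradient if every position were scored

def importance_estimates(proposal, n_trials=20_000):
    """Importance-sampled estimates of mean(g): draw a position i ~ proposal and
    weight its contribution by 1 / (N * proposal[i]), which keeps the estimator
    unbiased for any strictly positive proposal distribution."""
    N = len(g)
    idx = rng.choice(N, size=n_trials, p=proposal)
    return g[idx] / (N * proposal[idx])

uniform = np.full(len(g), 1.0 / len(g))
grad_norm = np.abs(g) / np.abs(g).sum()  # proposal proportional to the gradient norm

for name, p in [("uniform", uniform), ("grad-norm", grad_norm)]:
    est = importance_estimates(p)
    print(f"{name:>9}: mean={est.mean():.4f} (target {full_gradient:.4f}), "
          f"variance={est.var():.6f}")
```

With the reweighting, both proposals give an unbiased estimate of the full gradient, but the gradient-norm proposal drives the mask-sampling variance down (to zero in this scalar toy case). Per the abstract, MAPNet's role is to approximate this optimal proposal cheaply and learn it end-to-end with the model.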
Related papers
- DistPred: A Distribution-Free Probabilistic Inference Method for Regression and Forecasting [14.390842560217743]
We propose a novel approach called DistPred for regression and forecasting tasks.
We transform proper scoring rules that measure the discrepancy between the predicted distribution and the target distribution into a differentiable discrete form.
This allows the model to sample numerous samples in a single forward pass to estimate the potential distribution of the response variable.
arXiv Detail & Related papers (2024-06-17T10:33:00Z)
- Rejection via Learning Density Ratios [50.91522897152437]
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions.
We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance.
Our framework is tested empirically over clean and noisy datasets.
arXiv Detail & Related papers (2024-05-29T01:32:17Z)
- TransFusion: Covariate-Shift Robust Transfer Learning for High-Dimensional Regression [11.040033344386366]
We propose a two-step method with a novel fused-regularizer to improve the learning performance on a target task with limited samples.
A nonasymptotic bound is provided for the estimation error of the target model.
We extend the method to a distributed setting, allowing for a pretraining-finetuning strategy.
arXiv Detail & Related papers (2024-04-01T14:58:16Z)
- Probabilistic Contrastive Learning for Long-Tailed Visual Recognition [78.70453964041718]
Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples.
Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance.
We propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space.
arXiv Detail & Related papers (2024-03-11T13:44:49Z)
- Learning Distributions via Monte-Carlo Marginalization [9.131712404284876]
We propose Monte-Carlo Marginalization (MCMarg), a novel method to learn intractable distributions from their samples.
The proposed approach is a powerful tool to learn complex distributions and the entire process is differentiable.
arXiv Detail & Related papers (2023-08-11T19:08:06Z)
- Distribution Mismatch Correction for Improved Robustness in Deep Neural Networks [86.42889611784855]
Normalization methods increase vulnerability to noise and input corruptions.
We propose an unsupervised non-parametric distribution correction method that adapts the activation distribution of each layer.
In our experiments, we empirically show that the proposed method effectively reduces the impact of intense image corruptions.
arXiv Detail & Related papers (2021-10-05T11:36:25Z)
- Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model [57.77981008219654]
The Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training.
We propose a fully-explored masking strategy, where a text sequence is divided into a certain number of non-overlapping segments.
arXiv Detail & Related papers (2020-10-12T21:28:14Z)
- What causes the test error? Going beyond bias-variance via ANOVA [21.359033212191218]
Modern machine learning methods are often overparametrized, allowing adaptation to the data at a fine level.
Recent work aimed to understand in greater depth why overparametrization is helpful for generalization.
We propose using the analysis of variance (ANOVA) to decompose the variance in the test error in a symmetric way.
arXiv Detail & Related papers (2020-10-11T05:21:13Z)
- A One-step Approach to Covariate Shift Adaptation [82.01909503235385]
A default assumption in many machine learning scenarios is that the training and test samples are drawn from the same probability distribution.
We propose a novel one-step approach that jointly learns the predictive model and the associated weights in one optimization.
arXiv Detail & Related papers (2020-07-08T11:35:47Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.