On Sampling-Based Training Criteria for Neural Language Modeling
- URL: http://arxiv.org/abs/2104.10507v1
- Date: Wed, 21 Apr 2021 12:55:52 GMT
- Title: On Sampling-Based Training Criteria for Neural Language Modeling
- Authors: Yingbo Gao, David Thulke, Alexander Gerstenberger, Khoa Viet Tran,
Ralf Schlüter, Hermann Ney
- Abstract summary: We consider Monte Carlo sampling, importance sampling, a novel method we call compensated partial summation, and noise contrastive estimation.
We show that all these sampling methods can perform equally well, as long as we correct for the intended class posterior probabilities.
Experimental results in language modeling and automatic speech recognition on Switchboard and LibriSpeech support our claim.
- Score: 97.35284042981675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the vocabulary size of modern word-based language models becomes ever
larger, many sampling-based training criteria have been proposed and investigated.
The essence of these sampling methods is that the softmax-related traversal
over the entire vocabulary can be simplified, giving speedups compared to the
baseline. A problem we notice about the current landscape of such sampling
methods is the lack of a systematic comparison and some myths about preferring
one over another. In this work, we consider Monte Carlo sampling, importance
sampling, a novel method we call compensated partial summation, and noise
contrastive estimation. Linking back to the three traditional criteria, namely
mean squared error, binary cross-entropy, and cross-entropy, we derive the
theoretical solutions to the training problems. Contrary to common belief,
we show that all these sampling methods can perform equally well, as long as we
correct for the intended class posterior probabilities. Experimental results in
language modeling and automatic speech recognition on Switchboard and
LibriSpeech support our claim, with all sampling-based methods showing similar
perplexities and word error rates while giving the expected speedups.
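To make the mechanism shared by these criteria concrete, below is a minimal, hypothetical PyTorch sketch of one sampling-based criterion in the spirit of importance sampling (a sampled softmax): only the target class and a small set of classes drawn from a noise distribution q are scored, and every logit is corrected by subtracting log q so that the criterion still targets the intended class posterior. This is an illustrative sketch under assumed names (hidden_dim, num_samples, the uniform placeholder for q), not the authors' implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, hidden_dim, batch_size, num_samples = 10_000, 128, 32, 64

# Output projection of the language model: one weight vector and bias per word.
out_emb = torch.nn.Embedding(vocab_size, hidden_dim)
out_bias = torch.nn.Parameter(torch.zeros(vocab_size))

# Noise distribution q over the vocabulary (e.g. unigram); uniform placeholder here.
q = torch.full((vocab_size,), 1.0 / vocab_size)

def sampled_softmax_loss(hidden, targets):
    """hidden: (B, H) context vectors, targets: (B,) gold word ids."""
    # Draw K noise classes from q, shared across the batch.
    noise = torch.multinomial(q, num_samples, replacement=True)            # (K,)
    # Score only the gold word and the K noise words instead of all V words.
    tgt_logit = (hidden * out_emb(targets)).sum(-1) + out_bias[targets]    # (B,)
    noise_logit = hidden @ out_emb(noise).t() + out_bias[noise]            # (B, K)
    # Correction: subtracting log q(.) keeps the criterion consistent with
    # the intended class posterior p(word | history).
    tgt_logit = tgt_logit - torch.log(q[targets])
    noise_logit = noise_logit - torch.log(q[noise])
    # Cross-entropy over the reduced class set {target} plus noise, target at index 0.
    logits = torch.cat([tgt_logit.unsqueeze(1), noise_logit], dim=1)       # (B, 1+K)
    return F.cross_entropy(logits, torch.zeros(len(targets), dtype=torch.long))

hidden = torch.randn(batch_size, hidden_dim)
targets = torch.randint(vocab_size, (batch_size,))
print(sampled_softmax_loss(hidden, targets).item())
```

The log q subtraction is the kind of correction the abstract refers to: without it, the reduced softmax would no longer estimate the intended class posterior.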
Related papers
- Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple
Logits Retargeting Approach [102.0769560460338]
We develop a simple logits retargeting approach (LORT) that does not require prior knowledge of the number of samples per class.
Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z)
- Unsupervised Out-of-Distribution Dialect Detection with Mahalanobis Distance [6.358196724648596]
A deployed dialect classification model can encounter anomalous inputs that differ from the training data distribution.
Out-of-distribution detection is a new research area that has received little attention in the context of dialect classification.
We propose a simple yet effective unsupervised Mahalanobis distance feature-based method to detect out-of-distribution samples.
arXiv Detail & Related papers (2023-08-09T11:33:53Z)
- Can Diffusion Model Achieve Better Performance in Text Generation? Bridging the Gap between Training and Inference! [14.979893207094221]
Diffusion models have been successfully adapted to text generation tasks by mapping the discrete text into the continuous space.
There exist nonnegligible gaps between training and inference, owing to the absence of the forward process during inference.
We propose two simple yet effective methods to bridge the gaps mentioned above, named Distance Penalty and Adaptive Decay Sampling.
arXiv Detail & Related papers (2023-05-08T05:32:22Z)
- Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models [95.97506031821217]
We present a novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training.
The method requires a short (3 seconds) sample from the target person, and generation is steered at inference time, without any training steps.
arXiv Detail & Related papers (2022-06-05T19:45:29Z)
- Probing BERT's priors with serial reproduction chains [8.250374560598493]
We use serial reproduction chains to probe BERT's priors.
A unique and consistent estimator of the ground-truth joint distribution may be obtained.
We compare the lexical and syntactic statistics of sentences from the resulting prior distribution against those of the ground-truth corpus distribution.
arXiv Detail & Related papers (2022-02-24T17:42:28Z)
- Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered here are self-normalized, so no further correction step is needed.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
arXiv Detail & Related papers (2021-11-11T16:57:53Z)
- Empowering Language Understanding with Counterfactual Reasoning [141.48592718583245]
We propose a Counterfactual Reasoning Model, which mimics the counterfactual thinking by learning from few counterfactual samples.
In particular, we devise a generation module to generate representative counterfactual samples for each factual sample, and a retrospective module to retrospect the model prediction by comparing the counterfactual and factual samples.
arXiv Detail & Related papers (2021-06-06T06:36:52Z)
- Jo-SRC: A Contrastive Approach for Combating Noisy Labels [58.867237220886885]
We propose a noise-robust approach named Jo-SRC (Joint Sample Selection and Model Regularization based on Consistency).
Specifically, we train the network in a contrastive learning manner. Predictions from two different views of each sample are used to estimate its "likelihood" of being clean or out-of-distribution.
arXiv Detail & Related papers (2021-03-24T07:26:07Z)
- $k$-Neighbor Based Curriculum Sampling for Sequence Prediction [22.631763991832862]
Multi-step ahead prediction in language models is challenging due to the discrepancy between training and test-time processes.
We propose Nearest-Neighbor Replacement Sampling, a curriculum learning-based method that gradually changes an initially deterministic teacher policy.
We evaluate on two language modelling benchmarks and find that the proposed method further improves performance when used in conjunction with scheduled sampling.
arXiv Detail & Related papers (2021-01-22T20:07:29Z)