Characterizing and addressing the issue of oversmoothing in neural
autoregressive sequence modeling
- URL: http://arxiv.org/abs/2112.08914v1
- Date: Thu, 16 Dec 2021 14:33:12 GMT
- Title: Characterizing and addressing the issue of oversmoothing in neural
autoregressive sequence modeling
- Authors: Ilia Kulikov, Maksim Eremeev, Kyunghyun Cho
- Abstract summary: We study the effect of the proposed regularization on both model distribution and decoding performance.
We conclude that the high degree of oversmoothing is the main reason behind the case of overly probable short sequences in a neural autoregressive model.
- Score: 49.06391831200667
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural autoregressive sequence models smear the probability among many
possible sequences including degenerate ones, such as empty or repetitive
sequences. In this work, we tackle one specific case where the model assigns a
high probability to unreasonably short sequences. We define the oversmoothing
rate to quantify this issue. After confirming the high degree of oversmoothing
in neural machine translation, we propose to explicitly minimize the
oversmoothing rate during training. We conduct a set of experiments to study
the effect of the proposed regularization on both model distribution and
decoding performance. We use a neural machine translation task as the testbed
and consider three different datasets of varying size. Our experiments reveal
three major findings. First, we can control the oversmoothing rate of the model
by tuning the strength of the regularization. Second, by enhancing the
oversmoothing loss contribution, the probability and the rank of the <eos> token
decrease sharply at positions where it should not appear. Third, the
proposed regularization impacts the outcome of beam search especially when a
large beam is used. The degradation of translation quality (measured in BLEU)
with a large beam lessens significantly at lower oversmoothing rates, but some
degradation relative to smaller beam sizes persists. From these
observations, we conclude that the high degree of oversmoothing is the main
reason behind the degenerate case of overly probable short sequences in a
neural autoregressive model.
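To make the proposed quantity concrete, below is a minimal sketch of how a per-sequence oversmoothing rate and a matching margin-based regularizer could be computed from token-level log-probabilities. The tensor shapes, the hinge form of the loss, the margin value, and the function names are illustrative assumptions, not the authors' reference implementation.

```python
# Hedged sketch: oversmoothing rate and a hinge-style surrogate loss, computed
# from gold-token log-probabilities. Shapes, names, and the margin value are
# assumptions for illustration; see the paper for the exact definitions.
import torch

def suffix_logprob(logp_tokens: torch.Tensor) -> torch.Tensor:
    # suffix_logprob[t] = log p(y_t, ..., y_T | y_<t) under the chain rule,
    # i.e., the reversed cumulative sum of per-token log-probabilities.
    return torch.flip(torch.cumsum(torch.flip(logp_tokens, [0]), 0), [0])

def oversmoothing_rate(logp_tokens: torch.Tensor, logp_eos: torch.Tensor) -> torch.Tensor:
    """logp_tokens: (T,) log-probabilities of the gold tokens (the final one is <eos>).
    logp_eos:    (T,) log p(<eos> | prefix) at each position.
    A position t < T counts as oversmoothed when terminating the sequence there
    is more probable than generating the remaining gold suffix."""
    T = logp_tokens.size(0)
    suffix = suffix_logprob(logp_tokens)
    oversmoothed = logp_eos[: T - 1] > suffix[: T - 1]
    return oversmoothed.float().mean()

def oversmoothing_loss(logp_tokens: torch.Tensor, logp_eos: torch.Tensor,
                       margin: float = 1e-4) -> torch.Tensor:
    # Penalize positions where <eos> outscores the gold suffix by more than
    # the margin; minimizing this pushes the oversmoothing rate down.
    T = logp_tokens.size(0)
    suffix = suffix_logprob(logp_tokens)
    violation = torch.clamp(margin + logp_eos[: T - 1] - suffix[: T - 1], min=0.0)
    return violation.mean()
```

In training, the regularized objective would then be the usual negative log-likelihood plus a weighted oversmoothing_loss term, the weight playing the role of the regularization strength that, per the first finding above, controls the oversmoothing rate.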
Related papers
- The Surprising Harmfulness of Benign Overfitting for Adversarial
Robustness [13.120373493503772]
We prove a surprising result: even if the ground truth itself is robust to adversarial examples and the benignly overfitted model is benign in terms of the "standard" out-of-sample risk objective, benign overfitting can still be harmful to adversarial robustness.
Our finding provides theoretical insight into a puzzling phenomenon observed in practice, where the true target function (e.g., a human) is robust against adversarial attacks, while benignly overfitted neural networks yield models that are not robust.
arXiv Detail & Related papers (2024-01-19T15:40:46Z) - LARA: A Light and Anti-overfitting Retraining Approach for Unsupervised
Time Series Anomaly Detection [49.52429991848581]
We propose a Light and Anti-overfitting Retraining Approach (LARA) for deep variational auto-encoder (VAE) based time series anomaly detection methods.
This work makes three novel contributions: 1) the retraining process is formulated as a convex problem, so it converges at a fast rate and prevents overfitting; 2) a ruminate block is designed that leverages historical data without the need to store it; and 3) it is proven mathematically that, when fine-tuning the latent vector and the reconstructed data, linear formations achieve the least adjusting errors between the ground truths and the fine-tuned ones.
arXiv Detail & Related papers (2023-10-09T12:36:16Z) - Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures [93.17009514112702]
Pruning, setting a significant subset of the parameters of a neural network to zero, is one of the most popular methods of model compression.
Despite existing evidence for this phenomenon, the relationship between neural network pruning and induced bias is not well-understood.
arXiv Detail & Related papers (2023-04-25T07:42:06Z) - Cross-Model Comparative Loss for Enhancing Neuronal Utility in Language
Understanding [82.46024259137823]
We propose a cross-model comparative loss for a broad range of tasks.
We demonstrate the universal effectiveness of comparative loss through extensive experiments on 14 datasets from 3 distinct NLU tasks.
arXiv Detail & Related papers (2023-01-10T03:04:27Z) - On the Role of Optimization in Double Descent: A Least Squares Study [30.44215064390409]
We show an excess risk bound for the gradient descent solution of the least squares objective.
We find that in case of noiseless regression, double descent is explained solely by optimization-related quantities.
We empirically explore if our predictions hold for neural networks.
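As a quick, self-contained illustration of the double descent phenomenon studied here, the following sketch fits minimum-norm least squares with a growing number of features; the data-generating process and problem sizes are arbitrary assumptions. Test error typically peaks near the interpolation threshold (number of features close to the number of training samples) and descends again beyond it.

```python
# Hedged sketch: double descent in min-norm least squares on synthetic data.
# All sizes and the noise level are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d_max = 100, 1000, 300
w_true = rng.normal(size=d_max) / np.sqrt(d_max)

X_train = rng.normal(size=(n_train, d_max))
X_test = rng.normal(size=(n_test, d_max))
y_train = X_train @ w_true + 0.1 * rng.normal(size=n_train)
y_test = X_test @ w_true

for d in (20, 50, 90, 100, 110, 150, 300):
    # Use only the first d features; pinv yields the minimum-norm solution,
    # which interpolates the training data once d >= n_train.
    w_hat = np.linalg.pinv(X_train[:, :d]) @ y_train
    mse = np.mean((X_test[:, :d] @ w_hat - y_test) ** 2)
    print(f"d={d:4d}  test MSE={mse:.3f}")
```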
arXiv Detail & Related papers (2021-07-27T09:13:11Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Efficient Causal Inference from Combined Observational and
Interventional Data through Causal Reductions [68.6505592770171]
Unobserved confounding is one of the main challenges when estimating causal effects.
We propose a novel causal reduction method that replaces an arbitrary number of possibly high-dimensional latent confounders.
We propose a learning algorithm to estimate the parameterized reduced model jointly from observational and interventional data.
arXiv Detail & Related papers (2021-03-08T14:29:07Z) - Point process models for sequence detection in high-dimensional neural
spike trains [29.073129195368235]
We develop a point process model that characterizes fine-scale sequences at the level of individual spikes.
This ultra-sparse representation of sequence events opens new possibilities for spike train modeling.
arXiv Detail & Related papers (2020-10-10T02:21:44Z) - The Neural Tangent Kernel in High Dimensions: Triple Descent and a
Multi-Scale Theory of Generalization [34.235007566913396]
Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well.
An emerging paradigm for describing this unexpected behavior is in terms of a double descent curve.
We provide a precise high-dimensional analysis of generalization with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks with gradient descent.
arXiv Detail & Related papers (2020-08-15T20:55:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.