Generalized Entropy Regularization or: There's Nothing Special about
Label Smoothing
- URL: http://arxiv.org/abs/2005.00820v2
- Date: Tue, 12 May 2020 06:22:06 GMT
- Title: Generalized Entropy Regularization or: There's Nothing Special about
Label Smoothing
- Authors: Clara Meister, Elizabeth Salesky, Ryan Cotterell
- Abstract summary: We introduce a family of entropy regularizers, which includes label smoothing as a special case.
We find that variance in model performance can be explained largely by the resulting entropy of the model.
We advise the use of other entropy regularization methods in place of label smoothing.
- Score: 83.78668073898001
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior work has explored directly regularizing the output distributions of
probabilistic models to alleviate peaky (i.e. over-confident) predictions, a
common sign of overfitting. This class of techniques, of which label smoothing
is one, has a connection to entropy regularization. Despite the consistent
success of label smoothing across architectures and data sets in language
generation tasks, two problems remain open: (1) there is little understanding
of the underlying effects entropy regularizers have on models, and (2) the full
space of entropy regularization techniques is largely unexplored. We introduce
a parametric family of entropy regularizers, which includes label smoothing as
a special case, and use it to gain a better understanding of the relationship
between the entropy of a model and its performance on language generation
tasks. We also find that variance in model performance can be explained largely
by the resulting entropy of the model. Lastly, we find that label smoothing
provably does not allow for sparsity in an output distribution, an undesirable
property for language generation models, and therefore advise the use of other
entropy regularization methods in its place.
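As a concrete illustration of the connection between label smoothing and entropy regularization described in the abstract, the sketch below (assumed PyTorch code, not the authors' implementation; the epsilon value and toy logits are made up) shows that the label-smoothed loss equals (1 - eps) times the negative log-likelihood plus eps times KL(uniform || model), up to a constant. Because KL(u || p) diverges whenever any predicted probability approaches zero, this form also makes the sparsity argument visible: a label-smoothed model can never place exactly zero probability on any token.

```python
# Minimal PyTorch sketch (not the authors' code) of the identity behind the
# paper: label smoothing == cross-entropy + a KL(uniform || model) penalty,
# up to a constant.  eps and the toy logits below are arbitrary.
import torch
import torch.nn.functional as F

def label_smoothed_nll(logits, target, eps=0.1):
    """Cross-entropy against a target mixed with the uniform distribution."""
    log_p = F.log_softmax(logits, dim=-1)
    nll = -log_p.gather(-1, target.unsqueeze(-1)).squeeze(-1)  # -log p_y
    uniform_term = -log_p.mean(dim=-1)                         # -(1/V) * sum_i log p_i
    return (1 - eps) * nll + eps * uniform_term

def entropy_regularized_nll(logits, target, eps=0.1):
    """Equivalent form: (1 - eps) * NLL + eps * (KL(u || p) + H(u))."""
    log_p = F.log_softmax(logits, dim=-1)
    vocab = torch.tensor(float(logits.size(-1)))
    nll = -log_p.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    kl_u_p = -log_p.mean(dim=-1) - torch.log(vocab)  # diverges if any p_i -> 0
    h_u = torch.log(vocab)                           # entropy of the uniform distribution
    return (1 - eps) * nll + eps * (kl_u_p + h_u)

logits = torch.tensor([[2.0, 0.5, -1.0, 0.0]])
target = torch.tensor([0])
assert torch.allclose(label_smoothed_nll(logits, target),
                      entropy_regularized_nll(logits, target))
```

The parametric family introduced in the paper generalizes this kind of penalty (with label smoothing at one end and a confidence penalty at the other); the sketch only covers the label-smoothing case.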
Related papers
- A Complete Decomposition of KL Error using Refined Information and Mode Interaction Selection [11.994525728378603]
We revisit the classical formulation of the log-linear model with a focus on higher-order mode interactions.
We find that our learned distributions are able to more efficiently use the finite amount of data which is available in practice.
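For readers unfamiliar with the model referenced above, a log-linear model over discrete variables with interaction terms up to a chosen order can be written in the generic textbook form below; this is not necessarily the exact formulation used in that paper.

$$
p_\theta(x) \;=\; \frac{1}{Z(\theta)} \exp\!\Big( \sum_{I \in \mathcal{I}} \theta_I \, \phi_I(x_I) \Big),
\qquad
Z(\theta) \;=\; \sum_{x} \exp\!\Big( \sum_{I \in \mathcal{I}} \theta_I \, \phi_I(x_I) \Big),
$$

where $\mathcal{I}$ is the set of retained (possibly higher-order) mode interactions and $x_I$ denotes the coordinates of $x$ indexed by $I$.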
arXiv Detail & Related papers (2024-10-15T18:08:32Z) - An Entropy-Based Test and Development Framework for Uncertainty Modeling in Level-Set Visualizations [2.5449631655313896]
We use an entropy calculation directly on ensemble data to establish an expected result.
We show that fewer bins in nonparametric histogram models are more effective whereas large numbers of bins in quantile models approach data accuracy.
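A minimal sketch of the kind of computation described above (assumed NumPy code, not the authors' framework): Shannon entropy estimated from a histogram of ensemble values, where the number of bins is the modeling choice the paper studies.

```python
# Shannon entropy of an ensemble estimated from a nonparametric histogram.
# The bin count is the knob discussed above; the data here is synthetic.
import numpy as np

def histogram_entropy(samples, bins=8):
    counts, _ = np.histogram(samples, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                      # drop empty bins so the log is defined
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
ensemble = rng.normal(size=1000)      # stand-in for ensemble member values at a grid point
print(histogram_entropy(ensemble, bins=8))
```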
arXiv Detail & Related papers (2024-09-13T00:31:16Z) - Mutual Exclusivity Training and Primitive Augmentation to Induce
Compositionality [84.94877848357896]
Recent datasets expose the lack of systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z) - Adaptive Label Smoothing with Self-Knowledge in Natural Language
Generation [16.878277421402945]
We propose a regularization scheme that makes the smoothing parameter dynamic.
A model in training self-regulates the extent of smoothing on the fly during forward propagation.
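As a rough illustration of a smoothing parameter that is set on the fly, the sketch below ties the per-position smoothing weight to the model's own prediction entropy; the specific rule is hypothetical and not the scheme proposed in that paper.

```python
# Hypothetical dynamic smoothing rule: over-confident (low-entropy) positions
# receive more smoothing.  This is an illustration, not the paper's method.
import torch
import torch.nn.functional as F

def adaptive_smoothed_loss(logits, target, max_eps=0.2):
    log_p = F.log_softmax(logits, dim=-1)
    vocab = torch.tensor(float(logits.size(-1)))
    norm_entropy = -(log_p.exp() * log_p).sum(dim=-1) / torch.log(vocab)  # in [0, 1]
    eps = max_eps * (1.0 - norm_entropy)             # smoothing weight per position
    nll = -log_p.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    uniform_term = -log_p.mean(dim=-1)
    return ((1 - eps) * nll + eps * uniform_term).mean()
```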
arXiv Detail & Related papers (2022-10-22T11:52:38Z) - Neuro-Symbolic Entropy Regularization [78.16196949641079]
In structured prediction, the goal is to jointly predict many output variables that together encode a structured object.
One approach -- entropy regularization -- posits that decision boundaries should lie in low-probability regions.
We propose a loss, neuro-symbolic entropy regularization, that encourages the model to confidently predict a valid object.
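The toy sketch below (brute-force enumeration over a tiny output space; assumed code, not the released implementation) illustrates the idea: reward probability mass on constraint-satisfying outputs while penalizing the entropy of the distribution restricted to them, so the model predicts a valid object confidently.

```python
# Toy illustration: a semantic term (mass on valid outputs) plus the entropy of
# the distribution conditioned on validity.  Real implementations avoid this
# brute-force enumeration; everything here is a small made-up example.
import itertools
import torch

def neuro_symbolic_entropy_loss(probs, is_valid):
    """probs: independent Bernoulli probabilities for n binary output variables.
    is_valid: predicate over a tuple of 0/1 assignments."""
    n = probs.numel()
    valid_mass = torch.zeros(())
    valid_probs = []
    for assignment in itertools.product([0, 1], repeat=n):
        a = torch.tensor(assignment, dtype=probs.dtype)
        p = torch.prod(probs * a + (1 - probs) * (1 - a))   # P(assignment)
        if is_valid(assignment):
            valid_mass = valid_mass + p
            valid_probs.append(p)
    cond = torch.stack(valid_probs) / valid_mass             # P(assignment | valid)
    cond_entropy = -(cond * torch.log(cond + 1e-12)).sum()
    semantic_loss = -torch.log(valid_mass + 1e-12)           # push mass onto valid outputs
    return semantic_loss + cond_entropy

# Example constraint: exactly one of two binary variables is on.
probs = torch.tensor([0.7, 0.4])
loss = neuro_symbolic_entropy_loss(probs, lambda a: sum(a) == 1)
```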
arXiv Detail & Related papers (2022-01-25T06:23:10Z) - Squared $\ell_2$ Norm as Consistency Loss for Leveraging Augmented Data
to Learn Robust and Invariant Representations [76.85274970052762]
Regularizing distance between embeddings/representations of original samples and augmented counterparts is a popular technique for improving robustness of neural networks.
In this paper, we explore these various regularization choices, seeking to provide a general understanding of how we should regularize the embeddings.
We show that the generic approach we identified (squared $\ell_2$ regularized augmentation) outperforms several recent methods, which are each specially designed for one task.
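A minimal sketch of the regularizer described above, with an assumed encoder `f`, pre-augmented input `x_aug`, and task loss; none of these names come from the paper.

```python
# Consistency regularization: squared l2 distance between representations of a
# sample and its augmented counterpart, added to the ordinary task loss.
import torch

def consistency_regularized_loss(f, task_loss, x, x_aug, y, lam=1.0):
    z, z_aug = f(x), f(x_aug)                        # representations or logits
    reg = ((z - z_aug) ** 2).sum(dim=-1).mean()      # squared l2 distance
    return task_loss(z, y) + lam * reg
```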
arXiv Detail & Related papers (2020-11-25T22:40:09Z) - Understanding Double Descent Requires a Fine-Grained Bias-Variance
Decomposition [34.235007566913396]
We describe an interpretable, symmetric decomposition of the variance into terms associated with the labels.
We find that the bias decreases monotonically with the network width, but the variance terms exhibit non-monotonic behavior.
We also analyze the strikingly rich phenomenology that arises.
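For context, the classical decomposition that the paper refines reads as follows (squared error, fixed input $x$; the paper's finer, label-associated split of the variance term is not reproduced here):

$$
\mathbb{E}_{D,\varepsilon}\big[(y - \hat{f}_D(x))^2\big]
\;=\; \underbrace{\big(f(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2}_{\text{bias}^2}
\;+\; \underbrace{\mathrm{Var}_D\big[\hat{f}_D(x)\big]}_{\text{variance}}
\;+\; \underbrace{\sigma^2}_{\text{noise}},
$$

where $\hat{f}_D$ is the model trained on dataset $D$ and $y = f(x) + \varepsilon$ with noise variance $\sigma^2$.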
arXiv Detail & Related papers (2020-11-04T21:04:02Z) - Removing Bias in Multi-modal Classifiers: Regularization by Maximizing
Functional Entropies [88.0813215220342]
Some modalities can more easily contribute to the classification results than others.
We develop a method based on the log-Sobolev inequality, which bounds the functional entropy by the functional Fisher information.
On the two challenging multi-modal datasets VQA-CPv2 and SocialIQ, we obtain state-of-the-art results while more uniformly exploiting the modalities.
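For reference, the Gaussian log-Sobolev inequality that such bounds build on is stated below; the multimodal, tensorized version used in that paper involves additional structure not shown here.

$$
\mathrm{Ent}_{\gamma}(f) \;:=\; \int f \log f \, d\gamma \;-\; \Big(\int f \, d\gamma\Big)\log\Big(\int f \, d\gamma\Big)
\;\le\; \tfrac{1}{2} \int \frac{\|\nabla f\|^2}{f}\, d\gamma ,
$$

for a positive function $f$ and the standard Gaussian measure $\gamma$; the right-hand side is the functional Fisher information.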
arXiv Detail & Related papers (2020-10-21T07:40:33Z) - Flexible mean field variational inference using mixtures of
non-overlapping exponential families [6.599344783327053]
I show that standard mean field variational inference can fail to produce sensible results for models with sparsity-inducing priors.
I show that any mixture of a diffuse exponential family and a point mass at zero to model sparsity forms an exponential family.
arXiv Detail & Related papers (2020-10-14T01:46:56Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.