Investigation of Large-Margin Softmax in Neural Language Modeling
- URL: http://arxiv.org/abs/2005.10089v2
- Date: Wed, 21 Apr 2021 12:45:20 GMT
- Title: Investigation of Large-Margin Softmax in Neural Language Modeling
- Authors: Jingjing Huo, Yingbo Gao, Weiyue Wang, Ralf Schlüter, Hermann Ney
- Abstract summary: We investigate whether introducing large margins into neural language models improves perplexity and, consequently, word error rate in automatic speech recognition.
We find that although perplexity degrades slightly, neural language models with large-margin softmax can yield word error rates similar to those of the standard softmax baseline.
- Score: 43.51826343967195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To encourage intra-class compactness and inter-class separability among
trainable feature vectors, large-margin softmax methods are developed and
widely applied in the face recognition community. The introduction of the
large-margin concept into the softmax is reported to have good properties such
as enhanced discriminative power, less overfitting and well-defined geometric
intuitions. Nowadays, language modeling is commonly approached with neural
networks using softmax and cross entropy. In this work, we investigate whether
introducing large margins into neural language models improves perplexity and,
consequently, word error rate in automatic speech recognition. Specifically, we
first implement and test various types of conventional margins, following
previous work in face recognition. To address the distribution
of natural language data, we then compare different strategies for word vector
norm-scaling. After that, we apply the best norm-scaling setup in combination
with various margins and conduct neural language model rescoring experiments
in automatic speech recognition. We find that although perplexity degrades
slightly, neural language models with large-margin softmax can yield word error
rates similar to those of the standard softmax baseline. Finally, expected
margins are analyzed through visualization of word vectors, showing that the
syntactic and semantic relationships are also preserved.
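As a concrete illustration of the technique under study, here is a minimal PyTorch sketch of one representative margin variant (an additive, CosFace-style margin) combined with norm-scaling: context and word vectors are L2-normalized, the cosine logits are rescaled by a factor s, and a margin m is subtracted from the target-class logit. The class name and the values of s and m are illustrative assumptions, not the paper's tuned setup.

```python
import torch
import torch.nn.functional as F


class AdditiveMarginSoftmaxLoss(torch.nn.Module):
    """Additive-margin (CosFace-style) softmax cross entropy over the vocabulary."""

    def __init__(self, hidden_dim: int, vocab_size: int, s: float = 16.0, m: float = 0.2):
        super().__init__()
        # Output word embeddings; L2-normalized at use time (one possible
        # norm-scaling strategy of the kind the paper compares).
        self.weight = torch.nn.Parameter(0.01 * torch.randn(vocab_size, hidden_dim))
        self.s, self.m = s, m

    def forward(self, hidden: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        h = F.normalize(hidden, dim=-1)        # unit-norm context vectors
        w = F.normalize(self.weight, dim=-1)   # unit-norm word vectors
        cos = h @ w.t()                        # cosine logits, (batch, vocab)
        # Subtract the margin from the target class only, then rescale by s.
        onehot = F.one_hot(targets, cos.size(-1)).to(cos.dtype)
        return F.cross_entropy(self.s * (cos - self.m * onehot), targets)


# Toy usage: 4 context vectors of size 8 over a 10-word vocabulary.
loss_fn = AdditiveMarginSoftmaxLoss(hidden_dim=8, vocab_size=10)
loss = loss_fn(torch.randn(4, 8), torch.randint(0, 10, (4,)))
```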
Related papers
- Collapsed Language Models Promote Fairness [88.48232731113306]
We find that debiased language models exhibit collapsed alignment between token representations and word embeddings.
We design a principled fine-tuning method that can effectively improve fairness in a wide range of debiasing methods.
arXiv Detail & Related papers (2024-10-06T13:09:48Z)
- Explicit Word Density Estimation for Language Modelling [24.8651840630298]
In this work we propose a new family of language models based on NeuralODEs and the continuous analogue of Normalizing Flows and manage to improve on some of the baselines.
arXiv Detail & Related papers (2024-06-10T15:21:33Z)
- Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck [11.416426888383873]
We find that smaller models can suffer from saturation, characterized as a drop in performance at some advanced point in training followed by a plateau.
This can be explained by a mismatch between the hidden dimension of smaller models and the high rank of the target contextual probability distribution.
We measure the effect of the softmax bottleneck in various settings and find that models with fewer than 1000 hidden dimensions tend to adopt degenerate latent representations in late pretraining.
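A toy numpy sketch of the bottleneck argument (our illustration, not the paper's code): with hidden size d, the matrix of log-probabilities a softmax output layer can realize over N contexts and V words is a rank-d logit matrix plus a per-row normalization constant, so its rank is at most d + 1, however high-rank the target conditional distribution is.

```python
import numpy as np

rng = np.random.default_rng(0)
N, V, d = 200, 500, 16                  # contexts, vocabulary size, hidden size
H = rng.standard_normal((N, d))         # context vectors
W = rng.standard_normal((V, d))         # output word embeddings

logits = H @ W.T                        # rank <= d
logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

# Rank is capped at d + 1 (prints 17 here), far below min(N, V).
print(np.linalg.matrix_rank(logp))
```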
arXiv Detail & Related papers (2024-04-11T11:10:36Z)
- Lexical semantics enhanced neural word embeddings [4.040491121427623]
Hierarchy-fitting is a novel approach to modelling the semantic similarity nuances inherently stored in IS-A hierarchies.
Results demonstrate the efficacy of hierarchy-fitting in specialising neural embeddings with semantic relations in late fusion.
arXiv Detail & Related papers (2022-10-03T08:10:23Z)
- Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z)
- Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) have long been devised to address context sparsity in $n$-gram LMs.
In this study, we revisit this approach in the context of neural LMs.
arXiv Detail & Related papers (2022-03-21T01:16:44Z)
- Dependency-based Mixture Language Models [53.152011258252315]
We introduce the Dependency-based Mixture Language Models.
In detail, we first train neural language models with a novel dependency modeling objective.
We then formulate the next-token probability by mixing the previous dependency modeling probability distributions with self-attention.
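A hedged toy sketch of that mixing step (shapes and variable names are our assumptions, not the authors' code): each previous position carries a probability distribution over the vocabulary from the dependency modeling stage, and self-attention weights combine them into the next-token distribution.

```python
import torch
import torch.nn.functional as F

batch, prev, vocab, dim = 2, 5, 100, 32
# One distribution over the vocabulary per previous position, e.g. produced
# by the dependency modeling objective.
dep_probs = F.softmax(torch.randn(batch, prev, vocab), dim=-1)

# Self-attention of the current state against the previous positions.
query = torch.randn(batch, 1, dim)
keys = torch.randn(batch, prev, dim)
weights = F.softmax(query @ keys.transpose(1, 2) / dim**0.5, dim=-1)

# Mix: p(next token) = sum_k weight_k * dep_probs_k; rows still sum to 1.
next_probs = (weights @ dep_probs).squeeze(1)   # (batch, vocab)
print(next_probs.sum(dim=-1))
```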
arXiv Detail & Related papers (2022-03-19T06:28:30Z)
- Smoothing and Shrinking the Sparse Seq2Seq Search Space [2.1828601975620257]
We show that entmax-based models effectively solve the "cat got your tongue" problem.
We also generalize label smoothing to the broader family of Fenchel-Young losses.
Our resulting label-smoothed entmax loss models set a new state of the art on multilingual grapheme-to-phoneme conversion.
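For intuition about the entmax family, sparsemax (its alpha = 2 member) can assign exactly zero probability to tokens, which is what makes the search space sparse; a minimal numpy sketch of sparsemax follows.

```python
import numpy as np

def sparsemax(z: np.ndarray) -> np.ndarray:
    """Euclidean projection of a score vector onto the probability simplex."""
    z_sorted = np.sort(z)[::-1]                  # scores in descending order
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    # Support size: largest k with 1 + k * z_(k) > sum of the top-k scores.
    k_max = k[1 + k * z_sorted > cumsum][-1]
    tau = (cumsum[k_max - 1] - 1.0) / k_max      # shared threshold
    return np.maximum(z - tau, 0.0)

p = sparsemax(np.array([2.0, 1.0, 0.1, -1.0]))
print(p, p.sum())   # exact zeros appear; probabilities still sum to 1
```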
arXiv Detail & Related papers (2021-03-18T14:45:38Z)
- Neural Baselines for Word Alignment [0.0]
We study and evaluate neural models for unsupervised word alignment for four language pairs.
We show that neural versions of the IBM-1 and hidden Markov models vastly outperform their discrete counterparts.
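For reference, the discrete baseline here is classic EM-trained IBM Model 1; a compact sketch on a toy parallel corpus (the corpus and the 10 EM iterations are illustrative assumptions, not the paper's setup).

```python
from collections import defaultdict

# Tiny English-French toy corpus, purely for illustration.
corpus = [("the house".split(), "la maison".split()),
          ("the book".split(), "le livre".split()),
          ("a house".split(), "une maison".split())]

t = defaultdict(lambda: 1.0)   # t(f | e), uniform (unnormalized) init
for _ in range(10):            # EM iterations
    count, total = defaultdict(float), defaultdict(float)
    for e_sent, f_sent in corpus:
        for f in f_sent:
            z = sum(t[(f, e)] for e in e_sent)   # E-step: soft alignments
            for e in e_sent:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    # M-step: renormalize the translation table.
    t = defaultdict(float, {(f, e): count[(f, e)] / total[e] for (f, e) in count})

# Inspect the most confident (foreign, English) translation pairs.
print(sorted(t.items(), key=lambda kv: -kv[1])[:4])
```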
arXiv Detail & Related papers (2020-09-28T07:51:03Z)
- Mechanisms for Handling Nested Dependencies in Neural-Network Language Models and Humans [75.15855405318855]
We studied whether a modern artificial neural network trained with "deep learning" methods mimics a central aspect of human sentence processing.
Although the network was solely trained to predict the next word in a large corpus, analysis showed the emergence of specialized units that successfully handled local and long-distance syntactic agreement.
We tested the model's predictions in a behavioral experiment where humans detected violations in number agreement in sentences with systematic variations in the singular/plural status of multiple nouns.
arXiv Detail & Related papers (2020-06-19T12:00:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.