The Role of $n$-gram Smoothing in the Age of Neural Networks
- URL: http://arxiv.org/abs/2403.17240v2
- Date: Tue, 30 Apr 2024 20:58:40 GMT
- Title: The Role of $n$-gram Smoothing in the Age of Neural Networks
- Authors: Luca Malagutti, Andrius Buinovskij, Anej Svete, Clara Meister, Afra Amini, Ryan Cotterell
- Abstract summary: This paper re-opens the question of what role classical $n$-gram smoothing techniques may play in the age of neural language models.
We derive a framework for converting any $n$-gram smoothing technique into a regularizer compatible with neural language models.
- Score: 60.23726773548038
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For nearly three decades, language models derived from the $n$-gram assumption held the state of the art on the task of language modeling. The key to their success lay in the application of various smoothing techniques that served to combat overfitting. However, when neural language models toppled $n$-gram models as the best performers, $n$-gram smoothing techniques became less relevant. Indeed, it would hardly be an overstatement to suggest that the line of inquiry into $n$-gram smoothing techniques became dormant. This paper re-opens the question of what role classical $n$-gram smoothing techniques may play in the age of neural language models. First, we draw a formal equivalence between label smoothing, a popular regularization technique for neural language models, and add-$\lambda$ smoothing. Second, we derive a generalized framework for converting any $n$-gram smoothing technique into a regularizer compatible with neural language models. Our empirical results find that our novel regularizers are comparable to, and indeed sometimes outperform, label smoothing on language modeling and machine translation.
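To make the two objects in the abstract concrete, here is a minimal Python sketch (our own illustration with toy values for $V$, $\lambda$, and the smoothing mass $\epsilon$; it is not the paper's code or derivation). It builds an add-$\lambda$-smoothed distribution from $n$-gram counts and a label-smoothed target from a one-hot vector; the paper's contribution is a formal correspondence between constructions of this kind and a recipe for turning other classical smoothers into neural regularizers.

```python
import numpy as np

# Toy illustration (ours, not the paper's code): add-lambda smoothing of
# n-gram counts vs. label smoothing of a one-hot training target.
V = 5        # toy vocabulary size
lam = 0.5    # add-lambda pseudo-count (assumed hyperparameter)
eps = 0.1    # label-smoothing mass (assumed hyperparameter)

# Add-lambda smoothing: add the pseudo-count lam to every count, then normalize.
counts = np.array([7.0, 2.0, 1.0, 0.0, 0.0])   # next-word counts in some context
p_add_lambda = (counts + lam) / (counts.sum() + lam * V)

# Label smoothing: mix the one-hot target with the uniform distribution.
one_hot = np.eye(V)[0]                         # gold next word has index 0
p_label_smooth = (1.0 - eps) * one_hot + eps / V

print(p_add_lambda.sum(), p_label_smooth.sum())  # both sum to 1
```

Both constructions move probability mass from observed events toward unseen ones, which is the intuition the paper formalizes and then generalizes beyond add-$\lambda$ to other classical smoothers.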
Related papers
- Neural Metamorphosis [72.88137795439407]
This paper introduces a new learning paradigm termed Neural Metamorphosis (NeuMeta), which aims to build self-morphable neural networks.
NeuMeta directly learns the continuous weight manifold of neural networks.
It sustains full-size performance even at a 75% compression rate.
arXiv Detail & Related papers (2024-10-10T14:49:58Z) - $S^3$ -- Semantic Signal Separation [0.16492989697868893]
We present a theory-driven topic modeling approach in neural embedding spaces.
$S^3$ conceptualizes topics as independent axes of semantic space and uncovers them with blind-source separation.
Our approach provides the most diverse, highly coherent topics, requires no preprocessing, and is demonstrated to be the fastest contextually sensitive topic model to date.
arXiv Detail & Related papers (2024-06-13T19:43:38Z) - Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens [138.36729703589512]
We show that $n$-gram language models are still relevant in this era of neural large language models (LLMs).
We do so by modernizing $n$-gram LMs in two aspects. First, we train them at the same data scale as neural LLMs -- 5 trillion tokens.
Second, existing $n$-gram LMs use a small $n$, which hinders their performance; we instead allow $n$ to be arbitrarily large by introducing a new $\infty$-gram LM with backoff.
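As a rough illustration of the backoff idea (our own simplification with made-up counts, not the infini-gram implementation, which relies on suffix arrays over the full corpus): back off from the longest suffix of the context that has been observed and use the relative frequency of the next word after that suffix.

```python
from collections import Counter

def backoff_prob(context, word, next_counts):
    """Toy unbounded-backoff estimate: find the longest observed suffix of
    `context` (next_counts[ctx] is a Counter of next-word counts for the
    context tuple ctx) and return the relative frequency of `word` after it."""
    for start in range(len(context) + 1):   # try the longest suffix first
        ctx = tuple(context[start:])
        counts = next_counts.get(ctx)
        if counts:                           # this suffix was seen in the corpus
            return counts[word] / sum(counts.values())
    return 0.0                               # nothing matched at all

# The full 4-word context was never seen, so the estimate backs off to ("cat",).
ngrams = {("cat",): Counter({"sat": 3, "ran": 1}),
          (): Counter({"the": 5, "cat": 4, "sat": 3, "ran": 1})}
print(backoff_prob(["a", "very", "big", "cat"], "sat", ngrams))  # 0.75
```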
arXiv Detail & Related papers (2024-01-30T19:03:49Z) - Residual Learning of Neural Text Generation with $n$-gram Language Model [41.26228768053928]
We learn a neural LM that fits the residual between an $n$-gram LM and the real-data distribution.
Our approach consistently attains additional performance gains over popular standalone neural models.
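One way to read "fits the residual" is in log space: the neural network outputs a correction to the $n$-gram log-probabilities and the sum is renormalized over the vocabulary. The sketch below is our own schematic of that reading, not the paper's architecture, and the function name and toy inputs are ours.

```python
import numpy as np

def combined_log_probs(ngram_log_probs, residual_logits):
    """Schematic residual combination over a single vocabulary axis: add the
    learned correction to the n-gram log-probabilities, then log-softmax."""
    logits = ngram_log_probs + residual_logits
    return logits - np.logaddexp.reduce(logits)   # log-softmax (logsumexp normalizer)

# Toy 3-word vocabulary: the residual nudges mass toward the second word.
print(combined_log_probs(np.log([0.7, 0.2, 0.1]), np.array([0.0, 0.5, -0.5])))
```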
arXiv Detail & Related papers (2022-10-26T02:42:53Z) - Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large number of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z) - LT-LM: a novel non-autoregressive language model for single-shot lattice rescoring [55.16665077221941]
We propose a novel rescoring approach, which processes the entire lattice in a single call to the model.
The key feature of our rescoring policy is a novel non-autoregressive Lattice Transformer Language Model (LT-LM).
arXiv Detail & Related papers (2021-04-06T14:06:07Z) - GTAE: Graph-Transformer based Auto-Encoders for Linguistic-Constrained Text Style Transfer [119.70961704127157]
Non-parallel text style transfer has attracted increasing research interests in recent years.
Current approaches still lack the ability to preserve the content, and even the logic, of the original sentences.
We propose Graph-Transformer based Auto-Encoders (GTAE), which model a sentence as a linguistic graph and perform feature extraction and style transfer at the graph level.
arXiv Detail & Related papers (2021-02-01T11:08:45Z) - Neural Baselines for Word Alignment [0.0]
We study and evaluate neural models for unsupervised word alignment for four language pairs.
We show that neural versions of the IBM-1 and hidden Markov models vastly outperform their discrete counterparts.
arXiv Detail & Related papers (2020-09-28T07:51:03Z) - Investigation of Large-Margin Softmax in Neural Language Modeling [43.51826343967195]
We investigate whether introducing large margins into neural language models would improve perplexity and, consequently, the word error rate in automatic speech recognition.
We find that although perplexity degrades slightly, neural language models with large-margin softmax can yield a word error rate similar to that of the standard softmax baseline.
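"Large-margin softmax" covers several formulations; one common additive-margin variant subtracts a fixed margin from the gold-class logit before the softmax, so the gold word has to beat the alternatives by at least that margin. The sketch below shows only that variant, with a made-up margin value, as our own illustration rather than the paper's exact losses.

```python
import numpy as np

def additive_margin_xent(logits, gold, margin=0.35):
    """Cross-entropy with an additive margin subtracted from the gold-class
    logit before the softmax (one common large-margin variant; toy values)."""
    z = logits.astype(float)
    z[gold] -= margin            # the gold word must win by at least `margin`
    z -= z.max()                 # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[gold]

print(additive_margin_xent(np.array([2.0, 1.0, -1.0]), gold=0))
```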
arXiv Detail & Related papers (2020-05-20T14:53:19Z)