Residual Learning of Neural Text Generation with $n$-gram Language Model
- URL: http://arxiv.org/abs/2210.14431v1
- Date: Wed, 26 Oct 2022 02:42:53 GMT
- Title: Residual Learning of Neural Text Generation with $n$-gram Language Model
- Authors: Huayang Li, Deng Cai, Jin Xu, Taro Watanabe
- Abstract summary: We learn a neural LM that fits the residual between an $n$-gram LM and the real-data distribution.
Our approach consistently attains additional performance gains over popular standalone neural models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: $N$-gram language models (LMs) have been largely superseded by neural LMs, as
the latter exhibit better performance. However, we find that $n$-gram models
can achieve satisfactory performance on a large proportion of testing cases,
indicating they have already captured abundant knowledge of the language with
relatively low computational cost. With this observation, we propose to learn a
neural LM that fits the residual between an $n$-gram LM and the real-data
distribution. The combination of $n$-gram and neural LMs not only allows the
neural part to focus on the deeper understanding of language but also provides
a flexible way to customize an LM by switching the underlying $n$-gram model
without changing the neural model. Experimental results on three typical
language tasks (i.e., language modeling, machine translation, and
summarization) demonstrate that our approach consistently attains additional
performance gains over popular standalone neural models. We also show that our
approach allows for effective domain adaptation by simply switching to a
domain-specific $n$-gram model, without any extra training. Our code is
released at https://github.com/ghrua/NgramRes.
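The residual combination described in the abstract can be sketched as follows. This is a minimal, illustrative reading in which the neural model outputs residual logits that are added to the $n$-gram log-probabilities before a softmax; function and variable names are hypothetical and not taken from the released code.

```python
import math

def combine_residual(ngram_logprobs, residual_logits):
    """Combine n-gram log-probabilities with neural residual logits.

    If the neural part models only the residual, the final distribution
    is a softmax over log p_ngram(w | h) + r_theta(w | h).
    Illustrative formulation; see the paper/repo for the exact objective.
    """
    scores = [lp + r for lp, r in zip(ngram_logprobs, residual_logits)]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Toy vocabulary of 3 tokens: the n-gram LM already prefers token 0;
# the residual shifts some extra mass toward token 2.
ngram_logprobs = [math.log(0.7), math.log(0.2), math.log(0.1)]
residual_logits = [0.0, 0.0, 1.0]
probs = combine_residual(ngram_logprobs, residual_logits)
```

Under this reading, domain adaptation amounts to swapping `ngram_logprobs` for those of a domain-specific $n$-gram model while keeping the residual network fixed.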
Related papers
- Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG [57.14250086701313]
We investigate the extent to which modern LMs generate $n$-grams from their training data.
We develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data.
arXiv Detail & Related papers (2024-06-18T21:31:19Z)
- Neural-g: A Deep Learning Framework for Mixing Density Estimation [16.464806944964003]
Mixing (or prior) density estimation is an important problem in machine learning and statistics.
We propose neural-$g$, a new neural network-based estimator for $g$-modeling.
arXiv Detail & Related papers (2024-06-10T03:00:28Z)
- The Role of $n$-gram Smoothing in the Age of Neural Networks [60.23726773548038]
This paper re-opens the role classical $n$-gram smoothing techniques may play in the age of neural language models.
We derive a framework for converting any $n$-gram smoothing technique into a regularizer compatible with neural language models.
arXiv Detail & Related papers (2024-03-25T22:42:19Z)
- Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens [138.36729703589512]
We show that $n$-gram language models are still relevant in this era of neural large language models (LLMs).
We modernize $n$-gram LMs in two aspects. First, we train them at the same data scale as neural LLMs -- 5 trillion tokens.
Second, existing $n$-gram LMs use a small $n$, which hinders their performance; we instead allow $n$ to be arbitrarily large by introducing a new $\infty$-gram LM with backoff.
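A naive sketch of the $\infty$-gram idea (longest-suffix match with backoff) might look like the following. The function name and the linear-scan implementation are illustrative assumptions; the actual infini-gram engine uses suffix-array indexing rather than scanning.

```python
def longest_suffix_next_counts(corpus_tokens, context):
    """Back off by shortening the context suffix until it occurs in
    the corpus, then count the tokens that follow each occurrence.

    Naive O(n) scan per query, for illustration only.
    """
    for start in range(len(context) + 1):  # try the longest suffix first
        suffix = context[start:]
        n = len(suffix)
        counts = {}
        for i in range(len(corpus_tokens) - n):
            if corpus_tokens[i:i + n] == suffix:
                nxt = corpus_tokens[i + n]
                counts[nxt] = counts.get(nxt, 0) + 1
        if counts:
            return suffix, counts
    return [], {}

# Toy corpus: the full 3-token context "b a b" occurs once, followed by "c".
corpus = "a b a b c a b".split()
suffix, counts = longest_suffix_next_counts(corpus, "b a b".split())
```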
arXiv Detail & Related papers (2024-01-30T19:03:49Z)
- In-Context Language Learning: Architectures and Algorithms [73.93205821154605]
We study in-context learning (ICL) through the lens of a new family of model problems we term in-context language learning (ICLL).
We evaluate a diverse set of neural sequence models on regular ICLL tasks.
arXiv Detail & Related papers (2024-01-23T18:59:21Z)
- A Natural Bias for Language Generation Models [31.44752136404971]
We show that we can endow standard neural language generation models with a separate module that reflects unigram frequency statistics as prior knowledge.
We use neural machine translation as a test bed for this simple technique and observe that it: (i) improves learning efficiency; (ii) achieves better overall performance; and, perhaps most importantly, (iii) appears to disentangle strong frequency effects.
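One plausible reading of this unigram-prior idea is to offset the output logits with log unigram frequencies, so that the network only has to learn deviations from raw frequency. The helper below is a hypothetical sketch of that reading, not the paper's exact parameterization.

```python
import math

def biased_logits(logits, unigram_counts, total):
    """Add log unigram frequency as a fixed prior to model logits.

    Offsetting logits with log p_unigram(w) lets the trainable part
    model only the residual over frequency. Illustrative sketch.
    """
    return [l + math.log(c / total) for l, c in zip(logits, unigram_counts)]

# With uniform model logits, the prior alone ranks tokens by frequency.
biased = biased_logits([0.0, 0.0], [3, 1], total=4)
```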
arXiv Detail & Related papers (2022-12-19T18:14:36Z)
- Improving Robustness and Generality of NLP Models Using Disentangled Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
arXiv Detail & Related papers (2020-09-21T02:48:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.