Revisiting Simple Neural Probabilistic Language Models
- URL: http://arxiv.org/abs/2104.03474v1
- Date: Thu, 8 Apr 2021 02:18:47 GMT
- Title: Revisiting Simple Neural Probabilistic Language Models
- Authors: Simeng Sun, Mohit Iyyer
- Abstract summary: This paper revisits the neural probabilistic language model (NPLM) of Bengio et al. (2003).
When scaled up to modern hardware, this model performs much better than expected on word-level language model benchmarks.
Inspired by this result, we modify the Transformer by replacing its first self-attention layer with the NPLM's local concatenation layer.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in language modeling has been driven not only by advances in
neural architectures, but also by hardware and optimization improvements.
In this paper, we revisit the neural probabilistic language model (NPLM)
of Bengio et al. (2003), which simply concatenates word embeddings within a
fixed window and passes the result through a feed-forward network to predict
the next word. When scaled up to modern hardware, this model (despite its many
limitations) performs much better than expected on word-level language model
benchmarks. Our analysis reveals that the NPLM achieves lower perplexity than a
baseline Transformer with short input contexts but struggles to handle
long-term dependencies. Inspired by this result, we modify the Transformer by
replacing its first self-attention layer with the NPLM's local concatenation
layer, which results in small but consistent perplexity decreases across three
word-level language modeling datasets.
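To make the description concrete, below is a minimal sketch of an NPLM-style local concatenation layer of the kind the abstract describes, written in PyTorch. The class name, dimensions, padding scheme, and the two-layer feed-forward mixer are illustrative assumptions, not the authors' exact configuration; in the hybrid model described above, such a layer would stand in for the first self-attention sub-layer of the Transformer, with the remaining layers left unchanged.

```python
import torch
import torch.nn as nn

class LocalConcatLayer(nn.Module):
    """NPLM-style local concatenation layer (illustrative sketch).

    For each position, the embeddings of the current token and the
    `window - 1` preceding tokens are concatenated and passed through a
    small feed-forward network that maps back to the model dimension.
    """

    def __init__(self, d_model: int = 512, window: int = 4):
        super().__init__()
        self.window = window
        self.mix = nn.Sequential(
            nn.Linear(window * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        # Left-pad with zeros so each position sees only itself and earlier
        # tokens (a causal, fixed-size window).
        pad = x.new_zeros(batch, self.window - 1, d_model)
        padded = torch.cat([pad, x], dim=1)
        # Sliding windows: (batch, seq_len, d_model, window) after unfold.
        windows = padded.unfold(dimension=1, size=self.window, step=1)
        # Reorder and flatten to (batch, seq_len, window * d_model).
        windows = windows.transpose(2, 3).reshape(batch, seq_len, self.window * d_model)
        return self.mix(windows)


# Usage sketch: this layer could replace the first self-attention sub-layer
# of a Transformer decoder block; the remaining attention layers stay intact.
layer = LocalConcatLayer(d_model=512, window=4)
out = layer(torch.randn(2, 16, 512))  # -> shape (2, 16, 512)
```

A wider window lets this layer summarize more local context before the remaining self-attention layers handle long-range dependencies, which mirrors the abstract's finding that the NPLM is strongest on short input contexts but struggles with long-term dependencies.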
Related papers
- LlaMaVAE: Guiding Large Language Model Generation via Continuous Latent Sentence Spaces [1.529963465178546]
We present LlaMaVAE, which combines expressive encoder and decoder models (sentenceT5 and LlaMA) with a VAE architecture to provide better text generation control for large language models (LLMs).
Experimental results reveal that LlaMaVAE can outperform the previous state-of-the-art VAE language model, Optimus, across various tasks.
arXiv Detail & Related papers (2023-12-20T17:25:23Z)
- Meta-Learning Fast Weight Language Models [105.66999854213724]
We present Fast Weight Layers (FWLs), a neural component that provides the benefits of dynamic evaluation much more efficiently.
FWLs can be applied at training time so the model learns to make good use of gradient updates.
arXiv Detail & Related papers (2022-12-05T18:37:09Z)
- Pre-Training a Graph Recurrent Network for Language Representation [34.4554387894105]
We consider a graph recurrent network for language model pre-training, which builds a graph structure for each sequence with local token-level communications.
We find that our model can generate more diverse outputs with less contextualized feature redundancy than existing attention-based models.
arXiv Detail & Related papers (2022-09-08T14:12:15Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute, with a potential speedup of up to 3x, while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
- N-Grammer: Augmenting Transformers with latent n-grams [35.39961549040385]
We propose a simple yet effective modification to the Transformer architecture, inspired by the statistical language modeling literature: we augment the model with n-grams constructed from a discrete latent representation of the text sequence.
We evaluate our model, the N-Grammer, on language modeling on the C4 dataset as well as text classification on the SuperGLUE dataset, and find that it outperforms several strong baselines such as the Transformer and the Primer.
arXiv Detail & Related papers (2022-07-13T17:18:02Z)
- Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) have long been used to address context sparsity in n-gram LMs.
In this study, we revisit this approach in the context of neural LMs.
arXiv Detail & Related papers (2022-03-21T01:16:44Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, the factorized neural Transducer, which factorizes the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
- GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterward, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- Character-level Transformer-based Neural Machine Translation [5.699756532377753]
We discuss a novel Transformer-based approach, which we compare in both speed and quality to the Transformer at the subword and character levels.
We evaluate our models on 4 language pairs from WMT'15: DE-EN, CS-EN, FI-EN and RU-EN.
The proposed novel architecture can be trained on a single GPU and is 34% faster than the character-level Transformer.
arXiv Detail & Related papers (2020-05-22T15:40:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.