Multi-timescale Representation Learning in LSTM Language Models
- URL: http://arxiv.org/abs/2009.12727v2
- Date: Thu, 18 Mar 2021 00:06:08 GMT
- Title: Multi-timescale Representation Learning in LSTM Language Models
- Authors: Shivangi Mahto, Vy A. Vo, Javier S. Turek, Alexander G. Huth
- Abstract summary: Language models must capture statistical dependencies between words at timescales ranging from very short to very long.
We derived a theory for how the memory gating mechanism in long short-term memory language models can capture power law decay.
Experiments showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution.
- Score: 69.98840820213937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models must capture statistical dependencies between words at
timescales ranging from very short to very long. Earlier work has demonstrated
that dependencies in natural language tend to decay with distance between words
according to a power law. However, it is unclear how this knowledge can be used
for analyzing or designing neural network language models. In this work, we
derived a theory for how the memory gating mechanism in long short-term memory
(LSTM) language models can capture power law decay. We found that unit
timescales within an LSTM, which are determined by the forget gate bias, should
follow an Inverse Gamma distribution. Experiments then showed that LSTM
language models trained on natural English text learn to approximate this
theoretical distribution. Further, we found that explicitly imposing the
theoretical distribution upon the model during training yielded better language
model perplexity overall, with particular improvements for predicting
low-frequency (rare) words. Moreover, the explicit multi-timescale model
selectively routes information about different types of words through units
with different timescales, potentially improving model interpretability. These
results demonstrate the importance of careful, theoretically-motivated analysis
of memory and timescale in language models.
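The key quantitative claim here is the link between a unit's forget-gate bias and its characteristic timescale. The sketch below shows one way to impose the theoretical Inverse Gamma timescale distribution at initialization, assuming the usual decay argument: a unit whose forget gate sits at f = sigmoid(b) retains a fraction 1/e of its memory after T = -1/ln(f) steps, so a target timescale T corresponds to a bias b = -ln(exp(1/T) - 1). The PyTorch gate layout is standard; the Inverse Gamma shape and scale values are illustrative assumptions, not the paper's fitted parameters.

```python
# Sketch: initialize LSTM forget-gate biases so that unit timescales follow an
# Inverse Gamma distribution. Assumes the decay relation f = exp(-1/T), i.e.
# bias b = -log(exp(1/T) - 1); the Inverse Gamma shape/scale defaults below
# are illustrative, not the paper's exact hyperparameters.
import numpy as np
import torch
import torch.nn as nn


def timescale_to_forget_bias(timescales: np.ndarray) -> np.ndarray:
    """Map target timescales T to forget-gate biases b with sigmoid(b) = exp(-1/T)."""
    return -np.log(np.expm1(1.0 / timescales))


def set_multi_timescale_biases(lstm: nn.LSTM, shape: float = 1.0,
                               scale: float = 1.0, seed: int = 0) -> None:
    rng = np.random.default_rng(seed)
    hidden = lstm.hidden_size
    for layer in range(lstm.num_layers):
        # Draw one timescale per unit from an Inverse Gamma distribution
        # (reciprocal of a Gamma draw), then convert to forget-gate biases.
        timescales = 1.0 / rng.gamma(shape, 1.0 / scale, size=hidden)
        bias = torch.tensor(timescale_to_forget_bias(timescales), dtype=torch.float32)
        for name in (f"bias_ih_l{layer}", f"bias_hh_l{layer}"):
            b = getattr(lstm, name)
            with torch.no_grad():
                # PyTorch packs gates as [input, forget, cell, output]; the
                # forget-gate slice is hidden:2*hidden. The two bias vectors
                # are summed, so split the target bias evenly between them.
                b[hidden:2 * hidden] = bias / 2.0


lstm = nn.LSTM(input_size=256, hidden_size=512, num_layers=2)
set_multi_timescale_biases(lstm)
```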
Related papers
- What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages [78.1866280652834]
Large language models (LMs) are distributions over strings.
We investigate the learnability of regular LMs (RLMs) by RNN and Transformer LMs.
We find that the rank of the RLM, a measure of its complexity, is a strong and significant predictor of learnability for both RNNs and Transformers.
arXiv Detail & Related papers (2024-06-06T17:34:24Z)
- Large language models can be zero-shot anomaly detectors for time series? [9.249657468385779]
The sigllm framework performs time series anomaly detection using large language models.
We present a prompt-based detection method that directly asks a language model to indicate which elements of the input are anomalies (a rough sketch of this setup appears after this entry), as well as a forecasting-based method.
We show that the forecasting method significantly outperforms the prompting method on all 11 datasets with respect to the F1 score.
arXiv Detail & Related papers (2024-05-23T16:21:57Z)
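For context on the prompting approach summarized above, here is a rough, self-contained sketch of what prompt-based anomaly detection can look like; the prompt wording and the query_llm helper are hypothetical stand-ins, not part of the sigllm package.

```python
# Sketch of a prompt-based time series anomaly detector in the spirit of the
# method summarized above. `query_llm` is a hypothetical stand-in for whatever
# LLM client is available; it is NOT sigllm's actual API.
from typing import Callable, List, Sequence


def detect_anomalies(values: Sequence[float],
                     query_llm: Callable[[str], str]) -> List[int]:
    """Ask an LLM which indices of a univariate series look anomalous."""
    series = ", ".join(f"{i}:{v:.3f}" for i, v in enumerate(values))
    prompt = (
        "The following is a univariate time series given as index:value pairs.\n"
        f"{series}\n"
        "List the indices of any anomalous points, separated by commas. "
        "If there are none, answer 'none'."
    )
    answer = query_llm(prompt).strip().lower()
    if answer == "none":
        return []
    # Keep only tokens that parse as integer indices within range.
    indices = []
    for token in answer.replace(",", " ").split():
        if token.isdigit() and int(token) < len(values):
            indices.append(int(token))
    return sorted(set(indices))
```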
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- A Natural Bias for Language Generation Models [31.44752136404971]
We show that we can endow standard neural language generation models with a separate module that reflects unigram frequency statistics as prior knowledge.
We use neural machine translation as a test bed for this simple technique and observe that it: (i) improves learning efficiency; (ii) achieves better overall performance; and, perhaps most importantly, (iii) appears to disentangle strong frequency effects (a sketch of one way to inject such a unigram prior follows this entry).
arXiv Detail & Related papers (2022-12-19T18:14:36Z)
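One concrete way to realize such a unigram prior, assuming the bias-initialization reading of the idea (the smoothing constant and layer layout below are illustrative, not the paper's exact recipe), is to set the output projection's bias to the log unigram distribution of the training data:

```python
# Sketch: encode unigram frequency statistics as a prior by initializing the
# output projection's bias to log unigram probabilities. One reading of the
# idea summarized above; details are illustrative assumptions.
from collections import Counter
from typing import Iterable, List

import torch
import torch.nn as nn


def unigram_log_probs(corpus_token_ids: Iterable[int], vocab_size: int,
                      smoothing: float = 1.0) -> torch.Tensor:
    counts = Counter(corpus_token_ids)
    freqs = torch.full((vocab_size,), smoothing)
    for tok, c in counts.items():
        freqs[tok] += c
    return torch.log(freqs / freqs.sum())


def init_output_bias(output_layer: nn.Linear, corpus_token_ids: List[int]) -> None:
    """Set the decoder's output bias so that, before training, the model's
    next-token distribution matches the corpus unigram distribution."""
    with torch.no_grad():
        output_layer.bias.copy_(unigram_log_probs(corpus_token_ids,
                                                  output_layer.out_features))


# Toy usage with a tiny made-up token-id corpus.
decoder = nn.Linear(512, 10000)
init_output_bias(decoder, corpus_token_ids=[1, 5, 5, 42, 7])
```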
- Is neural language acquisition similar to natural? A chronological probing study [0.0515648410037406]
We present a chronological probing study of English transformer models such as MultiBERT and T5.
We compare the linguistic information the models acquire over the course of training on their corpora; a minimal illustration of the probing setup follows this entry.
The results show that (1) linguistic information is acquired in the early stages of training, and (2) both language models demonstrate the ability to capture various features from different levels of language.
arXiv Detail & Related papers (2022-07-01T17:24:11Z)
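As a generic illustration of chronological probing (not necessarily the protocol used in the paper), one can fit a lightweight linear probe on hidden states extracted from checkpoints saved at different training steps and track how probe accuracy evolves; extract_features below is a hypothetical placeholder for running a checkpointed model over the probing data.

```python
# Sketch of chronological probing: fit a simple linear probe on hidden states
# extracted from model checkpoints at different training steps and compare
# accuracies. `extract_features(checkpoint, sentences)` is a hypothetical
# placeholder returning one feature vector per example.
from typing import Callable, Dict, Sequence

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def probe_over_checkpoints(
    checkpoints: Sequence[str],
    sentences: Sequence[str],
    labels: Sequence[int],
    extract_features: Callable[[str, Sequence[str]], np.ndarray],
) -> Dict[str, float]:
    """Return mean cross-validated probe accuracy per training checkpoint."""
    results: Dict[str, float] = {}
    y = np.asarray(labels)
    for ckpt in checkpoints:
        X = extract_features(ckpt, sentences)  # shape: (n_examples, hidden_dim)
        probe = LogisticRegression(max_iter=1000)
        results[ckpt] = float(cross_val_score(probe, X, y, cv=5).mean())
    return results
```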
- Dependency-based Mixture Language Models [53.152011258252315]
We introduce the Dependency-based Mixture Language Models.
In detail, we first train neural language models with a novel dependency modeling objective.
We then formulate the next-token probability by mixing the previous dependency modeling probability distributions with self-attention.
arXiv Detail & Related papers (2022-03-19T06:28:30Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM that operates on linguistic units, including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)