A Natural Bias for Language Generation Models
- URL: http://arxiv.org/abs/2212.09686v2
- Date: Fri, 23 Jun 2023 05:59:16 GMT
- Title: A Natural Bias for Language Generation Models
- Authors: Clara Meister, Wojciech Stokowiec, Tiago Pimentel, Lei Yu, Laura
Rimell, Adhiguna Kuncoro
- Abstract summary: We show that we can endow standard neural language generation models with a separate module that reflects unigram frequency statistics as prior knowledge.
We use neural machine translation as a test bed for this simple technique and observe that it: (i) improves learning efficiency; (ii) achieves better overall performance; and, perhaps most importantly, (iii) appears to disentangle strong frequency effects.
- Score: 31.44752136404971
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: After just a few hundred training updates, a standard probabilistic model for
language generation has likely not yet learnt many semantic or syntactic rules
of natural language, making it difficult to estimate the probability
distribution over next tokens. Yet around this point, these models have
identified a simple, loss-minimising behaviour: to output the unigram
distribution of the target training corpus. The use of such a heuristic raises
the question: Can we initialise our models with this behaviour and save
precious compute resources and model capacity? Here we show that we can
effectively endow standard neural language generation models with a separate
module that reflects unigram frequency statistics as prior knowledge, simply by
initialising the bias term in a model's final linear layer with the log-unigram
distribution. We use neural machine translation as a test bed for this simple
technique and observe that it: (i) improves learning efficiency; (ii) achieves
better overall performance; and perhaps most importantly (iii) appears to
disentangle strong frequency effects by encouraging the model to specialise in
non-frequency-related aspects of language.
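A minimal sketch of the initialisation described above, in PyTorch (the framework choice, the `unigram_log_probs` helper, and the layer sizes are illustrative assumptions, not the authors' released code):

```python
import torch
import torch.nn as nn
from collections import Counter

def unigram_log_probs(token_ids, vocab_size, smoothing=1.0):
    # Smoothed log-unigram distribution estimated from the target training corpus.
    counts = torch.full((vocab_size,), smoothing)
    for tok, c in Counter(token_ids).items():
        counts[tok] += c
    return torch.log(counts / counts.sum())

# Hypothetical generator: a decoder whose final linear layer projects
# hidden states onto the vocabulary.
vocab_size, hidden_size = 32000, 512
final_layer = nn.Linear(hidden_size, vocab_size, bias=True)

# Initialise the bias with the log-unigram distribution, so that before any
# training (when the hidden-state contribution is still near zero) the softmax
# over the logits already approximates the target corpus's unigram distribution.
target_corpus_token_ids = [3, 7, 3, 11, 7, 3]  # placeholder; use the real target corpus
with torch.no_grad():
    final_layer.bias.copy_(unigram_log_probs(target_corpus_token_ids, vocab_size))
```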
Related papers
- Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing [6.726629754291751]
We introduce a method for quantifying the frequency bias of a language model.
We then present a method for reducing the frequency bias of a language model by inducing a syntactic prior over token representations during pre-training.
This approach results in better performance on infrequent English tokens and a decrease in anisotropy.
arXiv Detail & Related papers (2024-10-15T10:09:57Z)
- A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints [87.08677547257733]
Neuro-symbolic AI bridges the gap between purely symbolic and neural approaches to learning.
We show how to maximize the likelihood of a symbolic constraint w.r.t. the neural network's output distribution.
We also evaluate our approach on Sudoku and shortest-path prediction cast as autoregressive generation.
arXiv Detail & Related papers (2023-12-06T20:58:07Z)
- Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z)
- Evaluating Distributional Distortion in Neural Language Modeling [81.83408583979745]
A heavy tail of rare events accounts for a significant amount of the total probability mass of distributions in language.
Standard language modeling metrics such as perplexity quantify the performance of language models (LMs) in aggregate.
We develop a controlled evaluation scheme which uses generative models trained on natural data as artificial languages.
arXiv Detail & Related papers (2022-03-24T01:09:46Z)
- Dependency-based Mixture Language Models [53.152011258252315]
We introduce the Dependency-based Mixture Language Models.
In detail, we first train neural language models with a novel dependency modeling objective.
We then formulate the next-token probability by mixing the previous dependency modeling probability distributions with self-attention.
arXiv Detail & Related papers (2022-03-19T06:28:30Z)
- Typical Decoding for Natural Language Generation [76.69397802617064]
We study why high-probability texts can be dull or repetitive.
We show that typical sampling offers competitive performance in terms of quality (a decoding sketch follows this list).
arXiv Detail & Related papers (2022-02-01T18:58:45Z)
- Multi-timescale Representation Learning in LSTM Language Models [69.98840820213937]
Language models must capture statistical dependencies between words at timescales ranging from very short to very long.
We derived a theory for how the memory gating mechanism in long short-term memory language models can capture power law decay.
Experiments showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution.
arXiv Detail & Related papers (2020-09-27T02:13:38Z)
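For the Typical Decoding entry above: typical (locally typical) sampling truncates the next-token distribution to the tokens whose information content is closest to the distribution's conditional entropy, then samples from that set. A minimal PyTorch sketch, assuming a single logits vector over the vocabulary; the function name and the default mass threshold `tau` are illustrative:

```python
import torch

def locally_typical_sample(logits: torch.Tensor, tau: float = 0.95) -> int:
    # Locally typical sampling (sketch): keep the tokens whose surprisal
    # (-log p) is closest to the conditional entropy H, up to mass >= tau.
    log_p = torch.log_softmax(logits, dim=-1)
    p = log_p.exp()
    entropy = -(p * log_p).sum()              # conditional entropy H of p
    scores = (entropy + log_p).abs()          # | -log p - H |
    order = torch.argsort(scores)             # most "typical" tokens first
    cum_mass = p[order].cumsum(dim=-1)
    cutoff = int((cum_mass < tau).sum()) + 1  # smallest typical set with mass >= tau
    keep = order[:cutoff]
    probs = p[keep] / p[keep].sum()           # renormalise over the kept set
    return int(keep[torch.multinomial(probs, 1)])
```

In use, each decoding step would draw the next token with `locally_typical_sample(model_logits)` in place of ancestral or nucleus sampling.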