Modeling the Unigram Distribution
- URL: http://arxiv.org/abs/2106.02289v1
- Date: Fri, 4 Jun 2021 07:02:49 GMT
- Title: Modeling the Unigram Distribution
- Authors: Irene Nikkarinen, Tiago Pimentel, Damián E. Blasi, Ryan Cotterell
- Abstract summary: The unigram distribution is the non-contextual probability of finding a specific word form in a corpus.
We present a novel model for estimating it in a language.
- Score: 39.153612297712655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The unigram distribution is the non-contextual probability of finding a
specific word form in a corpus. While of central importance to the study of
language, it is commonly approximated by each word's sample frequency in the
corpus. This approach, being highly dependent on sample size, assigns zero
probability to any out-of-vocabulary (oov) word form. As a result, it produces
negatively biased probabilities for oov word forms and positively biased
probabilities for in-corpus words. In this work, we argue in favor of properly
modeling the unigram distribution -- claiming it should be a central task in
natural language processing. With this in mind, we present a novel model for
estimating it in a language (a neuralization of Goldwater et al.'s (2011)
model) and show it produces much better estimates across a diverse set of 7
languages than the naïve use of neural character-level language models.
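To make the problem concrete, here is a minimal sketch, assuming a toy corpus and an add-one-smoothed character bigram model that are both invented for illustration (this is not the paper's model, which neuralizes Goldwater et al.'s two-stage construction): the sample-frequency estimate gives exactly zero probability to any unseen word form, while even a crude character-level model spreads nonzero mass over oov forms.

```python
from collections import Counter

# Toy corpus; in practice this would be millions of running words.
corpus = "the cat sat on the mat the cat ran".split()
counts = Counter(corpus)
total = sum(counts.values())

def empirical_prob(word):
    """Sample-frequency estimate: exactly 0.0 for any oov form."""
    return counts[word] / total

# A crude add-one-smoothed character bigram model over word forms:
# every string, seen or not, receives nonzero probability.
BOW, EOW = "<", ">"
alphabet = set("".join(counts)) | {EOW}
bigrams, contexts = Counter(), Counter()
for word, c in counts.items():
    chars = [BOW] + list(word) + [EOW]
    for a, b in zip(chars, chars[1:]):
        bigrams[(a, b)] += c
        contexts[a] += c

def char_model_prob(word):
    """Product of smoothed character-bigram probabilities."""
    chars = [BOW] + list(word) + [EOW]
    p = 1.0
    for a, b in zip(chars, chars[1:]):
        p *= (bigrams[(a, b)] + 1) / (contexts[a] + len(alphabet))
    return p

for w in ["cat", "hat"]:  # "hat" is out of vocabulary
    print(w, empirical_prob(w), char_model_prob(w))
```

Running the snippet shows the empirical estimate assigning "hat" zero probability, while the character model assigns both words a small but nonzero probability.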
Related papers
- Leading Whitespaces of Language Models' Subword Vocabulary Pose a Confound for Calculating Word Probabilities [15.073507986272027]
We argue that there is a confound posed by the most common method of aggregating subword probabilities into word probabilities.
This is due to the fact that tokens in the subword vocabulary of most language models have leading whitespaces.
We present a simple decoding technique that reassigns the probability of the trailing whitespace to the current word.
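To see the confound, consider this toy sketch, assuming made-up tokens and log-probabilities under a hypothetical GPT-2-style vocabulary (this is not the paper's decoding technique): the "Ġ" marker folds the preceding whitespace into the next word's first token, so a naive aggregation credits that whitespace to the wrong word.

```python
import math

# Hypothetical tokenization of " the cats sat", where "Ġ" marks a
# leading whitespace attached to the token that follows it.
tokens   = ["Ġthe", "Ġcat", "s", "Ġsat"]
logprobs = [-2.1, -3.5, -0.4, -2.8]   # made-up per-token log probabilities

# Naive aggregation: start a new word at every leading "Ġ" and sum the
# log-probabilities of its pieces.  The probability of the whitespace
# terminating "cats" is carried by the next token "Ġsat", so the word
# probability of "cats" omits it while "sat" absorbs it.
words, word_logprobs = [], []
for tok, lp in zip(tokens, logprobs):
    if tok.startswith("Ġ"):           # a new word begins here
        words.append(tok[1:])
        word_logprobs.append(lp)
    else:                             # continuation piece of the current word
        words[-1] += tok
        word_logprobs[-1] += lp

for w, lp in zip(words, word_logprobs):
    print(f"P({w}) ~ {math.exp(lp):.4f}")
```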
arXiv Detail & Related papers (2024-06-16T08:44:56Z)
- A Probability-Quality Trade-off in Aligned Language Models and its Relation to Sampling Adaptors [50.046717886067555]
We show that when sampling corpora from an aligned language model, there exists a trade-off between the strings' average reward and average log-likelihood.
We provide a formal treatment of this phenomenon and demonstrate how a choice of sampling adaptor allows for a selection of how much likelihood we exchange for the reward.
arXiv Detail & Related papers (2024-06-14T17:38:21Z)
- Forcing Diffuse Distributions out of Language Models [70.28345569190388]
Despite being trained specifically to follow user instructions, today's instruction-tuned language models perform poorly when instructed to produce random outputs.
We propose a fine-tuning method that encourages language models to output distributions that are diffuse over valid outcomes.
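As a hedged illustration of what such an objective could look like, here is a generic KL-to-uniform penalty over a set of valid outcomes; this is not necessarily the method proposed in the paper, and all names, shapes, and the digit-token example are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def diffuseness_loss(logits: torch.Tensor, valid_ids: torch.Tensor) -> torch.Tensor:
    """KL(uniform over valid outcomes || model distribution): a generic penalty
    that is minimized when the model spreads its mass evenly over valid outputs.
    Illustrative objective only; not necessarily the paper's exact method."""
    log_probs = F.log_softmax(logits, dim=-1)      # distribution over full vocabulary
    valid_log_probs = log_probs[..., valid_ids]    # restrict to the valid outcomes
    k = valid_ids.numel()
    return -valid_log_probs.mean(dim=-1).mean() - math.log(k)

# Hypothetical usage: next-token logits for a batch of 4 prompts that each ask
# for a random digit, with valid_ids standing in for the ten digit tokens.
logits = torch.randn(4, 50_000, requires_grad=True)
valid_ids = torch.arange(10)
loss = diffuseness_loss(logits, valid_ids)
loss.backward()
```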
arXiv Detail & Related papers (2024-04-16T19:17:23Z)
- Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation [52.270712965271656]
We propose a new model of contextual word representation, not from a neural perspective, but from a purely syntactic and probabilistic perspective.
We find that the graph of our model resembles transformers, with correspondences between dependencies and self-attention.
Experiments show that our model performs competitively to transformers on small to medium sized datasets.
arXiv Detail & Related papers (2023-11-26T06:56:02Z)
- A Natural Bias for Language Generation Models [31.44752136404971]
We show that we can endow standard neural language generation models with a separate module that reflects unigram frequency statistics as prior knowledge.
We use neural machine translation as a test bed for this simple technique and observe that it: (i) improves learning efficiency; (ii) achieves better overall performance; and, perhaps most importantly, (iii) appears to disentangle strong frequency effects.
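One natural reading of such a module is a bias initialization consistent with the summary above; the sketch below is an assumption of how it might look rather than the paper's exact implementation, and the vocabulary size, smoothing, and counts are all stand-ins.

```python
import torch

def init_output_bias_with_unigram(output_layer: torch.nn.Linear,
                                  token_counts: torch.Tensor,
                                  smoothing: float = 1.0) -> None:
    """Set the final projection's bias to log unigram probabilities so that,
    before any training, the model's logits already reflect target-side
    word frequencies.  Sketch only; hyperparameters are assumptions."""
    probs = (token_counts + smoothing) / (token_counts + smoothing).sum()
    with torch.no_grad():
        output_layer.bias.copy_(probs.log())

# Hypothetical usage with a 32k-entry target vocabulary.
vocab_size, d_model = 32_000, 512
proj = torch.nn.Linear(d_model, vocab_size)               # final logit layer
counts = torch.randint(1, 1_000, (vocab_size,)).float()   # stand-in for corpus counts
init_output_bias_with_unigram(proj, counts)
```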
arXiv Detail & Related papers (2022-12-19T18:14:36Z)
- Typical Decoding for Natural Language Generation [76.69397802617064]
We study why high-probability texts can be dull or repetitive.
We show that typical sampling offers competitive performance in terms of quality.
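For reference, here is a minimal NumPy sketch of the typical sampling rule: keep the tokens whose surprisal is closest to the distribution's entropy until their mass reaches a threshold tau, then renormalize. The toy distribution and the tau value are assumptions for illustration.

```python
import numpy as np

def typical_filter(probs: np.ndarray, tau: float = 0.95) -> np.ndarray:
    """Keep the tokens whose surprisal -log p is closest to the entropy of
    the distribution until their cumulative probability reaches tau,
    then renormalize the surviving mass."""
    surprisal = -np.log(probs + 1e-12)
    entropy = np.sum(probs * surprisal)
    order = np.argsort(np.abs(surprisal - entropy))   # most "typical" tokens first
    cutoff = np.searchsorted(np.cumsum(probs[order]), tau) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Toy next-token distribution over a five-word vocabulary.
p = np.array([0.50, 0.25, 0.15, 0.07, 0.03])
q = typical_filter(p, tau=0.9)
next_token = np.random.choice(len(q), p=q)
```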
arXiv Detail & Related papers (2022-02-01T18:58:45Z)
- Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.