The empirical structure of word frequency distributions
- URL: http://arxiv.org/abs/2001.05292v1
- Date: Thu, 9 Jan 2020 20:52:38 GMT
- Title: The empirical structure of word frequency distributions
- Authors: Michael Ramscar
- Abstract summary: I show that first names form natural communicative distributions in most languages.
I then show that this pattern of findings replicates in the communicative distributions of English nouns and verbs.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The frequencies at which individual words occur across languages follow power
law distributions, a pattern of findings known as Zipf's law. A vast literature
argues over whether this serves to optimize the efficiency of human
communication; however, this claim is necessarily post hoc, and it has been
suggested that Zipf's law may in fact describe mixtures of other distributions.
From this perspective, recent findings that Sinosphere first (family) names are
geometrically distributed are notable, because this is actually consistent with
information theoretic predictions regarding optimal coding. First names form
natural communicative distributions in most languages, and I show that when
analyzed in relation to the communities in which they are used, first name
distributions across a diverse set of languages are both geometric and,
historically, remarkably similar, with power law distributions only emerging
when empirical distributions are aggregated. I then show this pattern of
findings replicates in communicative distributions of English nouns and verbs.
These results indicate that if lexical distributions support efficient
communication, they do so because their functional structures directly satisfy
the constraints described by information theory, and not because of Zipf's law.
Understanding the function of these information structures is likely to be key
to explaining humankind's remarkable communicative capacities.
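The aggregation claim lends itself to a quick numerical illustration. The sketch below is my own (not the paper's analysis; the community sizes and the range of geometric parameters are arbitrary): it draws name frequencies for many communities from geometric distributions with different parameters, then checks whether the aggregate rank-frequency curve is better fit on a semi-log scale (geometric) or a log-log scale (power law).

```python
# Minimal sketch (not the paper's code): individually geometric name
# distributions can develop a heavier, Zipf-like tail once aggregated.
import numpy as np

rng = np.random.default_rng(0)
MAX_RANK = 10_000

def community_counts(n_people, p):
    """Rank-frequency counts for one community under a geometric law."""
    ranks = rng.geometric(p, size=n_people)          # rank 1 = most common name
    return np.bincount(ranks, minlength=MAX_RANK + 1)[1:MAX_RANK + 1]

# Aggregate 500 communities whose geometric parameters differ.
aggregate = np.zeros(MAX_RANK)
for p in rng.uniform(0.01, 0.5, size=500):
    aggregate += community_counts(5_000, p)
aggregate = aggregate[aggregate > 0]
ranks = np.arange(1, len(aggregate) + 1)

# Geometric: log-frequency is linear in rank. Power law: linear in log-rank.
semilog_resid = np.polyfit(ranks, np.log(aggregate), 1, full=True)[1][0]
loglog_resid = np.polyfit(np.log(ranks), np.log(aggregate), 1, full=True)[1][0]
print(f"semi-log residual: {semilog_resid:.1f}   log-log residual: {loglog_resid:.1f}")
```

Whether the aggregate truly approaches a power law depends on how the geometric parameters are spread across communities; the point of the sketch is only that aggregation changes the shape of the rank-frequency curve.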
Related papers
- Zipfian Whitening [7.927385005964994]
Most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are uniform.
In reality, word frequencies follow a highly non-uniform distribution, known as Zipf's law.
We show that simply performing PCA whitening weighted by the empirical word frequencies, which follow Zipf's law, significantly improves task performance.
arXiv Detail & Related papers (2024-11-01T15:40:19Z)
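My reading of the Zipfian Whitening abstract above, sketched under assumptions rather than taken from the authors' code: compute the mean and covariance of the embedding space with weights given by the empirical (Zipfian) word frequencies instead of uniform weights, then whiten. The function name, toy data, and idealized 1/rank counts below are mine.

```python
# Hedged sketch of frequency-weighted (Zipfian) PCA whitening; not the authors' code.
import numpy as np

def zipfian_whiten(vectors, counts, eps=1e-8):
    """vectors: (V, d) word embeddings; counts: (V,) corpus frequencies."""
    w = counts / counts.sum()                    # Zipfian probability weights
    mu = w @ vectors                             # frequency-weighted mean
    centred = vectors - mu
    cov = (centred * w[:, None]).T @ centred     # frequency-weighted covariance
    eigval, eigvec = np.linalg.eigh(cov)
    whitener = eigvec / np.sqrt(eigval + eps)    # PCA whitening transform
    return centred @ whitener

# Toy usage with random data standing in for real embeddings and counts.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 50))
counts = 1.0 / np.arange(1, 1001)                # idealized Zipfian frequencies
whitened = zipfian_whiten(embeddings, counts)
```

Under the frequency weighting, the whitening statistics reflect how often words actually occur as tokens rather than treating every vocabulary type equally, which is the contrast the abstract draws.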
- Surprise! Uniform Information Density Isn't the Whole Story: Predicting Surprisal Contours in Long-form Discourse [54.08750245737734]
We propose that speakers modulate information rate based on location within a hierarchically-structured model of discourse.
We find that hierarchical predictors are significant predictors of a discourse's information contour and that deeply nested hierarchical predictors are more predictive than shallow ones.
arXiv Detail & Related papers (2024-10-21T14:42:37Z)
- Testing the Predictions of Surprisal Theory in 11 Languages [77.45204595614]
We investigate the relationship between surprisal and reading times in eleven different languages.
By focusing on a more diverse set of languages, we argue that these results offer the most robust link to date between information theory and incremental language processing across languages.
arXiv Detail & Related papers (2023-07-07T15:37:50Z)
- A Cross-Linguistic Pressure for Uniform Information Density in Word Order [79.54362557462359]
We use computational models to test whether real orders lead to greater information uniformity than counterfactual orders.
Among SVO languages, real word orders consistently have greater uniformity than reverse word orders.
Only linguistically implausible counterfactual orders consistently exceed the uniformity of real orders.
arXiv Detail & Related papers (2023-06-06T14:52:15Z)
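The cross-linguistic UID paper above trains language models on real and counterfactual corpora; the toy sketch below is a much weaker stand-in that rescores a sentence and its reversed order with a pretrained English model, then compares how evenly per-token surprisal is spread (its variance is one simple uniformity measure). The model choice, example sentence, and bits conversion are my assumptions.

```python
# Toy sketch of a uniform-information-density comparison; not the paper's pipeline.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def surprisals(sentence):
    """Per-token surprisal in bits: -log2 p(token | preceding tokens)."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(lm(ids).logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    return -logprobs[torch.arange(len(targets)), targets] / math.log(2)

real = "the dog chased the cat across the yard"
counterfactual = " ".join(reversed(real.split()))    # crude reverse-order proxy

for label, sent in [("real", real), ("reversed", counterfactual)]:
    s = surprisals(sent)
    print(f"{label:9s} mean={s.mean().item():.2f} bits  variance={s.var().item():.2f}")
```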
- A Latent Space Theory for Emergent Abilities in Large Language Models [5.033924641692716]
We show that languages are not created randomly but rather in order to communicate information.
This implies a strong association between languages and their underlying meanings, resulting in a sparse joint distribution.
With the advent of large-scale LLMs trained on big data, we can now precisely assess the marginal distribution of languages.
arXiv Detail & Related papers (2023-04-19T20:45:01Z)
- Diffusion Models are Minimax Optimal Distribution Estimators [49.47503258639454]
We provide the first rigorous analysis of the approximation and generalization abilities of diffusion modeling.
We show that when the true density function belongs to the Besov space and the empirical score matching loss is properly minimized, the generated data distribution achieves the nearly minimax optimal estimation rates.
arXiv Detail & Related papers (2023-03-03T11:31:55Z)
- Norm of Word Embedding Encodes Information Gain [7.934452214142754]
We show that the squared norm of a static word embedding encodes the information gain conveyed by the word.
We also demonstrate that both the KL divergence and the squared norm of embedding provide a useful metric of the informativeness of a word.
arXiv Detail & Related papers (2022-12-19T17:45:07Z)
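A small self-contained sketch of the quantity I take the information-gain abstract above to describe (not the authors' code): estimate a word's informativeness as the KL divergence between the distribution of words that co-occur with it and the corpus-wide unigram distribution; the paper's claim is then that the squared norm of the word's static embedding tracks this value. The window size and the toy corpus are arbitrary.

```python
# Hedged sketch: informativeness of a word as KL(P(context | word) || P(context)).
import math
from collections import Counter

def information_gain(word, sentences, window=5):
    """KL divergence of a word's co-occurrence distribution from the unigram one."""
    unigram, context = Counter(), Counter()
    for sent in sentences:
        unigram.update(sent)
        for i, w in enumerate(sent):
            if w == word:
                context.update(sent[max(0, i - window):i] + sent[i + 1:i + 1 + window])
    if not context:
        return 0.0
    n_uni, n_ctx = sum(unigram.values()), sum(context.values())
    kl = 0.0
    for c, count in context.items():
        p = count / n_ctx                 # P(c | word)
        q = unigram[c] / n_uni            # P(c)
        kl += p * math.log(p / q)
    return kl

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["a", "dog", "sat", "on", "a", "rug"]]
print(information_gain("sat", corpus), information_gain("the", corpus))
```

On a real corpus one would compare this value against the squared norm of the corresponding static vector (e.g. a skip-gram embedding), which is the correlation the abstract reports.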
- Pragmatic Constraint on Distributional Semantics [6.091096843566857]
We show that a Zipf-law token distribution emerges irrespective of the chosen tokenization.
We show that the Zipf distribution is characterized by two distinct groups of tokens that differ both in their frequency and their semantics.
arXiv Detail & Related papers (2022-11-20T17:51:06Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlap frequently occurs in paired texts in natural language processing tasks such as text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show the proposed neighboring distribution divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
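A hedged sketch of the mask-and-predict strategy described above (my own simplification, not the released implementation): mask each shared word in turn, let a masked language model predict a distribution at that position in both texts, and sum the divergence between the paired predictions. The model choice, the word-level alignment, and the use of KL divergence are assumptions of the sketch.

```python
# Hedged sketch of a mask-and-predict divergence over shared words; not the paper's code.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

def masked_distribution(words, position):
    """Log-probabilities the MLM predicts at `position` when that word is masked."""
    masked = list(words)
    masked[position] = tok.mask_token
    ids = tok(" ".join(masked), return_tensors="pt").input_ids
    mask_idx = int((ids[0] == tok.mask_token_id).nonzero()[0, 0])
    with torch.no_grad():
        logits = mlm(ids).logits[0, mask_idx]
    return torch.log_softmax(logits, dim=-1)

def neighboring_divergence(text_a, text_b, shared_positions):
    """Sum of KL divergences at the positions of overlapping (shared) words."""
    a, b = text_a.split(), text_b.split()
    total = 0.0
    for pa, pb in shared_positions:
        log_p, log_q = masked_distribution(a, pa), masked_distribution(b, pb)
        total += torch.sum(log_p.exp() * (log_p - log_q)).item()
    return total

# Toy usage: "movie was" is the shared span, at positions 1-2 in both texts.
print(neighboring_divergence("the movie was great", "this movie was dull",
                             [(1, 1), (2, 2)]))
```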
- On the Sentence Embeddings from Pre-trained Language Models [78.45172445684126]
In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited.
We find that BERT always induces a non-smooth anisotropic semantic space of sentences, which harms its performance on semantic similarity tasks.
We propose to transform the anisotropic sentence embedding distribution to a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective.
arXiv Detail & Related papers (2020-11-02T13:14:57Z)
- Re-evaluating phoneme frequencies [0.0]
We re-evaluate the distributions claimed to characterize phoneme frequencies.
We find evidence supporting earlier results, but also nuancing them and increasing our understanding of them.
We identify a potential account for why, despite there being an important role for phonetic substance in phonemic change, we could still expect inventories with highly diverse phonetic content to share similar distributions of phoneme frequencies.
arXiv Detail & Related papers (2020-06-09T12:05:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.