The Morphemic Origin of Zipf's Law: A Factorized Combinatorial Framework
- URL: http://arxiv.org/abs/2512.12394v1
- Date: Sat, 13 Dec 2025 16:58:06 GMT
- Title: The Morphemic Origin of Zipf's Law: A Factorized Combinatorial Framework
- Authors: Vladimir Berman
- Abstract summary: We present a simple structure-based model of how words are formed from morphemes. The model explains two major empirical facts: the typical distribution of word lengths and the appearance of Zipf-like rank-frequency curves.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a simple structure-based model of how words are formed from morphemes. The model explains two major empirical facts: the typical distribution of word lengths and the appearance of Zipf-like rank-frequency curves. In contrast to classical explanations based on random text or communication efficiency, our approach uses only the combinatorial organization of prefixes, roots, suffixes, and inflections. In this Morphemic Combinatorial Word Model, a word is created by activating several positional slots. Each slot turns on with a certain probability and selects one morpheme from its inventory. Morphemes are treated as stable building blocks that regularly appear in word formation and have characteristic positions. This mechanism produces realistic word-length patterns with a concentrated middle zone and a thin long tail, closely matching real languages. Simulations with synthetic morpheme inventories also generate rank-frequency curves with Zipf-like exponents around 1.1-1.4, similar to English, Russian, and Romance languages. The key result is that Zipf-like behavior can emerge without meaning, communication pressure, or optimization principles. The internal structure of morphology alone, combined with probabilistic activation of slots, is sufficient to create the robust statistical patterns observed across languages.
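The slot-activation mechanism described in the abstract lends itself to a short simulation. The sketch below is not the authors' code: the slot sizes, activation probabilities, and the skewed within-slot usage weights are assumptions introduced here for illustration, since the abstract does not publish exact settings. It builds synthetic morpheme inventories, samples a corpus, and estimates the slope of the rank-frequency curve in log-log space.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def make_slot(name, p_on, size):
    """Build one positional slot: an activation probability plus a morpheme
    inventory with (assumed) non-uniform usage weights."""
    morphemes = [f"{name}{i}" for i in range(size)]
    weights = rng.lognormal(mean=0.0, sigma=1.0, size=size)  # assumption: skewed usage
    return p_on, morphemes, weights / weights.sum()

# Illustrative slot sizes and activation probabilities (assumptions, not the
# paper's exact settings): prefix, root (always on), suffix, inflection.
slots = [
    make_slot("pre",  0.45, 30),
    make_slot("root", 1.00, 400),
    make_slot("suf",  0.60, 60),
    make_slot("inf",  0.70, 15),
]

def sample_word():
    """Activate each slot with its probability and draw one morpheme from it."""
    parts = []
    for p_on, morphemes, weights in slots:
        if rng.random() < p_on:
            parts.append(morphemes[rng.choice(len(morphemes), p=weights)])
    return "+".join(parts)

# Generate a synthetic corpus and inspect the rank-frequency curve.
corpus = [sample_word() for _ in range(100_000)]
freqs = np.array(sorted(Counter(corpus).values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1)

# Least-squares slope in log-log space over the top ranks ~ Zipf exponent.
top = min(1000, len(freqs))
slope, _ = np.polyfit(np.log(ranks[:top]), np.log(freqs[:top]), 1)
print(f"word types: {len(freqs)}, estimated Zipf exponent: {-slope:.2f}")
```

The estimated exponent depends on the assumed inventory sizes and weights; the point of the sketch is only that a heavy-tailed rank-frequency curve falls out of probabilistic slot activation plus within-slot selection, with no reference to meaning or communication.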
Related papers
- Zipf Distributions from Two-Stage Symbolic Processes: Stability Under Stochastic Lexical Filtering [0.0]
Zipf's law in language lacks a definitive origin, debated across fields. This study explains Zipf-like behavior using geometric mechanisms without linguistic elements.
arXiv Detail & Related papers (2025-11-26T04:59:40Z)
- Random Text, Zipf's Law, Critical Length, and Implications for Large Language Models [0.0]
We study a deliberately simple, fully non-linguistic model of text. A word is defined as a maximal block of non-space symbols.
arXiv Detail & Related papers (2025-11-14T23:05:59Z)
- Vocabulary embeddings organize linguistic structure early in language model training [3.2661767443292646]
Large language models (LLMs) work by manipulating the geometry of input embedding vectors over multiple layers. Here, we ask: how are the input vocabulary representations of language models structured, and how does this structure evolve over training? We run a suite of experiments that correlate the geometric structure of the input embeddings and output embeddings of two open-source models with semantic, syntactic, and frequency-based metrics over the course of training.
arXiv Detail & Related papers (2025-10-08T23:26:22Z)
- Zipfian Whitening [7.927385005964994]
Most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are uniform.
In reality, word frequencies follow a highly non-uniform distribution, known as Zipf's law.
We show that simply performing PCA whitening weighted by the empirical word frequency that follows Zipf's law significantly improves task performance; a brief illustrative sketch of frequency-weighted whitening is given after this list.
arXiv Detail & Related papers (2024-11-01T15:40:19Z)
- Surprise! Uniform Information Density Isn't the Whole Story: Predicting Surprisal Contours in Long-form Discourse [54.08750245737734]
We propose that speakers modulate information rate based on location within a hierarchically-structured model of discourse.
We find that hierarchical predictors are significant predictors of a discourse's information contour and that deeply nested hierarchical predictors are more predictive than shallow ones.
arXiv Detail & Related papers (2024-10-21T14:42:37Z)
- Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation [52.270712965271656]
We propose a new model of contextual word representation, not from a neural perspective, but from a purely syntactic and probabilistic perspective.
We find that the graph of our model resembles transformers, with correspondences between dependencies and self-attention.
Experiments show that our model performs competitively to transformers on small to medium sized datasets.
arXiv Detail & Related papers (2023-11-26T06:56:02Z)
- Decomposing lexical and compositional syntax and semantics with deep language models [82.81964713263483]
The activations of language transformers like GPT2 have been shown to linearly map onto brain activity during speech comprehension.
Here, we propose a taxonomy to factorize the high-dimensional activations of language models into four classes: lexical, compositional, syntactic, and semantic representations.
The results highlight two findings. First, compositional representations recruit a more widespread cortical network than lexical ones, and encompass the bilateral temporal, parietal and prefrontal cortices.
arXiv Detail & Related papers (2021-03-02T10:24:05Z)
- Unnatural Language Inference [48.45003475966808]
We find that state-of-the-art NLI models, such as RoBERTa and BART, are invariant to, and sometimes even perform better on, examples with randomly reordered words.
Our findings call into question the idea that our natural language understanding models, and the tasks used for measuring their progress, genuinely require a human-like understanding of syntax.
arXiv Detail & Related papers (2020-12-30T20:40:48Z)
- The empirical structure of word frequency distributions [0.0]
I show that first names form natural communicative distributions in most languages.
I then show this pattern of findings replicates in communicative distributions of English nouns and verbs.
arXiv Detail & Related papers (2020-01-09T20:52:38Z)
- Multiplex Word Embeddings for Selectional Preference Acquisition [70.33531759861111]
We propose a multiplex word embedding model, which can be easily extended according to various relations among words.
Our model can effectively distinguish words with respect to different relations without introducing unnecessary sparseness.
arXiv Detail & Related papers (2020-01-09T04:47:14Z)
- A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)
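For the Zipfian Whitening entry above, the following is a minimal illustration of frequency-weighted PCA whitening. It is a sketch under assumed inputs, not the authors' released code: `E` is assumed to be a vocabulary-by-dimension embedding matrix and `counts` a vector of corpus word frequencies.

```python
import numpy as np

def zipfian_whiten(E, counts, eps=1e-8):
    """Whiten word embeddings using a frequency-weighted mean and covariance.

    E      : (V, d) array of word vectors (assumed input).
    counts : (V,) array of corpus word frequencies (assumed input).
    """
    p = counts / counts.sum()                 # empirical word probabilities
    mu = p @ E                                # frequency-weighted mean
    X = E - mu
    cov = (X * p[:, None]).T @ X              # frequency-weighted covariance (d, d)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs / np.sqrt(eigvals + eps)      # PCA whitening transform
    return X @ W

# Toy usage with random data (shape-checking only).
rng = np.random.default_rng(0)
E = rng.normal(size=(5000, 50))
counts = 1.0 / np.arange(1, 5001)             # Zipf-like frequencies
E_white = zipfian_whiten(E, counts)
```

The only departure from standard whitening is that the mean and covariance are taken under the empirical (Zipf-distributed) word measure rather than a uniform one, which is the contrast that entry draws.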
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.