Random Text, Zipf's Law, Critical Length, and Implications for Large Language Models
- URL: http://arxiv.org/abs/2511.17575v1
- Date: Fri, 14 Nov 2025 23:05:59 GMT
- Title: Random Text, Zipf's Law, Critical Length, and Implications for Large Language Models
- Authors: Vladimir Berman
- Abstract summary: We study a deliberately simple, fully non-linguistic model of text. A word is defined as a maximal block of non-space symbols.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a deliberately simple, fully non-linguistic model of text: a sequence of independent draws from a finite alphabet of letters plus a single space symbol. A word is defined as a maximal block of non-space symbols. Within this symbol-level framework, which assumes no morphology, syntax, or semantics, we derive several structural results. First, word lengths follow a geometric distribution governed solely by the probability of the space symbol. Second, the expected number of words of a given length, and the expected number of distinct words of that length, admit closed-form expressions based on a coupon-collector argument. This yields a critical word length k* at which word types transition from appearing many times on average to appearing at most once. Third, combining the exponential growth of the number of possible strings of length k with the exponential decay of the probability of each string, we obtain a Zipf-type rank-frequency law p(r) proportional to r^{-alpha}, with an exponent determined explicitly by the alphabet size and the space probability. Our contribution is twofold. Mathematically, we give a unified derivation linking word lengths, vocabulary growth, critical length, and rank-frequency structure in a single explicit model. Conceptually, we argue that this provides a structurally grounded null model for both natural-language word statistics and token statistics in large language models. The results show that Zipf-like patterns can arise purely from combinatorics and segmentation, without optimization principles or linguistic organization, and help clarify which phenomena require deeper explanation beyond random-text structure.
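To make the construction concrete, below is a minimal simulation sketch of the random-text null model described in the abstract: i.i.d. draws of letters plus a single space symbol, with words read off as maximal non-space blocks. The alphabet size M, space probability q, and corpus length are illustrative assumptions, not values taken from the paper; the script checks the geometric word-length law q(1-q)^(k-1) empirically and prints a few points of the rank-frequency curve, which should already show the heavy-tailed, Zipf-like decay the abstract describes.

```python
import random
from collections import Counter

# Illustrative parameters (assumptions, not taken from the paper):
# alphabet size M, space probability q, total number of i.i.d. symbols.
M, q, n_symbols = 26, 0.18, 1_000_000
alphabet = [chr(ord("a") + i) for i in range(M)]

random.seed(0)
# Each symbol is a space with probability q, otherwise a uniformly chosen letter.
symbols = random.choices(alphabet + [" "],
                         weights=[(1 - q) / M] * M + [q],
                         k=n_symbols)
# A "word" is a maximal block of non-space symbols.
words = "".join(symbols).split()

# 1) Word lengths should be approximately geometric: P(len = k) = q * (1-q)^(k-1).
length_counts = Counter(len(w) for w in words)
print("word-length distribution (empirical vs geometric):")
for k in range(1, 7):
    emp = length_counts[k] / len(words)
    geo = q * (1 - q) ** (k - 1)
    print(f"  k={k}: {emp:.4f} vs {geo:.4f}")

# 2) Rank-frequency curve: heavy-tailed, Zipf-like decay from combinatorics alone.
freqs = sorted(Counter(words).values(), reverse=True)
print("rank-frequency samples:")
for r in (1, 10, 100, 1000, 10000):
    if r <= len(freqs):
        print(f"  rank {r}: relative frequency {freqs[r - 1] / len(words):.6f}")
```

On a corpus of this size, the low ranks fall roughly along a straight line on log-log axes; estimating that slope and comparing it with the closed-form exponent derived in the paper would be the natural next step.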
Related papers
- Large language models and the entropy of English [0.0]
We use large language models to uncover long-ranged structure in English texts from a variety of sources. The conditional entropy or code length in many cases continues to decrease with context length at least to $N \sim 10^4$ characters. We observe different dynamics at long and short context lengths, suggesting that long-ranged structure is learned only gradually.
arXiv Detail & Related papers (2025-12-31T16:54:44Z) - Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space [56.37266873329401]
Large Language Models (LLMs) apply uniform computation to all tokens, despite language exhibiting highly non-uniform information density. We propose Dynamic Large Concept Models (DLCM), a hierarchical language modeling framework that learns semantic boundaries from latent representations and shifts from tokens to a compressed concept space where reasoning is more efficient.
arXiv Detail & Related papers (2025-12-31T04:19:33Z) - The Morphemic Origin of Zipf's Law: A Factorized Combinatorial Framework [0.0]
We present a simple structure-based model of how words are formed from morphemes. The model explains two major empirical facts: the typical distribution of word lengths and the appearance of Zipf-like rank-frequency curves.
arXiv Detail & Related papers (2025-12-13T16:58:06Z) - Zipf Distributions from Two-Stage Symbolic Processes: Stability Under Stochastic Lexical Filtering [0.0]
Zipf's law in language lacks a definitive origin and remains debated across fields. This study explains Zipf-like behavior using geometric mechanisms without linguistic elements.
arXiv Detail & Related papers (2025-11-26T04:59:40Z) - Causal Estimation of Tokenisation Bias [58.20086589761273]
We quantify the effect of including or excluding a subword in a tokeniser's vocabulary on the probability a trained model assigns to the corresponding characters. We find that tokenisation consistently affects models' outputs across scales, vocabularies, and tokenisers. Notably, a subword's presence in a small model's vocabulary may increase its characters' probability by up to 17 times.
arXiv Detail & Related papers (2025-06-03T17:59:47Z) - Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length? [72.70486097967124]
We formalize a framework using deterministic finite automata (DFAs). We show that there exists an optimal number of reasoning tokens such that the probability of producing a correct solution is maximized. We then demonstrate an implication of these findings: being able to predict the optimal number of reasoning tokens for new problems and filtering out non-optimal-length answers results in consistent accuracy improvements.
arXiv Detail & Related papers (2025-04-02T17:45:58Z) - Leading Whitespaces of Language Models' Subword Vocabulary Pose a Confound for Calculating Word Probabilities [15.073507986272027]
We argue that there is a confound posed by the most common method of aggregating subword probabilities into word probabilities.
This is because tokens in the subword vocabularies of most language models have leading whitespaces.
We present a simple decoding technique that reassigns the probability of the trailing whitespace to that of the current word.
arXiv Detail & Related papers (2024-06-16T08:44:56Z) - Lexinvariant Language Models [84.2829117441298]
Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM).
We study lexinvariant language models that are invariant to lexical symbols and therefore do not need fixed token embeddings in practice.
We show that a lexinvariant LM can attain perplexity comparable to that of a standard language model, given a sufficiently long context.
arXiv Detail & Related papers (2023-05-24T19:10:46Z) - Linear-Time Modeling of Linguistic Structure: An Order-Theoretic Perspective [97.57162770792182]
Tasks that model the relation between pairs of tokens in a string are a vital part of understanding natural language.
We show that these exhaustive comparisons can be avoided, and, moreover, the complexity can be reduced to linear by casting the relation between tokens as a partial order over the string.
Our method predicts real numbers for each token in a string in parallel and sorts the tokens accordingly, resulting in total orders of the tokens in the string.
arXiv Detail & Related papers (2023-05-24T11:47:35Z) - A Measure-Theoretic Characterization of Tight Language Models [105.16477132329416]
In some pathological cases, probability mass can "leak" onto the set of infinite sequences.
This paper offers a measure-theoretic treatment of language modeling.
We prove that many popular language model families are in fact tight, meaning that they will not leak in this sense.
arXiv Detail & Related papers (2022-12-20T18:17:11Z) - The distribution of syntactic dependency distances [0.13812010983144798]
We contribute to the characterization of the actual distribution of syntactic dependency distances. We propose a new model with two exponential regimes in which the probability decay is allowed to change after a break-point. We find that a two-regime model is the most likely one in all 20 languages we considered, independently of sentence length and annotation style.
arXiv Detail & Related papers (2022-11-26T17:31:25Z) - Language Models Explain Word Reading Times Better Than Empirical Predictability [20.38397241720963]
The traditional approach in cognitive reading research assumes that word predictability from sentence context is best captured by cloze completion probability.
Probabilistic language models provide deeper explanations for syntactic and semantic effects than CCP.
N-gram and RNN probabilities of the present word more consistently predicted reading performance compared with topic models or CCP.
arXiv Detail & Related papers (2022-02-02T16:38:43Z)