Zipf Distributions from Two-Stage Symbolic Processes: Stability Under Stochastic Lexical Filtering
- URL: http://arxiv.org/abs/2511.21060v1
- Date: Wed, 26 Nov 2025 04:59:40 GMT
- Title: Zipf Distributions from Two-Stage Symbolic Processes: Stability Under Stochastic Lexical Filtering
- Authors: Vladimir Berman
- Abstract summary: Zipf's law in language lacks a definitive origin, debated across fields. This study explains Zipf-like behavior using geometric mechanisms without linguistic elements.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zipf's law in language lacks a definitive origin, debated across fields. This study explains Zipf-like behavior using geometric mechanisms without linguistic elements. The Full Combinatorial Word Model (FCWM) forms words from a finite alphabet, generating a geometric distribution of word lengths. Interacting exponential forces yield a power-law rank-frequency curve, determined by alphabet size and blank symbol probability. Simulations support predictions, matching English, Russian, and mixed-genre data. The symbolic model suggests Zipf-type laws arise from geometric constraints, not communicative efficiency.
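The claimed mechanism lends itself to a quick numerical check. Below is a minimal sketch, not the authors' code, with illustrative parameters of our own (alphabet size V = 26, blank probability 0.18): type i.i.d. symbols, split on blanks, and inspect the rank-frequency curve.

```python
# Minimal sketch (not the authors' code): emit i.i.d. symbols from a finite
# alphabet plus a blank separator, then inspect the rank-frequency curve.
# V and p_blank are illustrative choices, not the paper's fitted values.
import math
import random
from collections import Counter

random.seed(0)
V, p_blank, n_symbols = 26, 0.18, 1_000_000
alphabet = "abcdefghijklmnopqrstuvwxyz"

stream = "".join(
    " " if random.random() < p_blank else random.choice(alphabet)
    for _ in range(n_symbols)
)
counts = Counter(stream.split())  # a word is a maximal non-blank block

freqs = sorted(counts.values(), reverse=True)
r1, r2 = 10, 1000  # estimate the slope over an intermediate rank window
slope = (math.log(freqs[r2 - 1]) - math.log(freqs[r1 - 1])) / math.log(r2 / r1)
print(f"rank-frequency exponent ~ {slope:.2f}")
# Classical monkey-typing analysis predicts an exponent of
# -ln(V / (1 - p_blank)) / ln V, about -1.06 for these settings.
```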
Related papers
- Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space [56.37266873329401]
Large Language Models (LLMs) apply uniform computation to all tokens, despite language exhibiting highly non-uniform information density. We propose Dynamic Large Concept Models (DLCM), a hierarchical language modeling framework that learns semantic boundaries from latent representations and shifts from tokens to a compressed concept space where reasoning is more efficient.
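As a rough illustration of the token-to-concept shift (this is not the DLCM implementation; the boundary indices below are placeholders for what the model would learn):

```python
# Toy illustration (hypothetical, not DLCM's code): mean-pool token vectors
# between predicted segment boundaries to get a shorter concept sequence.
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(12, 8))    # 12 token vectors, dim 8
boundaries = [0, 3, 4, 9, 12]        # stand-in for learned semantic boundaries

concepts = np.stack([
    tokens[a:b].mean(axis=0)         # one pooled concept vector per segment
    for a, b in zip(boundaries[:-1], boundaries[1:])
])
print(concepts.shape)                # (4, 8): reasoning over 4 concepts, not 12 tokens
```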
arXiv Detail & Related papers (2025-12-31T04:19:33Z)
- The Morphemic Origin of Zipf's Law: A Factorized Combinatorial Framework [0.0]
We present a simple structure-based model of how words are formed from morphemes. The model explains two major empirical facts: the typical distribution of word lengths and the appearance of Zipf-like rank-frequency curves.
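A toy rendering of that idea, with a hypothetical morpheme inventory and a stopping probability of our own choosing (not the paper's factorization):

```python
# Toy morpheme model (illustrative parameters, not the paper's): draw a
# geometric number of morphemes per word, then concatenate them.
import random
from collections import Counter

random.seed(1)
morphemes = [f"m{i}" for i in range(50)]   # hypothetical morpheme inventory
p_stop = 0.4                               # geometric stopping probability

def sample_word():
    parts = [random.choice(morphemes)]
    while random.random() > p_stop:
        parts.append(random.choice(morphemes))
    return "-".join(parts)

counts = Counter(sample_word() for _ in range(200_000))
print(counts.most_common(5))  # heavy-tailed, Zipf-like frequencies emerge
```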
arXiv Detail & Related papers (2025-12-13T16:58:06Z)
- On Counts and Densities of Homogeneous Bent Functions: An Evolutionary Approach [60.00535100780336]
This paper examines the use of Evolutionary Algorithms (EAs) to evolve homogeneous bent Boolean functions. We introduce the notion of the density of homogeneous bent functions, which guides an algorithmic design that finds quadratic and cubic bent functions in different numbers of variables.
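The standard bentness test underlying such searches is the Walsh-Hadamard spectrum: a function on n variables (n even) is bent iff every spectral value has absolute value 2^(n/2). A sketch of that check (not the paper's code):

```python
# Bentness check via the fast Walsh-Hadamard transform (sketch, not the
# paper's code); this is the usual fitness ingredient in EA searches.
import numpy as np

def walsh_spectrum(truth_table):
    """Fast Walsh-Hadamard transform of (-1)^f over F_2^n."""
    w = np.where(np.asarray(truth_table) == 0, 1, -1).astype(np.int64)
    h = 1
    while h < len(w):
        for i in range(0, len(w), 2 * h):
            a, b = w[i:i + h].copy(), w[i + h:i + 2 * h].copy()
            w[i:i + h], w[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return w

def is_bent(truth_table):
    n = int(np.log2(len(truth_table)))
    return n % 2 == 0 and bool(np.all(np.abs(walsh_spectrum(truth_table)) == 2 ** (n // 2)))

# f(x1..x4) = x1*x2 XOR x3*x4: a classic homogeneous quadratic bent function.
tt = [(x >> 3 & 1) & (x >> 2 & 1) ^ (x >> 1 & 1) & (x & 1) for x in range(16)]
print(is_bent(tt))  # True
```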
arXiv Detail & Related papers (2025-11-16T15:33:40Z)
- Random Text, Zipf's Law, Critical Length, and Implications for Large Language Models [0.0]
We study a deliberately simple, fully non-linguistic model of text. A word is defined as a maximal block of non-space symbols.
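That word definition is easy to state in code; here is a small sketch with a toy two-letter alphabet (our parameters, not the paper's), showing the geometric fall-off of block lengths:

```python
# Sketch of the model's word definition: a word is a maximal block of
# non-space symbols, extracted here with a regular expression.
import random
import re
from collections import Counter

random.seed(2)
text = "".join(random.choice("ab ") for _ in range(100_000))  # toy alphabet
lengths = Counter(len(w) for w in re.findall(r"[^ ]+", text))
print(sorted(lengths.items())[:6])  # counts fall off geometrically with length
```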
arXiv Detail & Related papers (2025-11-14T23:05:59Z)
- Pre-trained Models Perform the Best When Token Distributions Follow Zipf's Law [15.78540876600952]
We propose a method for determining the vocabulary size by analyzing token frequency distributions through Zipf's law. We show that downstream task performance correlates with how closely token distributions follow power-law behavior, and that aligning with Zipfian scaling improves both model efficiency and effectiveness.
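One plausible reading of the diagnostic, as a sketch (the `zipf_fit` helper is ours, not from the paper): fit a line to the log-log rank-frequency curve and compare candidate tokenizations by exponent and goodness of fit.

```python
# Sketch (ours): fit a power law to token frequencies; an exponent near -1
# with high R^2 signals Zipf-like behavior for a candidate vocabulary.
import numpy as np
from collections import Counter

def zipf_fit(tokens):
    """Return (exponent, R^2) of a log-log linear fit to rank vs. frequency."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    x, y = np.log(np.arange(1, len(freqs) + 1)), np.log(freqs)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return slope, 1 - resid.var() / y.var()

# Toy corpus whose frequencies follow f(r) = 600 / r exactly.
tokens = ("a " * 600 + "b " * 300 + "c " * 200 + "d " * 150 + "e " * 120).split()
print(zipf_fit(tokens))  # roughly (-1.0, 1.0)
```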
arXiv Detail & Related papers (2025-07-30T10:16:23Z)
- On the class of coding optimality of human languages and the origins of Zipf's law [0.0]
We present a new class of optimality for coding systems. Within that class, Zipf's law, the size-rank law and the size-probability law form a group-like structure. All languages showing sufficient agreement with Zipf's law are potential members of the class.
arXiv Detail & Related papers (2025-05-26T14:05:45Z)
- Zipfian Whitening [7.927385005964994]
Most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are uniform.
In reality, word frequencies follow a highly non-uniform distribution, known as Zipf's law.
We show that simply performing PCA whitening weighted by empirical word frequencies, which follow Zipf's law, significantly improves task performance.
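A minimal sketch of the idea, assuming row-wise embeddings and empirical unigram probabilities (ours, not the released code): weight the mean and covariance by word frequency before whitening.

```python
# Frequency-weighted ("Zipfian") whitening sketch, not the paper's code:
# compute a frequency-weighted mean and covariance, then PCA-whiten.
import numpy as np

def zipfian_whiten(E, probs, eps=1e-8):
    """E: (vocab, dim) embeddings; probs: empirical frequencies summing to 1."""
    mu = probs @ E                            # frequency-weighted mean
    X = E - mu
    cov = (X * probs[:, None]).T @ X          # frequency-weighted covariance
    vals, vecs = np.linalg.eigh(cov)
    W = vecs / np.sqrt(vals + eps)            # whitening transform
    return X @ W

rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 16))
ranks = np.arange(1, 1001)
probs = (1 / ranks) / (1 / ranks).sum()       # Zipfian unigram frequencies
E_white = zipfian_whiten(E, probs)
# The weighted covariance of E_white is (approximately) the identity.
```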
arXiv Detail & Related papers (2024-11-01T15:40:19Z)
- Lexinvariant Language Models [84.2829117441298]
Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM).
We study lexinvariant language models that are invariant to lexical symbols and therefore do not need fixed token embeddings in practice.
We show that a lexinvariant LM can attain perplexity comparable to that of a standard language model, given a sufficiently long context.
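A simplified sketch of the setup (ours, not the paper's code): each sequence draws fresh random vectors for its symbol types, so only the repetition structure of symbols carries information.

```python
# Lexinvariant embedding sketch (simplified): every sequence gets fresh
# random embeddings per symbol type, never a fixed global embedding table.
import numpy as np

rng = np.random.default_rng(0)
dim = 16

def embed_lexinvariant(token_ids):
    """Assign a fresh random vector to each distinct symbol in this sequence."""
    table = {t: rng.normal(size=dim) / np.sqrt(dim) for t in set(token_ids)}
    return np.stack([table[t] for t in token_ids])

seq = [3, 7, 3, 9, 7, 3]
X1, X2 = embed_lexinvariant(seq), embed_lexinvariant(seq)
# Same repetition structure within a sequence, different vectors each call:
print(np.allclose(X1[0], X1[2]), np.allclose(X1[0], X2[0]))  # True False
```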
arXiv Detail & Related papers (2023-05-24T19:10:46Z)
- Truncation Sampling as Language Model Desmoothing [115.28983143361681]
Long samples of text from neural language models can be of poor quality.
Truncation sampling algorithms set some words' probabilities to zero at each step.
We introduce $\eta$-sampling, which truncates words below an entropy-dependent probability threshold.
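One common statement of the rule sets the threshold $\eta = \min(\epsilon, \sqrt{\epsilon}\, e^{-h})$, with $h$ the entropy of the next-word distribution; the sketch below follows that reading, with an illustrative $\epsilon$ (the paper treats it as a tuned hyperparameter).

```python
# Eta-sampling sketch: truncate tokens whose probability falls below an
# entropy-dependent threshold, renormalize, then sample.
# epsilon = 0.002 is an example value, not the paper's recommended setting.
import numpy as np

def eta_sample(probs, epsilon=0.002, rng=np.random.default_rng(0)):
    probs = np.asarray(probs, dtype=float)
    nz = probs[probs > 0]
    h = -np.sum(nz * np.log(nz))                       # entropy in nats
    eta = min(epsilon, np.sqrt(epsilon) * np.exp(-h))  # entropy-dependent cutoff
    keep = probs >= eta
    keep[np.argmax(probs)] = True                      # never truncate everything
    truncated = np.where(keep, probs, 0.0)
    return rng.choice(len(probs), p=truncated / truncated.sum())

p = np.array([0.5, 0.3, 0.15, 0.04, 0.009, 0.001])
print(eta_sample(p))  # the 0.001 tail token is zeroed out before sampling
```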
arXiv Detail & Related papers (2022-10-27T05:52:35Z)
- Generalized Funnelling: Ensemble Learning and Heterogeneous Document Embeddings for Cross-Lingual Text Classification [78.83284164605473]
Funnelling (Fun) is a recently proposed method for cross-lingual text classification.
We describe Generalized Funnelling (gFun) as a generalization of Fun.
We show that gFun substantially improves over Fun and over state-of-the-art baselines.
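The two-tier funnelling scheme can be sketched as follows (a simplification on synthetic data, not the gFun library; real funnelling calibrates the first tier with cross-validation):

```python
# Funnelling sketch: per-language first-tier classifiers map documents to
# class posteriors; one shared meta-classifier is trained on those posteriors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
languages = {"en": rng.normal(size=(200, 50)), "it": rng.normal(size=(200, 40))}
labels = {lang: rng.integers(0, 3, size=200) for lang in languages}

# Tier 1: one classifier per language (feature spaces may differ in size).
tier1 = {lang: LogisticRegression(max_iter=1000).fit(X, labels[lang])
         for lang, X in languages.items()}

# Posteriors form a shared, language-independent feature space.
Z = np.vstack([tier1[lang].predict_proba(X) for lang, X in languages.items()])
y = np.concatenate([labels[lang] for lang in languages])

# Tier 2: a single meta-classifier trained across all languages at once.
meta = LogisticRegression(max_iter=1000).fit(Z, y)
print(meta.score(Z, y))
```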
arXiv Detail & Related papers (2021-09-17T23:33:04Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that models trained with merged tokens produce topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
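A sketch of the pipeline using gensim as one possible toolchain (the toy corpus and the deliberately low thresholds are ours, not the paper's settings):

```python
# Collocation tokenization before LDA (sketch): learn frequent bigrams,
# merge them into single tokens, then train the topic model on merged docs.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Phrases

docs = [["new", "york", "subway", "delays"],
        ["new", "york", "city", "budget"],
        ["subway", "budget", "cuts"]] * 20

# Thresholds are set low on purpose for this tiny toy corpus.
bigram = Phrases(docs, min_count=1, threshold=0.1)
merged = [bigram[d] for d in docs]            # e.g. "new york" -> "new_york"

vocab = Dictionary(merged)
corpus = [vocab.doc2bow(d) for d in merged]
lda = LdaModel(corpus, num_topics=2, id2word=vocab, random_state=0)
print(lda.show_topics(num_words=4))           # topic keys include merged tokens
```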
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- 3D Correspondence Grouping with Compatibility Features [51.869670613445685]
We present a simple yet effective method for 3D correspondence grouping.
The objective is to accurately classify initial correspondences obtained by matching local geometric descriptors into inliers and outliers.
We propose a novel representation for 3D correspondences, dubbed compatibility feature (CF), to describe the consistencies within inliers and inconsistencies within outliers.
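A simplified distance-consistency version of that idea (ours; the paper's CF representation is richer): correspondences that are inliers under one rigid motion preserve pairwise distances between the two point sets, so mutual compatibility counts separate them from outliers.

```python
# Distance-consistency sketch of compatibility scoring (not the paper's CF):
# inlier correspondence pairs preserve pairwise distances under rigid motion.
import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(size=(30, 3))                   # source points
R = np.linalg.qr(rng.normal(size=(3, 3)))[0]   # random orthogonal transform
Q = P @ R.T + np.array([1.0, 2.0, 0.5])        # target points (true motion)
Q[20:] = rng.normal(size=(10, 3))              # last 10 correspondences: outliers

dP = np.linalg.norm(P[:, None] - P[None], axis=-1)
dQ = np.linalg.norm(Q[:, None] - Q[None], axis=-1)
compatible = np.abs(dP - dQ) < 0.05            # pairwise compatibility matrix
score = compatible.sum(axis=1)                 # per-correspondence feature
print(score[:20].min(), score[20:].max())      # inliers score high, outliers low
```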
arXiv Detail & Related papers (2020-07-21T02:39:48Z)
- Rethinking Positional Encoding in Language Pre-training [111.2320727291926]
We show that in absolute positional encoding, the addition operation applied to positional embeddings and word embeddings brings mixed correlations.
We propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE).
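A sketch of the untied formulation as we understand it (illustrative dimensions, not the released code): score word-word and position-position correlations with separate projections and sum them, instead of adding the embeddings before projecting.

```python
# Untied positional attention sketch in the spirit of TUPE (not the released
# code): no mixed word-position terms appear in the attention logits.
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 16
X = rng.normal(size=(n, d))                        # word embeddings
P = rng.normal(size=(n, d))                        # positional embeddings
Wq, Wk, Uq, Uk = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))

word_term = (X @ Wq) @ (X @ Wk).T                  # content-content correlation
pos_term = (P @ Uq) @ (P @ Uk).T                   # position-position correlation
logits = (word_term + pos_term) / np.sqrt(2 * d)   # separate, then summed
logits -= logits.max(axis=1, keepdims=True)        # numerical stability
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(attn.shape)                                  # (8, 8) attention matrix
```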
arXiv Detail & Related papers (2020-06-28T13:11:02Z)
- The empirical structure of word frequency distributions [0.0]
I show that first names form natural communicative distributions in most languages.
I then show this pattern of findings replicates in communicative distributions of English nouns and verbs.
arXiv Detail & Related papers (2020-01-09T20:52:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.