Evolution of the lexicon: a probabilistic point of view
- URL: http://arxiv.org/abs/2510.22220v1
- Date: Sat, 25 Oct 2025 08:48:15 GMT
- Title: Evolution of the lexicon: a probabilistic point of view
- Authors: Maurizio Serva
- Abstract summary: The Swadesh approach for determining the temporal separation between two languages relies on the process of word replacement. The basic assumptions of the Swadesh approach are often unrealistic due to various contamination phenomena and misjudgments. We show, from a purely probabilistic perspective, that taking a second random process (the gradual lexical modification of words) into account significantly increases the precision in determining the temporal separation between two languages.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Swadesh approach for determining the temporal separation between two languages relies on the stochastic process of word replacement (whereby a completely new word emerges to represent a given concept). It is well known that the basic assumptions of the Swadesh approach are often unrealistic due to various contamination phenomena and misjudgments (horizontal transfers, variations of the replacement rate over time and space, incorrect assessments of cognacy relationships, presence of synonyms, and so on). All of this means that the results cannot be completely correct. More importantly, even in the unrealistic case in which all basic assumptions are satisfied, simple mathematics places limits on the accuracy of estimating the temporal separation between two languages. These limits, which are purely probabilistic in nature and which are often neglected in lexicostatistical studies, are analyzed in detail in this article. Furthermore, in this work we highlight that the evolution of a language's lexicon is also driven by another stochastic process: the gradual lexical modification of words. We show that this process likewise makes a major contribution to the reshaping of the vocabulary of languages over the centuries, and we also show, from a purely probabilistic perspective, that taking this second random process into account significantly increases the precision in determining the temporal separation between two languages.
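The probabilistic limits discussed in the abstract can be illustrated with a minimal sketch of the classic Swadesh calculation. All parameter values here (the conventional replacement rate of 0.14 per word per millennium, a 200-word list) are illustrative assumptions, not the paper's refined method:

```python
import math

def separation_time(shared_fraction, replacement_rate=0.14):
    """Swadesh-style estimate: a word on the list survives replacement
    over t millennia with probability exp(-r*t), so two independently
    evolving descendant languages share a fraction c = exp(-2*r*t) of
    cognates.  Solving for t gives the classic glottochronology formula."""
    return -math.log(shared_fraction) / (2 * replacement_rate)

def time_std_error(shared_fraction, n_words, replacement_rate=0.14):
    """Purely probabilistic limit on precision: the observed shared
    fraction is a binomial proportion over n_words items, so its
    standard error propagates (delta method, dt/dc = -1/(2*r*c))
    into the time estimate."""
    c = shared_fraction
    se_c = math.sqrt(c * (1 - c) / n_words)
    return se_c / (2 * replacement_rate * c)
```

For example, a 200-word list with 70% shared cognates yields an estimated separation of roughly 1.27 millennia, with a standard error of about 0.17 millennia; this sampling noise is irreducible no matter how carefully cognacy is judged.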
Related papers
- When Meanings Meet: Investigating the Emergence and Quality of Shared Concept Spaces during Multilingual Language Model Training [57.230355403478995]
We investigate the development of language-agnostic concept spaces during pretraining of EuroLLM. We find that shared concept spaces emerge early and continue to refine, but that alignment with them is language-dependent. In contrast to prior work, our fine-grained manual analysis reveals that some apparent gains in translation quality reflect shifts in behavior.
arXiv Detail & Related papers (2026-01-30T11:23:01Z) - A quantum semantic framework for natural language processing [0.0]
We argue that semantic degeneracy imposes fundamental limitations on modern NLP systems, because they operate within natural language itself. We show that as an expression's complexity grows, the amount of contextual information required to reliably resolve its ambiguity explodes. We argue that meaning is dynamically actualized through an observer-dependent interpretive act, a process whose non-deterministic nature is most appropriately described by a non-classical, quantum-like logic.
arXiv Detail & Related papers (2025-06-11T18:00:30Z) - Robustness of the Random Language Model [0.0]
The model suggests a simple picture of first language learning as a type of annealing in the vast space of potential languages.
It implies a single continuous transition to grammatical syntax, at which the symmetry among potential words and categories is spontaneously broken.
Results are discussed in light of the theory of first-language acquisition in linguistics and of recent successes in machine learning.
arXiv Detail & Related papers (2023-09-26T13:14:35Z) - Probabilistic Method of Measuring Linguistic Productivity [0.0]
I propose a new way of measuring linguistic productivity that objectively assesses the ability of an affix to be used to coin new complex words.
Token frequency does not dominate the productivity measure but naturally influences the sampling of bases.
A corpus-based approach and randomised design assure that true neologisms and words coined long ago have equal chances to be selected.
arXiv Detail & Related papers (2023-08-24T08:36:28Z) - Reliable Detection and Quantification of Selective Forces in Language Change [3.55026004901472]
We apply a recently-introduced method to corpus data to quantify the strength of selection in specific instances of historical language change.
We show that this method is more reliable and interpretable than similar methods that have previously been applied.
arXiv Detail & Related papers (2023-05-25T10:20:15Z) - Cross-Linguistic Syntactic Difference in Multilingual BERT: How Good is It and How Does It Affect Transfer? [50.48082721476612]
Multilingual BERT (mBERT) has demonstrated considerable cross-lingual syntactic ability.
We investigate the distributions of grammatical relations induced from mBERT in the context of 24 typologically different languages.
arXiv Detail & Related papers (2022-12-21T09:44:08Z) - The distribution of syntactic dependency distances [0.13812010983144798]
We contribute to the characterization of the actual distribution of syntactic dependency distances. We propose a new model with two exponential regimes in which the probability decay is allowed to change after a break-point. We find that a two-regime model is the most likely one in all 20 languages we considered, independently of sentence length and annotation style.
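The break-point model described in this summary can be sketched as follows. The parameter names and values are illustrative assumptions; the paper fits its own decay rates and break-point per language:

```python
import math

def two_regime_pmf(d_max, a1, a2, breakpoint):
    """Two-exponential decay over dependency distances d = 1..d_max:
    slope a1 up to the break-point, slope a2 beyond it, with the two
    pieces matched at the break-point so the function is continuous.
    Returns a normalised probability mass function as a list."""
    weights = []
    for d in range(1, d_max + 1):
        if d <= breakpoint:
            w = math.exp(-a1 * d)
        else:
            # continue the decay from the value reached at the break-point
            w = math.exp(-a1 * breakpoint) * math.exp(-a2 * (d - breakpoint))
        weights.append(w)
    z = sum(weights)
    return [w / z for w in weights]
```

With a steep first regime (a1 > a2), short dependencies dominate while the tail of long dependencies decays more slowly after the break-point.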
arXiv Detail & Related papers (2022-11-26T17:31:25Z) - Contextualized language models for semantic change detection: lessons learned [4.436724861363513]
We present a qualitative analysis of the outputs of contextualized embedding-based methods for detecting diachronic semantic change.
Our findings show that contextualized methods can often predict high change scores for words which are not undergoing any real diachronic semantic shift.
Our conclusion is that pre-trained contextualized language models are prone to confound changes in lexicographic senses and changes in contextual variance.
arXiv Detail & Related papers (2022-08-31T23:35:24Z) - A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes. We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z) - Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
arXiv Detail & Related papers (2021-01-30T18:59:43Z) - Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
arXiv Detail & Related papers (2020-10-05T17:19:10Z) - Where New Words Are Born: Distributional Semantic Analysis of Neologisms and Their Semantic Neighborhoods [51.34667808471513]
We investigate the importance of two factors, semantic sparsity and frequency growth rates of semantic neighbors, formalized in the distributional semantics paradigm.
We show that both factors are predictive of word emergence, although we find more support for the latter hypothesis.
arXiv Detail & Related papers (2020-01-21T19:09:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.