The optimality of word lengths. Theoretical foundations and an empirical study
- URL: http://arxiv.org/abs/2208.10384v5
- Date: Wed, 5 Apr 2023 09:52:59 GMT
- Title: The optimality of word lengths. Theoretical foundations and an empirical study
- Authors: Sonia Petrini, Antoni Casas-i-Muñoz, Jordi Cluet-i-Martinell, Mengxue Wang, Christian Bentz and Ramon Ferrer-i-Cancho
- Abstract summary: Zipf's law of abbreviation has been viewed as a manifestation of compression.
We quantify for the first time the degree of optimality of word lengths in languages.
In general, spoken word durations are more optimized than written word lengths in characters.
- Score: 0.7682551949752529
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zipf's law of abbreviation, namely the tendency of more frequent words to be
shorter, has been viewed as a manifestation of compression, i.e. the
minimization of the length of forms -- a universal principle of natural
communication. Although the claim that languages are optimized has become
trendy, attempts to measure the degree of optimization of languages have been
rather scarce. Here we present two optimality scores that are dually normalized,
namely, they are normalized with respect to both the minimum and the random
baseline. We analyze the theoretical and statistical pros and cons of these and
other scores. Harnessing the best score, we quantify for the first time the
degree of optimality of word lengths in languages. This indicates that
languages are optimized to 62 or 67 percent on average (depending on the
source) when word lengths are measured in characters, and to 65 percent on
average when word lengths are measured in time. In general, spoken word
durations are more optimized than written word lengths in characters. Our work
paves the way to measure the degree of optimality of the vocalizations or
gestures of other species, and to compare them against written, spoken, or
signed human languages.
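
The abstract describes optimality scores normalized against both a minimum baseline and a random baseline. As a rough illustration only (not the authors' exact definitions or notation), a dually normalized score of this kind can take the form (L_rand − L_obs) / (L_rand − L_min). The sketch below assumes the observed value is the frequency-weighted mean word length, the random baseline is the expected mean length when the frequency-length pairing is shuffled (the unweighted mean length of word types), and the minimum baseline pairs the most frequent words with the shortest available lengths; all names and the toy data are illustrative.

```python
import numpy as np

def optimality_score(freqs, lengths):
    """Dually normalized optimality of word lengths (illustrative sketch).

    freqs   -- word-type frequencies
    lengths -- word-type lengths (characters or durations), aligned with freqs
    Returns 0 at the random baseline and 1 at the minimum baseline.
    """
    freqs = np.asarray(freqs, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    p = freqs / freqs.sum()                       # relative frequencies

    l_obs = np.dot(p, lengths)                    # observed mean length
    l_rand = lengths.mean()                       # expected mean length under a
                                                  # random frequency-length pairing
    # minimum baseline: shortest lengths assigned to the most frequent words
    l_min = np.dot(np.sort(p)[::-1], np.sort(lengths))

    return (l_rand - l_obs) / (l_rand - l_min)

# Toy lexicon: frequent words tend to be short, as Zipf's law of abbreviation predicts.
freqs = [1000, 400, 150, 60, 20, 5]
lengths = [2, 4, 3, 7, 5, 9]
print(round(optimality_score(freqs, lengths), 3))   # close to 1 for this toy data
```

Read this way, a per-language score of 0.62 would correspond to the 62 percent figure quoted in the abstract.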
Related papers
- Speech perception: a model of word recognition [0.0]
We present a model of speech perception which takes into account effects of correlations between sounds.
Words in this model correspond to the attractors of a suitably chosen descent dynamics.
We examine the decryption of short and long words in the presence of mishearings.
arXiv Detail & Related papers (2024-10-24T09:41:47Z)
- Syntactic Language Change in English and German: Metrics, Parsers, and Convergences [56.47832275431858]
The current paper looks at diachronic trends in syntactic language change in both English and German, using corpora of parliamentary debates from the last c. 160 years.
We base our observations on five dependency parsers, including the widely used Stanford CoreNLP as well as four newer alternatives.
We show that changes in syntactic measures seem to be more frequent at the tails of sentence length distributions.
arXiv Detail & Related papers (2024-02-18T11:46:16Z)
- Revisiting the Optimality of Word Lengths [92.70590105707639]
Zipf (1935) posited that wordforms are optimized to minimize utterances' communicative costs.
Communicative cost can be operationalized in different ways.
arXiv Detail & Related papers (2023-12-06T20:41:47Z)
- A bounded rationality account of dependency length minimization in Hindi [0.0]
The principle of dependency length minimization is thought to shape the structure of human languages for effective communication.
Placing long constituents before short ones preverbally, and short constituents before long ones postverbally, is known to minimize the overall dependency length of a sentence.
In this study, we test the hypothesis that placing only the shortest preverbal constituent next to the main-verb explains word order preferences in Hindi.
arXiv Detail & Related papers (2023-04-22T13:53:50Z)
- Direct and indirect evidence of compression of word lengths. Zipf's law of abbreviation revisited [0.4893345190925177]
Zipf's law of abbreviation, the tendency of more frequent words to be shorter, is one of the most solid candidates for a linguistic universal.
We provide evidence that the law holds also in speech (when word length is measured in time), in particular in 46 languages from 14 linguistic families.
Motivated by the need for direct evidence of compression, we derive a simple formula for a random baseline, which indicates that word lengths are systematically below chance.
arXiv Detail & Related papers (2023-03-17T17:12:18Z)
- Long-range and hierarchical language predictions in brains and algorithms [82.81964713263483]
We show that while deep language algorithms are optimized to predict adjacent words, the human brain appears to be tuned to make long-range and hierarchical predictions.
This study strengthens predictive coding theory and suggests a critical role of long-range and hierarchical predictions in natural language processing.
arXiv Detail & Related papers (2021-11-28T20:26:07Z)
- Dependency distance minimization predicts compression [1.2944868613449219]
Dependency distance minimization (DDm) is a well-established principle of word order.
The prediction that DDm implies compression is a second-order prediction, because it links one principle with another principle, rather than a principle with a manifestation as in a first-order prediction.
We use a recently introduced score that has many mathematical and statistical advantages with respect to the widely used sum of dependency distances.
arXiv Detail & Related papers (2021-09-18T10:53:39Z)
- Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
arXiv Detail & Related papers (2020-10-05T17:19:10Z)
- The optimality of syntactic dependency distances [0.802904964931021]
We recast the problem of the optimality of the word order of a sentence as an optimization problem on a spatial network.
We introduce a new score to quantify the cognitive pressure to reduce the distance between linked words in a sentence.
The analysis of sentences from 93 languages reveals that half of the languages are optimized to 70% or more.
arXiv Detail & Related papers (2020-07-30T09:40:41Z)
- Toward Better Storylines with Sentence-Level Language Models [54.91921545103256]
We propose a sentence-level language model which selects the next sentence in a story from a finite set of fluent alternatives.
We demonstrate the effectiveness of our approach with state-of-the-art accuracy on the unsupervised Story Cloze task.
arXiv Detail & Related papers (2020-05-11T16:54:19Z)
- Phonotactic Complexity and its Trade-offs [73.10961848460613]
Bits per phoneme, a simple measure of phonotactic complexity, allows entropy to be compared across languages.
We demonstrate a very strong negative correlation of -0.74 between bits per phoneme and the average length of words.
arXiv Detail & Related papers (2020-05-07T21:36:59Z)
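
The last entry measures phonotactic complexity in bits per phoneme and reports a strong negative correlation (r = -0.74) with average word length. The original work estimates bits per phoneme with phoneme-level language models; the sketch below substitutes a plain unigram estimate and invented toy corpora purely to show which two quantities are being correlated (all language names and data are hypothetical).

```python
import math
from collections import Counter

def bits_per_phoneme(words):
    """Unigram entropy estimate in bits per phoneme (a deliberately crude
    stand-in for the phoneme-level language models used in the paper)."""
    counts = Counter(ph for word in words for ph in word)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def mean_word_length(words):
    return sum(len(word) for word in words) / len(words)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical corpora: each word is a tuple of phoneme symbols.
toy_corpora = {
    "lang_A": [("p", "a", "t", "a"), ("k", "a", "t", "a"), ("p", "a", "k", "a", "t", "a")],  # few phonemes, long words
    "lang_B": [("m", "o", "k", "i"), ("s", "a", "n", "o"), ("t", "u", "m", "a")],
    "lang_C": [("s", "t", "i"), ("p", "l", "o"), ("g", "r", "a"), ("f", "u", "n")],          # many phonemes, short words
}

bpp = [bits_per_phoneme(ws) for ws in toy_corpora.values()]
awl = [mean_word_length(ws) for ws in toy_corpora.values()]
print("bits per phoneme:   ", [round(x, 2) for x in bpp])
print("average word length:", [round(x, 2) for x in awl])
print("Pearson r:          ", round(pearson(bpp, awl), 2))  # strongly negative on this toy data
```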