Direct and indirect evidence of compression of word lengths. Zipf's law
of abbreviation revisited
- URL: http://arxiv.org/abs/2303.10128v2
- Date: Sat, 27 May 2023 08:36:57 GMT
- Title: Direct and indirect evidence of compression of word lengths. Zipf's law
of abbreviation revisited
- Authors: Sonia Petrini, Antoni Casas-i-Muñoz, Jordi Cluet-i-Martinell,
Mengxue Wang, Chris Bentz and Ramon Ferrer-i-Cancho
- Abstract summary: Zipf's law of abbreviation, the tendency of more frequent words to be shorter, is one of the most solid candidates for a linguistic universal.
We provide evidence that the law also holds in speech (when word length is measured in time), in particular in 46 languages from 14 linguistic families.
Motivated by the need for direct evidence of compression, we derive a simple formula for a random baseline indicating that word lengths are systematically below chance.
- Score: 0.4893345190925177
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Zipf's law of abbreviation, the tendency of more frequent words to be
shorter, is one of the most solid candidates for a linguistic universal, in the
sense that it has the potential for being exceptionless or with a number of
exceptions that is vanishingly small compared to the number of languages on
Earth. Since Zipf's pioneering research, this law has been viewed as a
manifestation of a universal principle of communication, i.e. the minimization
of word lengths, to reduce the effort of communication. Here we revisit the
concordance of written language with the law of abbreviation. Crucially, we
provide wider evidence that the law also holds in speech (when word length is
measured in time), in particular in 46 languages from 14 linguistic families.
Agreement with the law of abbreviation provides indirect evidence of
compression of languages via the theoretical argument that the law of
abbreviation is a prediction of optimal coding. Motivated by the need for direct
evidence of compression, we derive a simple formula for a random baseline
indicating that word lengths are systematically below chance, across linguistic
families and writing systems, and independently of the unit of measurement
(length in characters or duration in time). Our work paves the way to measure
and compare the degree of optimality of word lengths in languages.
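The abstract does not reproduce the baseline formula itself. One natural reconstruction (an assumption on our part, not a quotation of the paper) is the expected frequency-weighted mean word length when lengths are shuffled at random across word types, which reduces to the plain unweighted mean:

```latex
% Hedged reconstruction, not the paper's verbatim derivation.
% p_i: relative frequency of word type i; l_i: its length (in characters
% or in time); \pi: a uniformly random permutation of {1, ..., n}.
\[
L = \sum_{i=1}^{n} p_i \, l_i ,
\qquad
\mathbb{E}\!\left[ \sum_{i=1}^{n} p_i \, l_{\pi(i)} \right]
  = \sum_{i=1}^{n} p_i \cdot \frac{1}{n} \sum_{j=1}^{n} l_j
  = \frac{1}{n} \sum_{j=1}^{n} l_j .
\]
```

Compression in the sense above means the observed weighted mean falls below this baseline. A minimal Python sketch of both checks, on an invented toy lexicon:

```python
# Minimal sketch (not the authors' code): test the law of abbreviation
# and compare the observed frequency-weighted mean word length against
# the random baseline above. The toy lexicon is invented; requires SciPy.
from scipy.stats import kendalltau

# Toy lexicon: (word, corpus frequency); length measured in characters.
lexicon = [("the", 500), ("a", 450), ("of", 300), ("word", 60),
           ("language", 40), ("communication", 10)]

freqs = [f for _, f in lexicon]
lengths = [len(w) for w, _ in lexicon]

# Law of abbreviation: frequency and length should correlate negatively.
tau, p_value = kendalltau(freqs, lengths)
print(f"Kendall tau(frequency, length) = {tau:.3f} (p = {p_value:.3f})")

# Direct evidence of compression: observed weighted mean length vs. its
# expected value under random shuffling of lengths (the unweighted mean).
total = sum(freqs)
observed = sum(f / total * l for f, l in zip(freqs, lengths))
baseline = sum(lengths) / len(lengths)
print(f"observed mean length = {observed:.2f}, random baseline = {baseline:.2f}")
```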
Related papers
- Speech perception: a model of word recognition [0.0]
We present a model of speech perception which takes into account effects of correlations between sounds.
Words in this model correspond to the attractors of a suitably chosen descent dynamics.
We examine the decryption of short and long words in the presence of mishearings.
arXiv Detail & Related papers (2024-10-24T09:41:47Z)
- Work Smarter...Not Harder: Efficient Minimization of Dependency Length in SOV Languages [0.34530027457862006]
Moving a short preverbal constituent next to the main verb explains preverbal constituent ordering decisions better than global minimization of dependency length in SOV languages.
This research sheds light on the role of bounded rationality in linguistic decision-making and language evolution.
arXiv Detail & Related papers (2024-04-29T13:30:27Z)
- Syntactic Language Change in English and German: Metrics, Parsers, and Convergences [56.47832275431858]
The current paper looks at diachronic trends in syntactic language change in both English and German, using corpora of parliamentary debates from the last c. 160 years.
We base our observations on five dependency parsers, including the widely used Stanford CoreNLP as well as four newer alternatives.
We show that changes in syntactic measures seem to be more frequent at the tails of sentence length distributions.
arXiv Detail & Related papers (2024-02-18T11:46:16Z)
- Revisiting the Optimality of Word Lengths [92.70590105707639]
Communicative cost can be operationalized in different ways.
Zipf (1935) posited that wordforms are optimized to minimize utterances' communicative costs.
arXiv Detail & Related papers (2023-12-06T20:41:47Z)
- Quantifying the redundancy between prosody and text [67.07817268372743]
We use large language models to estimate how much information is redundant between prosody and the words themselves.
We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features.
Still, we observe that prosodic features can not be fully predicted from text, suggesting that prosody carries information above and beyond the words.
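The redundancy in question is, in information-theoretic terms, the mutual information between prosody and the words. A minimal sketch with a discretized prosodic feature and invented data follows; the paper itself estimates redundancy over continuous prosodic features with large language models, so this is only the simplest analogue.

```python
# Minimal sketch: mutual information between words and a discretized
# prosodic feature, from co-occurrence counts. Data are invented; the
# paper estimates redundancy with large language models instead.
from collections import Counter
from math import log2

def mutual_information(pairs):
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

# Toy (word, prominence) observations; capitalized words are accented.
pairs = [("the", "weak"), ("the", "weak"), ("DOG", "strong"),
         ("ran", "weak"), ("CAT", "strong"), ("a", "weak")]
print(round(mutual_information(pairs), 3))  # > 0: words predict prosody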
arXiv Detail & Related papers (2023-11-28T21:15:24Z)
- A Cross-Linguistic Pressure for Uniform Information Density in Word Order [79.54362557462359]
We use computational models to test whether real orders lead to greater information uniformity than counterfactual orders.
Among SVO languages, real word orders consistently have greater uniformity than reverse word orders.
Only linguistically implausible counterfactual orders consistently exceed the uniformity of real orders.
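A common operationalization of information uniformity, assumed here for illustration rather than taken from the paper, is the variance of per-word surprisal under a language model; lower variance means a more uniform density.

```python
# Minimal sketch, assuming per-word surprisals (in bits) from some
# language model; the numbers below are invented for illustration.
def uid_variance(surprisals):
    """Variance of per-word surprisal: lower = more uniform density."""
    mean = sum(surprisals) / len(surprisals)
    return sum((s - mean) ** 2 for s in surprisals) / len(surprisals)

real_order = [2.1, 2.4, 1.9, 2.3]      # hypothetical real word order
reverse_order = [0.5, 1.2, 3.8, 4.9]   # hypothetical reversed order
print(uid_variance(real_order) < uid_variance(reverse_order))  # True
```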
arXiv Detail & Related papers (2023-06-06T14:52:15Z)
- A bounded rationality account of dependency length minimization in Hindi [0.0]
The principle of dependency length minimization is thought to shape the structure of human languages for effective communication.
Preverbal long-before-short and postverbal short-before-long constituent orderings are known to minimize the overall dependency length of a sentence.
In this study, we test the hypothesis that placing only the shortest preverbal constituent next to the main-verb explains word order preferences in Hindi.
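For concreteness, the quantity being minimized can be computed directly from a head-index annotation; this generic sketch is not the study's implementation.

```python
# Minimal sketch: total dependency length from a head-index annotation.
# heads[i] is the 0-based index of word i's head, or None for the root.
def total_dependency_length(heads):
    return sum(abs(i - h) for i, h in enumerate(heads) if h is not None)

# Toy SOV-like sentence: three preverbal constituents headed by the
# final verb (index 3), which is the root.
print(total_dependency_length([3, 3, 3, None]))  # 3 + 2 + 1 = 6
```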
arXiv Detail & Related papers (2023-04-22T13:53:50Z)
- The optimality of word lengths. Theoretical foundations and an empirical study [0.7682551949752529]
Zipf's law of abbreviation has been viewed as a manifestation of compression.
We quantify for the first time the degree of optimality of word lengths in languages.
In general, spoken word durations are more optimized than written word lengths in characters.
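One way to express such a degree of optimization, sketched here as a hypothetical normalization rather than the paper's exact score, places the observed frequency-weighted mean length between the random baseline (shuffled lengths) and the minimum reachable by pairing the shortest lengths with the most frequent words.

```python
# Minimal sketch of a hypothetical optimality score in [0, 1]:
# 0 = at the random baseline, 1 = fully optimized length assignment.
def optimality(freqs, lengths):
    total = sum(freqs)
    observed = sum(f / total * l for f, l in zip(freqs, lengths))
    baseline = sum(lengths) / len(lengths)   # mean under random shuffling
    # Best case: shortest lengths paired with the highest frequencies.
    probs = sorted((f / total for f in freqs), reverse=True)
    minimum = sum(p * l for p, l in zip(probs, sorted(lengths)))
    return (baseline - observed) / (baseline - minimum)

print(round(optimality([500, 300, 60, 10], [3, 2, 4, 13]), 2))  # ~0.92
```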
arXiv Detail & Related papers (2022-08-22T15:03:31Z)
- Dependency distance minimization predicts compression [1.2944868613449219]
Dependency distance minimization (DDm) is a well-established principle of word order.
This is a second-order prediction because it links one principle to another principle, rather than a principle to a manifestation, as in a first-order prediction.
We use a recently introduced score that has many mathematical and statistical advantages with respect to the widely used sum of dependency distances.
arXiv Detail & Related papers (2021-09-18T10:53:39Z)
- Disambiguatory Signals are Stronger in Word-initial Positions [48.18148856974974]
We point out the confounds in existing methods for comparing the informativeness of segments early in the word versus later in the word.
We find evidence across hundreds of languages that indeed there is a cross-linguistic tendency to front-load information in words.
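As the simplest illustration of position-wise informativeness (the study itself uses in-context measures and controls for the confounds it identifies, so this unconditional entropy is only a starting point), one can compute the entropy of the segment distribution at each position across a toy lexicon.

```python
# Minimal sketch: per-position character entropy across a toy lexicon.
# Higher entropy at a position = more disambiguating potential there.
from collections import Counter
from math import log2

def positional_entropy(words, position):
    segments = [w[position] for w in words if len(w) > position]
    counts = Counter(segments)
    n = sum(counts.values())
    return sum(-(c / n) * log2(c / n) for c in counts.values())

words = ["cat", "cut", "cot", "bat", "bit", "dot"]
for pos in range(3):
    print(pos, round(positional_entropy(words, pos), 3))
```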
arXiv Detail & Related papers (2021-02-03T18:19:16Z)
- Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
arXiv Detail & Related papers (2020-10-05T17:19:10Z)
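The entropy operationalization in the entry above can be written down directly; the meaning distributions in this sketch are invented.

```python
# Minimal sketch: lexical ambiguity as the entropy of a word's
# distribution over meanings. Probabilities here are invented.
from math import log2

def ambiguity(meaning_probs):
    """Entropy (bits) of the distribution over a word's meanings."""
    return sum(-p * log2(p) for p in meaning_probs if p > 0)

print(ambiguity([1.0]))            # unambiguous word: 0.0 bits
print(ambiguity([0.5, 0.3, 0.2]))  # ambiguous word: ~1.49 bits
```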
This list is automatically generated from the titles and abstracts of the papers on this site.