Direct and indirect evidence of compression of word lengths. Zipf's law
of abbreviation revisited
- URL: http://arxiv.org/abs/2303.10128v2
- Date: Sat, 27 May 2023 08:36:57 GMT
- Title: Direct and indirect evidence of compression of word lengths. Zipf's law
of abbreviation revisited
- Authors: Sonia Petrini, Antoni Casas-i-Muñoz, Jordi Cluet-i-Martinell,
Mengxue Wang, Chris Bentz and Ramon Ferrer-i-Cancho
- Abstract summary: Zipf's law of abbreviation, the tendency of more frequent words to be shorter, is one of the most solid candidates for a linguistic universal.
We provide evidence that the law also holds in speech (when word length is measured in time), in particular in 46 languages from 14 linguistic families.
Motivated by the need for direct evidence of compression, we derive a simple formula for a random baseline indicating that word lengths are systematically below chance.
- Score: 0.4893345190925177
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Zipf's law of abbreviation, the tendency of more frequent words to be
shorter, is one of the most solid candidates for a linguistic universal, in the
sense that it has the potential for being exceptionless or with a number of
exceptions that is vanishingly small compared to the number of languages on
Earth. Since Zipf's pioneering research, this law has been viewed as a
manifestation of a universal principle of communication, i.e. the minimization
of word lengths, to reduce the effort of communication. Here we revisit the
concordance of written language with the law of abbreviation. Crucially, we
provide wider evidence that the law also holds in speech (when word length is
measured in time), in particular in 46 languages from 14 linguistic families.
Agreement with the law of abbreviation provides indirect evidence of
compression of languages via the theoretical argument that the law of
abbreviation is a prediction of optimal coding. Motivated by the need for direct
evidence of compression, we derive a simple formula for a random baseline
indicating that word lengths are systematically below chance, across linguistic
families and writing systems, and independently of the unit of measurement
(length in characters or duration in time). Our work paves the way to measure
and compare the degree of optimality of word lengths in languages.
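The abstract does not reproduce the baseline formula itself. One natural reconstruction (an assumption on our part, not a quotation of the paper) is the expected frequency-weighted mean word length when lengths are shuffled at random across word types, which reduces to the plain unweighted mean:

```latex
% Hedged reconstruction, not the paper's verbatim derivation.
% p_i: relative frequency of word type i; l_i: its length (in characters
% or in time); \pi: a uniformly random permutation of {1, ..., n}.
\[
L = \sum_{i=1}^{n} p_i \, l_i ,
\qquad
\mathbb{E}\!\left[ \sum_{i=1}^{n} p_i \, l_{\pi(i)} \right]
  = \sum_{i=1}^{n} p_i \cdot \frac{1}{n} \sum_{j=1}^{n} l_j
  = \frac{1}{n} \sum_{j=1}^{n} l_j .
\]
```

Compression in the sense above means the observed weighted mean falls below this baseline. A minimal Python sketch of both checks, on an invented toy lexicon:

```python
# Minimal sketch (not the authors' code): test the law of abbreviation
# and compare the observed frequency-weighted mean word length against
# the random baseline above. The toy lexicon is invented; requires SciPy.
from scipy.stats import kendalltau

# Toy lexicon: (word, corpus frequency); length measured in characters.
lexicon = [("the", 500), ("a", 450), ("of", 300), ("word", 60),
           ("language", 40), ("communication", 10)]

freqs = [f for _, f in lexicon]
lengths = [len(w) for w, _ in lexicon]

# Law of abbreviation: frequency and length should correlate negatively.
tau, p_value = kendalltau(freqs, lengths)
print(f"Kendall tau(frequency, length) = {tau:.3f} (p = {p_value:.3f})")

# Direct evidence of compression: observed weighted mean length vs. its
# expected value under random shuffling of lengths (the unweighted mean).
total = sum(freqs)
observed = sum(f / total * l for f, l in zip(freqs, lengths))
baseline = sum(lengths) / len(lengths)
print(f"observed mean length = {observed:.2f}, random baseline = {baseline:.2f}")
```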
Related papers
- Speech perception: a model of word recognition [0.0]
We present a model of speech perception which takes into account effects of correlations between sounds.
Words in this model correspond to the attractors of a suitably chosen descent dynamics.
We examine the decryption of short and long words in the presence of mishearings.
arXiv Detail & Related papers (2024-10-24T09:41:47Z)
- Work Smarter...Not Harder: Efficient Minimization of Dependency Length in SOV Languages [0.34530027457862006]
Moving a short preverbal constituent next to the main verb explains preverbal constituent ordering decisions better than global minimization of dependency length in SOV languages.
This research sheds light on the role of bounded rationality in linguistic decision-making and language evolution.
arXiv Detail & Related papers (2024-04-29T13:30:27Z)
- Syntactic Language Change in English and German: Metrics, Parsers, and Convergences [56.47832275431858]
The current paper looks at diachronic trends in syntactic language change in both English and German, using corpora of parliamentary debates from the last c. 160 years.
We base our observations on five dependency parsers, including the widely used Stanford CoreNLP as well as four newer alternatives.
We show that changes in syntactic measures seem to be more frequent at the tails of sentence length distributions.
arXiv Detail & Related papers (2024-02-18T11:46:16Z)
- Revisiting the Optimality of Word Lengths [92.70590105707639]
Communicative cost can be operationalized in different ways.
Zipf (1935) posited that wordforms are optimized to minimize utterances' communicative costs.
arXiv Detail & Related papers (2023-12-06T20:41:47Z)
- Quantifying the redundancy between prosody and text [67.07817268372743]
We use large language models to estimate how much information is redundant between prosody and the words themselves.
We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features.
Still, we observe that prosodic features can not be fully predicted from text, suggesting that prosody carries information above and beyond the words.
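The redundancy in question is, in information-theoretic terms, the mutual information between prosody and the words. A minimal sketch with a discretized prosodic feature and invented data follows; the paper itself estimates redundancy over continuous prosodic features with large language models, so this is only the simplest analogue.

```python
# Minimal sketch: mutual information between words and a discretized
# prosodic feature, from co-occurrence counts. Data are invented; the
# paper estimates redundancy with large language models instead.
from collections import Counter
from math import log2

def mutual_information(pairs):
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

# Toy (word, prominence) observations; capitalized words are accented.
pairs = [("the", "weak"), ("the", "weak"), ("DOG", "strong"),
         ("ran", "weak"), ("CAT", "strong"), ("a", "weak")]
print(round(mutual_information(pairs), 3))  # > 0: words predict prosody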
arXiv Detail & Related papers (2023-11-28T21:15:24Z)
- A Cross-Linguistic Pressure for Uniform Information Density in Word Order [79.54362557462359]
We use computational models to test whether real orders lead to greater information uniformity than counterfactual orders.
Among SVO languages, real word orders consistently have greater uniformity than reverse word orders.
Only linguistically implausible counterfactual orders consistently exceed the uniformity of real orders.
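A common operationalization of information uniformity, assumed here for illustration rather than taken from the paper, is the variance of per-word surprisal under a language model; lower variance means a more uniform density.

```python
# Minimal sketch, assuming per-word surprisals (in bits) from some
# language model; the numbers below are invented for illustration.
def uid_variance(surprisals):
    """Variance of per-word surprisal: lower = more uniform density."""
    mean = sum(surprisals) / len(surprisals)
    return sum((s - mean) ** 2 for s in surprisals) / len(surprisals)

real_order = [2.1, 2.4, 1.9, 2.3]      # hypothetical real word order
reverse_order = [0.5, 1.2, 3.8, 4.9]   # hypothetical reversed order
print(uid_variance(real_order) < uid_variance(reverse_order))  # True
```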
arXiv Detail & Related papers (2023-06-06T14:52:15Z)
- A bounded rationality account of dependency length minimization in Hindi [0.0]
The principle of dependency length minimization is thought to shape the structure of human languages for effective communication.
Preverbal long-before-short and postverbal short-before-long constituent orderings are known to minimize the overall dependency length of a sentence.
In this study, we test the hypothesis that placing only the shortest preverbal constituent next to the main-verb explains word order preferences in Hindi.
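For concreteness, the quantity being minimized can be computed directly from a head-index annotation; this generic sketch is not the study's implementation.

```python
# Minimal sketch: total dependency length from a head-index annotation.
# heads[i] is the 0-based index of word i's head, or None for the root.
def total_dependency_length(heads):
    return sum(abs(i - h) for i, h in enumerate(heads) if h is not None)

# Toy SOV-like sentence: three preverbal constituents headed by the
# final verb (index 3), which is the root.
print(total_dependency_length([3, 3, 3, None]))  # 3 + 2 + 1 = 6
```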
arXiv Detail & Related papers (2023-04-22T13:53:50Z)
- The optimality of word lengths. Theoretical foundations and an empirical study [0.7682551949752529]
Zipf's law of abbreviation has been viewed as a manifestation of compression.
We quantify for the first time the degree of optimality of word lengths in languages.
In general, spoken word durations are more optimized than written word lengths in characters.
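One way to express such a degree of optimization, sketched here as a hypothetical normalization rather than the paper's exact score, places the observed frequency-weighted mean length between the random baseline (shuffled lengths) and the minimum reachable by pairing the shortest lengths with the most frequent words.

```python
# Minimal sketch of a hypothetical optimality score in [0, 1]:
# 0 = at the random baseline, 1 = fully optimized length assignment.
def optimality(freqs, lengths):
    total = sum(freqs)
    observed = sum(f / total * l for f, l in zip(freqs, lengths))
    baseline = sum(lengths) / len(lengths)   # mean under random shuffling
    # Best case: shortest lengths paired with the highest frequencies.
    probs = sorted((f / total for f in freqs), reverse=True)
    minimum = sum(p * l for p, l in zip(probs, sorted(lengths)))
    return (baseline - observed) / (baseline - minimum)

print(round(optimality([500, 300, 60, 10], [3, 2, 4, 13]), 2))  # ~0.92
```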
arXiv Detail & Related papers (2022-08-22T15:03:31Z)
- Dependency distance minimization predicts compression [1.2944868613449219]
Dependency distance minimization (DDm) is a well-established principle of word order.
This is a second-order prediction because it links one principle to another principle, rather than a principle to a manifestation, as in a first-order prediction.
We use a recently introduced score that has many mathematical and statistical advantages with respect to the widely used sum of dependency distances.
arXiv Detail & Related papers (2021-09-18T10:53:39Z)
- Disambiguatory Signals are Stronger in Word-initial Positions [48.18148856974974]
We point out the confounds in existing methods for comparing the informativeness of segments early in the word versus later in the word.
We find evidence across hundreds of languages that indeed there is a cross-linguistic tendency to front-load information in words.
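As the simplest illustration of position-wise informativeness (the study itself uses in-context measures and controls for the confounds it identifies, so this unconditional entropy is only a starting point), one can compute the entropy of the segment distribution at each position across a toy lexicon.

```python
# Minimal sketch: per-position character entropy across a toy lexicon.
# Higher entropy at a position = more disambiguating potential there.
from collections import Counter
from math import log2

def positional_entropy(words, position):
    segments = [w[position] for w in words if len(w) > position]
    counts = Counter(segments)
    n = sum(counts.values())
    return sum(-(c / n) * log2(c / n) for c in counts.values())

words = ["cat", "cut", "cot", "bat", "bit", "dot"]
for pos in range(3):
    print(pos, round(positional_entropy(words, pos), 3))
```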
arXiv Detail & Related papers (2021-02-03T18:19:16Z)
- Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
arXiv Detail & Related papers (2020-10-05T17:19:10Z)
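The entropy operationalization in the entry above can be written down directly; the meaning distributions in this sketch are invented.

```python
# Minimal sketch: lexical ambiguity as the entropy of a word's
# distribution over meanings. Probabilities here are invented.
from math import log2

def ambiguity(meaning_probs):
    """Entropy (bits) of the distribution over a word's meanings."""
    return sum(-p * log2(p) for p in meaning_probs if p > 0)

print(ambiguity([1.0]))            # unambiguous word: 0.0 bits
print(ambiguity([0.5, 0.3, 0.2]))  # ambiguous word: ~1.49 bits
```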
This list is automatically generated from the titles and abstracts of the papers on this site.