Universal versus system-specific features of punctuation usage patterns
in major Western languages
- URL: http://arxiv.org/abs/2212.11182v1
- Date: Wed, 21 Dec 2022 16:52:10 GMT
- Title: Universal versus system-specific features of punctuation usage patterns
in major Western languages
- Authors: Tomasz Stanisz, Stanislaw Drozdz, Jaroslaw Kwapien
- Abstract summary: In written texts, punctuation can be considered one manifestation of the proverb "speech is silver, silence is golden".
This study is based on a large corpus of world-famous and representative literary texts in seven major Western languages.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The celebrated proverb that "speech is silver, silence is golden" has a long
multinational history and multiple specific meanings. In written texts
punctuation can in fact be considered one of its manifestations. Indeed, the
virtue of effectively speaking and writing involves - often decisively - the
capacity to apply the properly placed breaks. In the present study, based on a
large corpus of world-famous and representative literary texts in seven major
Western languages, it is shown that the distribution of intervals between
consecutive punctuation marks in almost all texts can universally be
characterised by only two parameters of the discrete Weibull distribution which
can be given an intuitive interpretation in terms of the so-called hazard
function. The values of these two parameters tend to be language-specific,
however, and even appear to navigate translations. The properties of the
computed hazard functions indicate that among the studied languages, English
turns out to be the least constrained by the necessity to place a consecutive
punctuation mark to partition a sequence of words. This may suggest that when
compared to other studied languages, English is more flexible, in the sense of
allowing longer uninterrupted sequences of words. Spanish reveals a similar
tendency, to only a slightly lesser extent.
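
The two-parameter description in the abstract can be illustrated with a minimal Python sketch. It assumes the common parametrisation of the discrete Weibull distribution, P(k) = (1-p)^((k-1)^beta) - (1-p)^(k^beta) for k = 1, 2, ..., which the abstract itself does not spell out. The function names, the set of punctuation marks, and the whitespace tokenisation are hypothetical choices made for illustration, not the authors' procedure.

```python
def punctuation_intervals(text, marks=".,;:!?"):
    """Number of words between consecutive punctuation marks.

    Assumption: words are whitespace-separated tokens and only the
    characters in `marks` count as punctuation.
    """
    intervals, count = [], 0
    for token in text.split():
        if token.strip(marks):           # token contains an actual word
            count += 1
        if any(ch in marks for ch in token) and count > 0:
            intervals.append(count)      # a mark closes the current interval
            count = 0
    return intervals


def dw_pmf(k, p, beta):
    """Discrete Weibull probability of an interval of exactly k words."""
    q = 1.0 - p
    return q ** ((k - 1) ** beta) - q ** (k ** beta)


def dw_hazard(k, p, beta):
    """Hazard function: probability that a punctuation mark follows
    word k, given that none has appeared after the first k - 1 words."""
    q = 1.0 - p
    return 1.0 - q ** (k ** beta - (k - 1) ** beta)


if __name__ == "__main__":
    print(punctuation_intervals("Speech is silver, silence is golden."))
    for k in (1, 5, 10):
        print(k, dw_pmf(k, 0.2, 1.3), dw_hazard(k, 0.2, 1.3))
```

Under this parametrisation, beta = 1 gives a constant hazard equal to p (the geometric, memoryless case), while beta > 1 makes the hazard grow with k, so a punctuation mark becomes increasingly likely as a word sequence lengthens. A lower hazard corresponds to the abstract's notion of a language being less constrained to place the next punctuation mark.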
Related papers
- Statistics of punctuation in experimental literature -- the remarkable case of "Finnegans Wake" by James Joyce [0.0]
The present work extends the analysis of punctuation usage patterns to more experimental pieces of world literature.
It turns out that the compliance of the distances between punctuation marks with the discrete Weibull distribution typically applies here as well.
Some of the works by James Joyce are distinct in this regard - in the sense that the tails of the relevant distributions are significantly thicker.
arXiv Detail & Related papers (2024-08-31T15:30:51Z)
- Complex systems approach to natural language [0.0]
Review summarizes the main methodological concepts used in studying natural language from the perspective of complexity science.
Three main complexity-related research trends in quantitative linguistics are covered.
arXiv Detail & Related papers (2024-01-05T12:01:26Z)
- Quantifying the redundancy between prosody and text [67.07817268372743]
We use large language models to estimate how much information is redundant between prosody and the words themselves.
We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features.
Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words.
arXiv Detail & Related papers (2023-11-28T21:15:24Z)
- Cross-Linguistic Syntactic Difference in Multilingual BERT: How Good is It and How Does It Affect Transfer? [50.48082721476612]
Multilingual BERT (mBERT) has demonstrated considerable cross-lingual syntactic ability.
We investigate the distributions of grammatical relations induced from mBERT in the context of 24 typologically different languages.
arXiv Detail & Related papers (2022-12-21T09:44:08Z)
- Universality and diversity in word patterns [0.0]
We present an analysis of lexical statistical connections for eleven major languages.
We find that the diverse manners that languages utilize to express word relations give rise to unique pattern distributions.
arXiv Detail & Related papers (2022-08-23T20:03:27Z)
- When is BERT Multilingual? Isolating Crucial Ingredients for Cross-lingual Transfer [15.578267998149743]
We show that the absence of sub-word overlap significantly affects zero-shot transfer when languages differ in their word order.
There is a strong correlation between transfer performance and word embedding alignment between languages.
Our results call for focus in multilingual models on explicitly improving word embedding alignment between languages.
arXiv Detail & Related papers (2021-10-27T21:25:39Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Investigating Cross-Linguistic Adjective Ordering Tendencies with a Latent-Variable Model [66.84264870118723]
We present the first purely corpus-driven model of multi-lingual adjective ordering in the form of a latent-variable model.
We provide strong converging evidence for the existence of universal, cross-linguistic, hierarchical adjective ordering tendencies.
arXiv Detail & Related papers (2020-10-09T18:27:55Z)
- Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
arXiv Detail & Related papers (2020-10-05T17:19:10Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
- Heaps' law and Heaps functions in tagged texts: Evidences of their linguistic relevance [0.0]
We study the relationship between vocabulary size and text length in a corpus of 75 literary works in English.
We analyze the progressive appearance of new words of each tag along each individual text.
arXiv Detail & Related papers (2020-01-07T17:05:16Z)