Quantifying patterns of punctuation in modern Chinese prose
- URL: http://arxiv.org/abs/2503.04449v1
- Date: Thu, 06 Mar 2025 14:04:30 GMT
- Title: Quantifying patterns of punctuation in modern Chinese prose
- Authors: Michał Dolina, Jakub Dec, Stanisław Drożdż, Jarosław Kwapień, Jin Liu, Tomasz Stanisz
- Abstract summary: Analysis of Western classical literature reveals that the distribution of spaces between punctuation marks aligns with a discrete Weibull distribution. The distance distribution between punctuation marks in Chinese texts follows the Weibull model, though larger spacing is less frequent than in English translations. This variability supports the formation of complex, multifractal sentence structures.
- Score: 1.9246599045323012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research shows that punctuation patterns in texts exhibit universal features across languages. Analysis of Western classical literature reveals that the distribution of spaces between punctuation marks aligns with a discrete Weibull distribution, typically used in survival analysis. By extending this analysis to Chinese literature represented here by three notable contemporary works, it is shown that Zipf's law applies to Chinese texts similarly to Western texts, where punctuation patterns also improve adherence to the law. Additionally, the distance distribution between punctuation marks in Chinese texts follows the Weibull model, though larger spacing is less frequent than in English translations. Sentence-ending punctuation, representing sentence length, diverges more from this pattern, reflecting greater flexibility in sentence length. This variability supports the formation of complex, multifractal sentence structures, particularly evident in Gao Xingjian's "Soul Mountain". These findings demonstrate that both Chinese and Western texts share universal punctuation and word distribution patterns, underscoring their broad applicability across languages.
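As a rough illustration of the quantities the abstract refers to, the following Python sketch counts the words between consecutive punctuation marks and evaluates a discrete Weibull probability mass function. It is a minimal sketch under stated assumptions, not the authors' code: the input is assumed to be already segmented into word and punctuation tokens, the PUNCTUATION set and both function names are hypothetical, and the parametrization P(k) = (1-p)^((k-1)^beta) - (1-p)^(k^beta) is one common form of the discrete Weibull distribution that this page does not itself specify.

```python
# Hypothetical punctuation inventory (assumption): a few common Chinese
# and Western marks, not the set used in the paper.
PUNCTUATION = set("，。！？；：、,.!?;:")


def punctuation_distances(tokens):
    """Return the number of words between consecutive punctuation marks.

    `tokens` is assumed to be a pre-segmented list in which each
    punctuation mark is its own token. Words after the last mark are
    ignored, and adjacent marks with no words between them are skipped.
    """
    distances = []
    run = 0
    for tok in tokens:
        if tok in PUNCTUATION:
            if run > 0:
                distances.append(run)
            run = 0
        else:
            run += 1
    return distances


def discrete_weibull_pmf(k, p, beta):
    """Discrete Weibull PMF for k = 1, 2, ... (assumed parametrization).

    P(k) = (1 - p)**((k - 1)**beta) - (1 - p)**(k**beta)
    With beta = 1 this reduces to the geometric distribution.
    """
    q = 1.0 - p
    return q ** ((k - 1) ** beta) - q ** (k ** beta)


if __name__ == "__main__":
    sample = ["我", "喜欢", "读书", "，", "你", "呢", "？"]
    print(punctuation_distances(sample))
    print(discrete_weibull_pmf(1, 0.5, 1.0))
```

In practice, the empirical histogram of these distances would be compared with the PMF for fitted values of p and beta; the fitting step is omitted here.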
Related papers
- Punctuation patterns in "Finnegans Wake" by James Joyce are largely translation-invariant [0.0]
The complexity characteristics of texts written in natural languages are significantly related to the rules of punctuation. Recent research shows that James Joyce's famous "Finnegans Wake" is subject to such extreme distribution from the Weibull family that the corresponding hazard function is clearly decreasing. It is shown that the punctuation characteristics of this work remain largely translation invariant, contrary to the common cases.
arXiv Detail & Related papers (2025-01-22T15:27:43Z)
- The Role of Handling Attributive Nouns in Improving Chinese-To-English Machine Translation [5.64086253718739]
We specifically target the translation challenges posed by attributive nouns in Chinese, which frequently cause ambiguities in English translation. By manually inserting the omitted particle X ('DE'), we improve how this critical function word is handled.
arXiv Detail & Related papers (2024-12-18T20:37:52Z)
- Statistics of punctuation in experimental literature -- the remarkable case of "Finnegans Wake" by James Joyce [0.0]
The present work extends the analysis of punctuation usage patterns to more experimental pieces of world literature.
It turns out that the distances between punctuation marks typically comply with the discrete Weibull distribution here as well.
Some of the works by James Joyce are distinct in this regard - in the sense that the tails of the relevant distributions are significantly thicker.
arXiv Detail & Related papers (2024-08-31T15:30:51Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- Complex systems approach to natural language [0.0]
This review summarizes the main methodological concepts used in studying natural language from the perspective of complexity science.
Three main complexity-related research trends in quantitative linguistics are covered.
arXiv Detail & Related papers (2024-01-05T12:01:26Z)
- Narrowing the Gap between Zero- and Few-shot Machine Translation by Matching Styles [53.92189950211852]
Large language models have demonstrated their ability to generalize to machine translation using zero- and few-shot examples with in-context learning.
In this paper, we investigate the factors contributing to this gap and find that this gap can largely be closed (for about 70%) by matching the writing styles of the target corpus.
arXiv Detail & Related papers (2023-11-04T03:18:45Z)
- Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z)
- Prompting Large Language Model for Machine Translation: A Case Study [87.88120385000666]
We offer a systematic study on prompting strategies for machine translation.
We examine factors for prompt template and demonstration example selection.
We explore the use of monolingual data and the feasibility of cross-lingual, cross-domain, and sentence-to-document transfer learning.
arXiv Detail & Related papers (2023-01-17T18:32:06Z)
- Universal versus system-specific features of punctuation usage patterns in major Western languages [0.0]
In written texts punctuation can be considered one of its manifestations.
This study is based on a large corpus of world-famous and representative literary texts in seven major Western languages.
arXiv Detail & Related papers (2022-12-21T16:52:10Z)
- Investigating Cross-Linguistic Adjective Ordering Tendencies with a Latent-Variable Model [66.84264870118723]
We present the first purely corpus-driven model of multi-lingual adjective ordering in the form of a latent-variable model.
We provide strong converging evidence for the existence of universal, cross-linguistic, hierarchical adjective ordering tendencies.
arXiv Detail & Related papers (2020-10-09T18:27:55Z)
- The 'Letter' Distribution in the Chinese Language [24.507787098011907]
Studies have found that letters in some alphabetic writing languages have strikingly similar statistical usage frequency distributions.
This study provides new evidence of the consistency of human languages.
arXiv Detail & Related papers (2020-05-26T05:18:56Z)
- Multilingual Alignment of Contextual Word Representations [49.42244463346612]
BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model.
We introduce a contextual version of word retrieval and show that it correlates well with downstream zero-shot transfer.
These results support contextual alignment as a useful concept for understanding large multilingual pre-trained models.
arXiv Detail & Related papers (2020-02-10T03:27:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.