The 'Letter' Distribution in the Chinese Language
- URL: http://arxiv.org/abs/2006.01210v1
- Date: Tue, 26 May 2020 05:18:56 GMT
- Title: The 'Letter' Distribution in the Chinese Language
- Authors: Qinghua Chen, Yan Wang, Mengmeng Wang, Xiaomeng Li
- Abstract summary: Studies have found that letters in some alphabetic writing languages have strikingly similar statistical usage frequency distributions.
This study provides new evidence of the consistency of human languages.
- Score: 24.507787098011907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Corpus-based statistical analysis plays a significant role in linguistic
research, and ample evidence has shown that different languages exhibit some
common laws. Studies have found that letters in some alphabetic writing
languages have strikingly similar statistical usage frequency distributions.
Does this hold for Chinese, which employs ideogram writing? We obtained letter
frequency data for several alphabetic writing languages and found a common law
governing their letter distributions. In addition, we collected Chinese
literature corpora from different historical periods, from the Tang Dynasty to
the present, and decomposed written Chinese into three kinds of basic
particles: characters, strokes and constructive parts. The statistical analysis
showed that, across historical periods, the intensity with which these basic
particles were used in Chinese writing varied, but the form of their
distributions remained consistent. In particular, the distributions of the
Chinese constructive parts are consistent with those of the alphabetic writing
languages. This study provides new evidence of the consistency of human
languages.
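
The analysis sketched in the abstract reduces to counting basic writing units in a corpus and comparing their normalized rank-frequency distributions across languages or historical periods. Below is a minimal illustrative sketch of that procedure; the toy sample texts, the character-range filter, and the function names are assumptions for illustration, not the authors' code or data.

```python
from collections import Counter


def rank_frequency(units):
    """Normalized rank-frequency distribution of a sequence of basic units
    (letters, characters, strokes, or constructive parts)."""
    counts = Counter(units)
    total = sum(counts.values())
    # Sort by descending frequency and normalize so the values sum to 1.
    return [count / total for _, count in counts.most_common()]


def english_letters(text):
    """Keep only alphabetic characters, lower-cased, as the basic units."""
    return [ch.lower() for ch in text if ch.isalpha()]


def chinese_characters(text):
    """Treat each character in the basic CJK Unified Ideographs block as one unit."""
    return [ch for ch in text if '\u4e00' <= ch <= '\u9fff']


if __name__ == "__main__":
    # Toy corpora for illustration only; the study uses large literature
    # corpora spanning the Tang Dynasty to the present.
    english_sample = "Corpus based statistical analysis plays a significant role"
    chinese_sample = "统计分析在语言研究中起着重要作用"

    en_dist = rank_frequency(english_letters(english_sample))
    zh_dist = rank_frequency(chinese_characters(chinese_sample))

    # Comparing the shapes of the two curves (e.g. frequency against rank on
    # log axes) is what a common distributional form refers to, even though
    # the absolute usage intensities differ between periods and languages.
    print("English letter distribution:", [round(p, 3) for p in en_dist])
    print("Chinese character distribution:", [round(p, 3) for p in zh_dist])
```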
Related papers
- Quantifying patterns of punctuation in modern Chinese prose [1.9246599045323012]
Analysis of Western classical literature reveals that the distribution of spaces between punctuation marks aligns with a discrete Weibull distribution.
The distance distribution between punctuation marks in Chinese texts follows the Weibull model, though larger spacing is less frequent than in English translations.
This variability supports the formation of complex, multifractal sentence structures.
arXiv Detail & Related papers (2025-03-06T14:04:30Z)
- When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
- Computational Modelling of Plurality and Definiteness in Chinese Noun Phrases [13.317456093426808]
We focus on the omission of the plurality and definiteness markers in Chinese noun phrases (NPs).
We build a corpus of Chinese NPs, each of which is accompanied by its corresponding context, and by labels indicating its singularity/plurality and definiteness/indefiniteness.
We train a bank of computational models using both classic machine learning models and state-of-the-art pre-trained language models to predict the plurality and definiteness of each NP.
arXiv Detail & Related papers (2024-03-07T10:06:54Z)
- An Analysis of Letter Dynamics in the English Alphabet [0.0]
We expanded on the statistical analysis of the English alphabet by examining the average frequency with which each letter appears in different categories of writing.
We developed a metric known as distance, d, that can be used to algorithmically recognize different categories of writing (a minimal sketch of one such frequency-distance comparison appears after this list).
arXiv Detail & Related papers (2024-01-28T03:54:41Z)
- Comparing Biases and the Impact of Multilingual Training across Multiple Languages [70.84047257764405]
We present a bias analysis across Italian, Chinese, English, Hebrew, and Spanish on the downstream sentiment analysis task.
We adapt existing sentiment bias templates in English to Italian, Chinese, Hebrew, and Spanish for four attributes: race, religion, nationality, and gender.
Our results reveal similarities in bias expression such as favoritism of groups that are dominant in each language's culture.
arXiv Detail & Related papers (2023-05-18T18:15:07Z)
- Universality and diversity in word patterns [0.0]
We present an analysis of lexical statistical connections for eleven major languages.
We find that the diverse manners that languages utilize to express word relations give rise to unique pattern distributions.
arXiv Detail & Related papers (2022-08-23T20:03:27Z)
- Analyzing Gender Representation in Multilingual Models [59.21915055702203]
We focus on the representation of gender distinctions as a practical case study.
We examine the extent to which the gender concept is encoded in shared subspaces across different languages.
arXiv Detail & Related papers (2022-04-20T00:13:01Z)
- Cross-Lingual Ability of Multilingual Masked Language Models: A Study of Language Structure [54.01613740115601]
We study three language properties: constituent order, composition and word co-occurrence.
Our main conclusion is that the contribution of constituent order and word co-occurrence is limited, while composition is more crucial to the success of cross-lingual transfer.
arXiv Detail & Related papers (2022-03-16T07:09:35Z)
- Syllabic Quantity Patterns as Rhythmic Features for Latin Authorship Attribution [74.27826764855911]
We employ syllabic quantity as a base for deriving rhythmic features for the task of computational authorship attribution of Latin prose texts.
Our experiments, carried out on three different datasets, using two different machine learning methods, show that rhythmic features based on syllabic quantity are beneficial in discriminating among Latin prose authors.
arXiv Detail & Related papers (2021-10-27T06:25:31Z)
- ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information [32.70080326854314]
We propose ChineseBERT, which incorporates the glyph and pinyin information of Chinese characters into language model pretraining.
The proposed ChineseBERT model yields a significant performance boost over baseline models with fewer training steps.
arXiv Detail & Related papers (2021-06-30T13:06:00Z)
- SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models.
We propose three kinds of tokenizers, including 1) SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers, and 2) JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
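
The 'distance' metric d mentioned in the letter-dynamics entry above is only described abstractly in that summary. The sketch below shows one plausible reading, assuming the metric compares a text's letter-frequency profile against per-category reference profiles with a Euclidean distance; the reference texts, the profile construction, and the distance definition are illustrative assumptions, not the cited paper's actual formulation.

```python
import math
from collections import Counter
from string import ascii_lowercase


def letter_profile(text):
    """26-dimensional relative-frequency vector over the letters a-z."""
    counts = Counter(ch for ch in text.lower() if ch in ascii_lowercase)
    total = sum(counts.values()) or 1
    return [counts[ch] / total for ch in ascii_lowercase]


def distance(profile_a, profile_b):
    """Euclidean distance between two letter-frequency profiles
    (one plausible reading of the metric d; the paper's definition may differ)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


if __name__ == "__main__":
    # Hypothetical reference texts standing in for two categories of writing.
    fiction_reference = "it was the best of times it was the worst of times"
    technical_reference = "the algorithm computes the frequency distribution of letters"

    unknown = "frequency analysis of letters in a corpus"
    d_fiction = distance(letter_profile(unknown), letter_profile(fiction_reference))
    d_technical = distance(letter_profile(unknown), letter_profile(technical_reference))

    # Assign the unknown text to the category with the smaller distance.
    print("closer to:", "fiction" if d_fiction < d_technical else "technical")
```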
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.