An Analysis of the Differences Among Regional Varieties of Chinese in
Malay Archipelago
- URL: http://arxiv.org/abs/2209.04611v1
- Date: Sat, 10 Sep 2022 07:29:25 GMT
- Title: An Analysis of the Differences Among Regional Varieties of Chinese in
Malay Archipelago
- Authors: Nankai Lin, Sihui Fu, Hongyan Wu, Shengyi Jiang
- Abstract summary: Chinese features prominently in the Chinese communities located in the nations of Malay Archipelago.
Chinese has undergone the process of adjustment to the local languages and cultures, which leads to the occurrence of a Chinese variant in each country.
- Score: 5.030581940990434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chinese features prominently in the Chinese communities located in the
nations of Malay Archipelago. In these countries, Chinese has undergone the
process of adjustment to the local languages and cultures, which leads to the
occurrence of a Chinese variant in each country. In this paper, we conducted a
quantitative analysis on Chinese news texts collected from five Malay
Archipelago nations, namely Indonesia, Malaysia, Singapore, Philippines and
Brunei, trying to figure out their differences with the texts written in modern
standard Chinese from a lexical and syntactic perspective. The statistical
results show that the Chinese variants used in these five nations are quite
different, diverging from their modern Chinese mainland counterpart. Meanwhile,
we managed to extract and classify several featured Chinese words used in each
nation. All these discrepancies reflect how Chinese evolves overseas, and
demonstrate the profound impact from local societies and cultures on the
development of Chinese.
Related papers
- Do Chinese models speak Chinese languages? [3.1815791977708834]
Language ability provides insights into pre-training data curation.
China has a long history of explicit language policy, varying between inclusivity of minority languages and a Mandarin-first policy.
We test performance of Chinese and Western open-source LLMs on Asian regional and Chinese minority languages.
arXiv Detail & Related papers (2025-03-31T23:19:08Z)
- A Topic-aware Comparable Corpus of Chinese Variations [0.6906005491572401]
Using Dcard for Taiwanese Mandarin and Sina Weibo for Mainland Chinese, we create a comparable corpus that updates regularly and reflects modern language use on social media.
arXiv Detail & Related papers (2024-11-17T04:06:12Z)
- When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
- Evaluation of Google Translate for Mandarin Chinese translation using sentiment and semantic analysis [1.3999481573773074]
Machine translation using large language models (LLMs) is having a significant global impact.
Mandarin Chinese is the official language used for communication by the government and media in China.
In this study, we provide an automated assessment of translation quality of Google Translate with human experts using sentiment and semantic analysis.
arXiv Detail & Related papers (2024-09-08T04:03:55Z)
- How Chinese are Chinese Language Models? The Puzzling Lack of Language Policy in China's LLMs [2.9123921488295768]
We evaluate six open-source multilingual LLMs pre-trained by Chinese companies on 18 languages.
Our experiments show that Chinese LLMs' performance on diverse languages is indistinguishable from that of international LLMs.
We find no sign of any consistent policy, either for or against, language diversity in China's LLM development.
arXiv Detail & Related papers (2024-07-12T19:21:40Z)
- CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation [49.41531871253317]
We present a new Chinese Vision-Language Understanding Evaluation benchmark dataset.
The selection of object categories and images is entirely driven by Chinese native speakers.
We find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs' understanding of Chinese culture.
arXiv Detail & Related papers (2024-07-01T08:35:37Z)
- Historical patterns of rice farming explain modern-day language use in China and Japan more than modernization and urbanization [13.57362490817339]
We used natural language processing to analyze a billion words to study cultural differences on Weibo, one of China's largest social media platforms.
We compared predictions from two common explanations about cultural differences in China (economic development and urban-rural differences) against the less-obvious legacy of rice versus wheat farming.
Across all word categories, rice explained twice as much variance as economic development and urbanization.
Rice areas used more words reflecting tight social ties, holistic thought, and a cautious, prevention orientation.
arXiv Detail & Related papers (2023-08-29T14:47:08Z)
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantics understanding ability of the Chinese PLMs with dictionary knowledge and structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- Comparing Biases and the Impact of Multilingual Training across Multiple Languages [70.84047257764405]
We present a bias analysis across Italian, Chinese, English, Hebrew, and Spanish on the downstream sentiment analysis task.
We adapt existing sentiment bias templates in English to Italian, Chinese, Hebrew, and Spanish for four attributes: race, religion, nationality, and gender.
Our results reveal similarities in bias expression such as favoritism of groups that are dominant in each language's culture.
arXiv Detail & Related papers (2023-05-18T18:15:07Z)
- Analyzing Gender Representation in Multilingual Models [59.21915055702203]
We focus on the representation of gender distinctions as a practical case study.
We examine the extent to which the gender concept is encoded in shared subspaces across different languages.
arXiv Detail & Related papers (2022-04-20T00:13:01Z)
- The 'Letter' Distribution in the Chinese Language [24.507787098011907]
Studies have found that letters in some alphabetic writing languages have strikingly similar statistical usage frequency distributions.
This study provides new evidence of the consistency of human languages.
arXiv Detail & Related papers (2020-05-26T05:18:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.