How Chinese are Chinese Language Models? The Puzzling Lack of Language Policy in China's LLMs
- URL: http://arxiv.org/abs/2407.09652v1
- Date: Fri, 12 Jul 2024 19:21:40 GMT
- Title: How Chinese are Chinese Language Models? The Puzzling Lack of Language Policy in China's LLMs
- Authors: Andrea W Wen-Yi, Unso Eun Seo Jo, Lu Jia Lin, David Mimno
- Abstract summary: We evaluate six open-source multilingual LLMs pre-trained by Chinese companies on 18 languages.
Our experiments show that Chinese LLMs' performance on diverse languages is indistinguishable from that of international LLMs.
We find no sign of any consistent policy, either for or against language diversity, in China's LLM development.
- Score: 2.9123921488295768
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contemporary language models are increasingly multilingual, but Chinese LLM developers must navigate complex political and business considerations of language diversity. Language policy in China aims at influencing public discourse and governing a multi-ethnic society, and has gradually transitioned from a pluralist to a more assimilationist approach since 1949. We explore the impact of these influences on current language technology. We evaluate six open-source multilingual LLMs pre-trained by Chinese companies on 18 languages, spanning a wide range of Chinese, Asian, and Anglo-European languages. Our experiments show that Chinese LLMs' performance on diverse languages is indistinguishable from that of international LLMs. Similarly, the models' technical reports show a lack of consideration for pretraining-data language coverage beyond English and Mandarin Chinese. Examining Chinese AI policy, model experiments, and technical reports, we find no sign of any consistent policy, either for or against language diversity, in China's LLM development. This leaves the puzzling fact that while China regulates both the languages people use daily and language model development, it does not appear to have any policy on the languages in language models.
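To make the experimental setup concrete, here is a minimal sketch of how a per-language comparison of an open-source multilingual LLM might be run, assuming a HuggingFace-style causal LM. The model name, sample sentences, and the choice of perplexity as the probe are illustrative assumptions, not the paper's exact protocol.

```python
# A toy per-language probe in the spirit of the paper's experiments.
# ASSUMPTIONS: the model name, sample sentences, and the use of perplexity
# as the metric are illustrative; the paper's own protocol evaluates six
# Chinese-developed LLMs on 18 languages.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen1.5-0.5B"  # placeholder open-source multilingual LM

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

# Roughly parallel sentences across language families (hypothetical samples).
samples = {
    "cmn": "今天天气很好，我们去公园散步吧。",
    "yue": "今日天氣好好，不如我哋去公園行下。",
    "eng": "The weather is nice today; let's take a walk in the park.",
    "fra": "Il fait beau aujourd'hui, allons nous promener au parc.",
}

def perplexity(text: str) -> float:
    """Token-level perplexity of the model on a single string."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

# Comparable perplexities across languages would suggest comparable coverage.
for lang, text in samples.items():
    print(f"{lang}: perplexity = {perplexity(text):.1f}")
```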
Related papers
- Do Chinese models speak Chinese languages? [3.1815791977708834]
A model's language ability provides insight into the curation of its pre-training data.
China has a long history of explicit language policy, varying between inclusivity of minority languages and a Mandarin-first policy.
We test the performance of Chinese and Western open-source LLMs on Asian regional and Chinese minority languages.
arXiv Detail & Related papers (2025-03-31T23:19:08Z)
- Multilingual Relative Clause Attachment Ambiguity Resolution in Large Language Models [2.3749120526936465]
We study how large language models (LLMs) resolve relative clause (RC) attachment ambiguities.
We assess whether LLMs can achieve human-like interpretations amid the complexities of language.
We evaluate models in English, Spanish, French, German, Japanese, and Korean.
arXiv Detail & Related papers (2025-03-04T19:56:56Z)
- MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages [30.66853618502553]
We introduce MiLiC-Eval, a benchmark designed for minority languages in China, featuring 24K instances across 9 tasks.
MiLiC-Eval focuses on underrepresented writing systems and provides a fine-grained assessment of linguistic and problem-solving skills.
arXiv Detail & Related papers (2025-03-03T03:56:03Z)
- Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Sense [30.62699081329474]
We introduce a novel benchmark for cross-lingual sense disambiguation, StingrayBench.
We collect false friends in four language pairs, namely Indonesian-Malay, Indonesian-Tagalog, Chinese-Japanese, and English-German.
In our analysis of various models, we observe they tend to be biased toward higher-resource languages.
arXiv Detail & Related papers (2024-10-28T22:09:43Z)
- Large Language Models Reflect the Ideology of their Creators [71.65505524599888]
Large language models (LLMs) are trained on vast amounts of data to generate natural language.
This paper shows that the ideological stance of an LLM appears to reflect the worldview of its creators.
arXiv Detail & Related papers (2024-10-24T04:02:30Z)
- Lens: Rethinking Multilingual Enhancement for Large Language Models [70.85065197789639]
Lens is a novel approach to enhancing the multilingual capabilities of large language models (LLMs).
It operates by manipulating the hidden representations within the language-agnostic and language-specific subspaces from top layers of LLMs.
It achieves superior results with much fewer computational resources compared to existing post-training approaches.
arXiv Detail & Related papers (2024-10-06T08:51:30Z)
- Teaching LLMs to Abstain across Languages via Multilingual Feedback [40.84205285309612]
We show that multilingual feedback helps identify knowledge gaps across diverse languages, cultures, and communities.
Extensive experiments demonstrate that our multilingual feedback approach outperforms various strong baselines.
Further analysis reveals that multilingual feedback is an effective and more equitable abstention strategy for serving diverse language speakers.
arXiv Detail & Related papers (2024-06-22T21:59:12Z)
- Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models [79.46179534911019]
Large language models (LLMs) have demonstrated multilingual capabilities, yet they remain mostly English-centric due to imbalanced training corpora.
This work extends the evaluation from NLP tasks to real user queries.
For culture-related tasks that need deep language understanding, prompting in the native language tends to be more promising.
arXiv Detail & Related papers (2024-03-15T12:47:39Z)
- CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z)
- Taiwan LLM: Bridging the Linguistic Divide with a Culturally Aligned Language Model [31.68119156599923]
This paper introduces Taiwan LLM, a pioneering Large Language Model that specifically caters to the Traditional Chinese language.
We have developed a model that not only understands the complexities of Traditional Chinese but also embodies the cultural context of Taiwan.
arXiv Detail & Related papers (2023-11-29T09:48:34Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to perform tasks effectively after observing only a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
- Don't Trust ChatGPT when Your Question is not in English: A Study of Multilingual Abilities and Types of LLMs [16.770697902481107]
Large Language Models (LLMs) have demonstrated exceptional natural language understanding abilities.
We propose a systematic way of quantifying the performance disparities of LLMs under multilingual settings.
The results show that GPT exhibits highly translation-like behaviour in multilingual settings.
arXiv Detail & Related papers (2023-05-24T02:05:03Z)
- Cross-Lingual Ability of Multilingual Masked Language Models: A Study of Language Structure [54.01613740115601]
We study three language properties: constituent order, composition and word co-occurrence.
Our main conclusion is that the contribution of constituent order and word co-occurrence is limited, while composition is more crucial to the success of cross-lingual transfer.
arXiv Detail & Related papers (2022-03-16T07:09:35Z)
- CINO: A Chinese Minority Pre-trained Language Model [30.447739293695026]
We propose CINO (Chinese Minority Pre-trained Language Model), a multilingual pre-trained language model for Chinese minority languages.
It covers Standard Chinese, Cantonese, and six other Chinese minority languages.
arXiv Detail & Related papers (2022-02-28T06:02:06Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group a representation sprachbund; a toy version of this clustering step is sketched below.
Experiments on cross-lingual benchmarks show significant improvements over strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
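The representation-sprachbund pipeline in the last entry is concrete enough to sketch: embed comparable text per language with a multilingual encoder, then cluster the per-language vectors. Below is a minimal illustration assuming a HuggingFace multilingual masked LM; the encoder name, example sentences, and cluster count are assumptions for demonstration, not the cited paper's actual configuration.

```python
# A toy version of the representation-sprachbund idea: one sentence per
# language, mean-pooled encoder states, then k-means over languages.
# ASSUMPTIONS: encoder, sentences, and cluster count are illustrative only.
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

ENCODER = "xlm-roberta-base"  # any multilingual encoder works for the sketch
tokenizer = AutoTokenizer.from_pretrained(ENCODER)
model = AutoModel.from_pretrained(ENCODER)
model.eval()

# One comparable sentence per language; a real study would average many.
sentences = {
    "eng": "The cat sleeps on the warm windowsill.",
    "deu": "Die Katze schläft auf der warmen Fensterbank.",
    "fra": "Le chat dort sur le rebord de fenêtre chaud.",
    "cmn": "猫睡在温暖的窗台上。",
    "jpn": "猫は暖かい窓辺で眠っている。",
    "kor": "고양이가 따뜻한 창턱에서 자고 있다.",
}

def embed(text: str) -> np.ndarray:
    """Mean-pooled final-layer representation of one sentence."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state  # shape (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

langs = list(sentences)
matrix = np.stack([embed(sentences[lang]) for lang in langs])
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(matrix)
for lang, group in zip(langs, groups):
    print(f"{lang} -> representation sprachbund {group}")
```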
This list is automatically generated from the titles and abstracts of the papers on this site.