Do Chinese models speak Chinese languages?
- URL: http://arxiv.org/abs/2504.00289v2
- Date: Mon, 07 Apr 2025 19:09:50 GMT
- Title: Do Chinese models speak Chinese languages?
- Authors: Andrea W Wen-Yi, Unso Eun Seo Jo, David Mimno
- Abstract summary: Language ability provides insights into pre-training data curation. China has a long history of explicit language policy, varying between inclusivity of minority languages and a Mandarin-first policy. We test the performance of Chinese and Western open-source LLMs on Asian regional and Chinese minority languages.
- Score: 3.1815791977708834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The release of top-performing open-weight LLMs has cemented China's role as a leading force in AI development. Do these models support languages spoken in China? Or do they speak the same languages as Western models? Comparing multilingual capabilities is important for two reasons. First, language ability provides insights into pre-training data curation, and thus into resource allocation and development priorities. Second, China has a long history of explicit language policy, varying between inclusivity of minority languages and a Mandarin-first policy. To test whether Chinese LLMs today reflect an agenda about China's languages, we test performance of Chinese and Western open-source LLMs on Asian regional and Chinese minority languages. Our experiments on Information Parity and reading comprehension show Chinese models' performance across these languages correlates strongly (r=0.93) with Western models', with the sole exception being better Mandarin. Sometimes, Chinese models cannot identify languages spoken by Chinese minorities such as Kazakh and Uyghur, even though they are good at French and German. These results provide a window into current development priorities, suggest options for future development, and indicate guidance for end users.
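As a rough illustration of the evaluation the abstract describes, the sketch below computes Information Parity for a single sentence pair and then the Pearson correlation between two models' per-language scores. It assumes Information Parity is the ratio of a model's total negative log-likelihood on an English passage to that on a parallel translation; the checkpoint name, the Uyghur sentence, and the score lists are placeholders for illustration, not the paper's data or exact procedure.

```python
# Minimal sketch of the two quantities named in the abstract: Information Parity
# (assumed here to be NLL(English) / NLL(parallel translation)) and the Pearson
# correlation between two models' per-language scores. All inputs are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from scipy.stats import pearsonr

def total_nll(model, tokenizer, text: str) -> float:
    """Total negative log-likelihood (in nats) the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # `out.loss` is the mean per-token NLL; multiply by predicted-token count for the total.
    return out.loss.item() * (ids.shape[1] - 1)

def information_parity(model, tokenizer, english: str, parallel: str) -> float:
    """IP close to 1 means the language is encoded about as efficiently as English."""
    return total_nll(model, tokenizer, english) / total_nll(model, tokenizer, parallel)

if __name__ == "__main__":
    # Placeholder checkpoint: any open-weight LLM works here.
    tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
    lm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B")
    ip_uyghur = information_parity(
        lm, tok,
        "The weather is good today.",
        "بۈگۈن ھاۋا ياخشى.",  # illustrative Uyghur rendering of the English sentence
    )
    print(f"IP (Uyghur): {ip_uyghur:.3f}")

    # Per-language scores for a Chinese and a Western model (illustrative numbers only;
    # the paper reports r = 0.93 across its language set).
    chinese_model_scores = [0.91, 0.62, 0.41, 0.38, 0.35]
    western_model_scores = [0.91, 0.60, 0.44, 0.36, 0.33]
    r, p = pearsonr(chinese_model_scores, western_model_scores)
    print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```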
Related papers
- MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages [30.66853618502553]
We introduce MiLiC-Eval, a benchmark designed for minority languages in China, featuring 24K instances across 9 tasks. MiLiC-Eval focuses on underrepresented writing systems and provides a fine-grained assessment of linguistic and problem-solving skills.
arXiv Detail & Related papers (2025-03-03T03:56:03Z) - Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models [24.47838086336772]
Chinese SimpleQA is the first comprehensive Chinese benchmark to evaluate the factuality ability of language models to answer short questions.
We focus on the Chinese language over 6 major topics with 99 diverse subtopics.
arXiv Detail & Related papers (2024-11-11T17:10:56Z) - SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z) - How Chinese are Chinese Language Models? The Puzzling Lack of Language Policy in China's LLMs [2.9123921488295768]
We evaluate six open-source multilingual LLMs pre-trained by Chinese companies on 18 languages.
Our experiments show that Chinese LLMs' performance on diverse languages is indistinguishable from that of international LLMs.
We find no sign of any consistent policy, either for or against, language diversity in China's LLM development.
arXiv Detail & Related papers (2024-07-12T19:21:40Z) - CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation [49.41531871253317]
We present a new Chinese Vision-Language Understanding Evaluation benchmark dataset.
The selection of object categories and images is entirely driven by Chinese native speakers.
We find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs' understanding of Chinese culture.
arXiv Detail & Related papers (2024-07-01T08:35:37Z) - Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model [36.01840141194335]
We introduce CT-LLM, a 2B large language model (LLM).
Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by incorporating Chinese textual data.
CT-LLM excels in Chinese language tasks and showcases its adeptness in English through SFT.
arXiv Detail & Related papers (2024-04-05T15:20:02Z) - CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z) - YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z) - CINO: A Chinese Minority Pre-trained Language Model [30.447739293695026]
We propose CINO (Chinese Minority Pre-trained Language Model), a multilingual pre-trained language model for Chinese minority languages.
It covers Standard Chinese, Cantonese, and six other Chinese minority languages.
arXiv Detail & Related papers (2022-02-28T06:02:06Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group a representation sprachbund (a minimal sketch of this grouping appears after this list).
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - Investigating Transfer Learning in Multilingual Pre-trained Language Models through Chinese Natural Language Inference [11.096793445651313]
We investigate the cross-lingual transfer abilities of XLM-R for Chinese and English natural language inference (NLI).
To better understand linguistic transfer, we created 4 categories of challenge and adversarial tasks for Chinese.
We find that cross-lingual models trained on English NLI do transfer well across our Chinese tasks.
arXiv Detail & Related papers (2021-06-07T22:00:18Z) - CPM: A Large-scale Generative Chinese Pre-trained Language Model [76.65305358932393]
We release the Chinese Pre-trained Language Model (CPM) with generative pre-training on large-scale Chinese training data.
CPM achieves strong performance on many NLP tasks in the settings of few-shot (even zero-shot) learning.
arXiv Detail & Related papers (2020-12-01T11:32:56Z)
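The sketch below, referenced from the "Discovering Representation Sprachbund" entry above, shows one plausible way to group languages by their representations: derive one embedding per language from a multilingual encoder, then cluster the languages. The checkpoint, sample sentences, and cluster count are illustrative assumptions; the cited paper's exact procedure may differ.

```python
# Hedged sketch of grouping languages into "representation sprachbunds":
# mean-pool encoder states per sentence, average them into a language vector,
# then cluster the language vectors. All inputs below are placeholders.
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
enc = AutoModel.from_pretrained("xlm-roberta-base")

def language_vector(sentences: list[str]) -> np.ndarray:
    """Mean-pooled hidden states, averaged over a language's sample sentences."""
    vecs = []
    for s in sentences:
        batch = tok(s, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = enc(**batch).last_hidden_state  # (1, seq_len, dim)
        vecs.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.mean(vecs, axis=0)

# Illustrative monolingual samples; a real study would use a larger parallel corpus.
samples = {
    "en": ["The weather is nice today."],
    "de": ["Das Wetter ist heute schön."],
    "zh": ["今天天气很好。"],
    "yue": ["今日天氣好好。"],
}
matrix = np.stack([language_vector(v) for v in samples.values()])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(matrix)
for lang, group in zip(samples, labels):
    print(f"{lang}: sprachbund group {group}")
```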
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.