How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models
- URL: http://arxiv.org/abs/2408.16756v3
- Date: Mon, 17 Feb 2025 06:59:54 GMT
- Title: How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models
- Authors: Jiyue Jiang, Pengan Chen, Liheng Chen, Sheng Wang, Qinghang Bao, Lingpeng Kong, Yu Li, Chuan Wu
- Abstract summary: Underrepresented languages like Cantonese, spoken by over 85 million people, face significant development gaps. Despite its wide use, Cantonese has scant representation in NLP research, especially compared to other languages from similarly developed regions. We outline current Cantonese NLP methods and introduce new benchmarks designed to evaluate LLM performance in factual generation, mathematical logic, complex reasoning, and general knowledge in Cantonese.
- Score: 42.83419530688604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid evolution of large language models (LLMs) has transformed the competitive landscape in natural language processing (NLP), particularly for English and other data-rich languages. However, underrepresented languages like Cantonese, spoken by over 85 million people, face significant development gaps, which is particularly concerning given the economic significance of the Guangdong-Hong Kong-Macau Greater Bay Area and the substantial Cantonese-speaking populations in places like Singapore and North America. Despite its wide use, Cantonese has scant representation in NLP research, especially compared to other languages from similarly developed regions. To bridge these gaps, we outline current Cantonese NLP methods and introduce new benchmarks designed to evaluate LLM performance in factual generation, mathematical logic, complex reasoning, and general knowledge in Cantonese, with the aim of advancing open-source Cantonese LLM technology. We also propose future research directions and recommended models to enhance Cantonese LLM development.
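As a rough illustration of how such a benchmark could be scored, the sketch below runs a model over a JSONL file of Cantonese question-answer pairs and reports exact-match accuracy. The file layout, field names, and `model.generate` interface are illustrative assumptions, not the authors' released evaluation code.

```python
import json

def exact_match(prediction: str, reference: str) -> bool:
    """Compare a model answer with the gold answer after trimming whitespace."""
    return prediction.strip() == reference.strip()

def evaluate(model, benchmark_path: str) -> float:
    """Score `model` on a JSONL benchmark of Cantonese question-answer pairs."""
    correct, total = 0, 0
    with open(benchmark_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)                        # assumed fields: "question", "answer"
            prediction = model.generate(item["question"])  # assumed model interface
            correct += exact_match(prediction, item["answer"])
            total += 1
    return correct / max(total, 1)
```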
Related papers
- Assessing Thai Dialect Performance in LLMs with Automatic Benchmarks and Human Evaluation [16.969791483451562]
We introduce a Thai local dialect benchmark covering Northern (Lanna), Northeastern (Isan), and Southern (Dambro) Thai.
We evaluate LLMs on five NLP tasks: summarization, question answering, translation, conversation, and food-related tasks.
Results show that LLM performance declines significantly in local Thai dialects compared to standard Thai.
arXiv Detail & Related papers (2025-04-08T10:49:45Z) - SEA-LION: Southeast Asian Languages in One Network [16.12423506306059]
We introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages.
The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer.
arXiv Detail & Related papers (2025-04-08T07:24:51Z) - Developing and Utilizing a Large-Scale Cantonese Dataset for Multi-Tasking in Large Language Models [37.92781445130664]
Despite having more than 85 million native speakers, Cantonese is still considered a low-resource language.
We collect Cantonese texts from a variety of sources, including open source corpora, Hong Kong-specific forums, Wikipedia, and Common Crawl data.
We conduct rigorous data processing through language filtering, quality filtering, content filtering, and de-duplication steps, successfully constructing a high-quality Cantonese corpus.
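A minimal sketch of this kind of cleaning pipeline (language filtering, quality filtering, content filtering, de-duplication) follows. The fastText language-ID model, the 0.8 confidence threshold, the 50-character length cutoff, and the blocklist terms are illustrative assumptions, not the paper's actual settings.

```python
import hashlib
import fasttext  # lid.176.bin is the public fastText language-identification model

lid_model = fasttext.load_model("lid.176.bin")
BLOCKLIST = {"spam", "advert"}  # placeholder content-filter terms

def clean_corpus(documents):
    """Yield documents that survive language, quality, content, and duplicate filters."""
    seen_hashes = set()
    for doc in documents:
        # 1. Language filtering: keep text identified as Cantonese (yue) or Chinese (zh).
        labels, probs = lid_model.predict(doc.replace("\n", " "))
        if labels[0] not in ("__label__yue", "__label__zh") or probs[0] < 0.8:
            continue
        # 2. Quality filtering: drop very short fragments.
        if len(doc) < 50:
            continue
        # 3. Content filtering: drop documents containing blocklisted terms.
        if any(term in doc for term in BLOCKLIST):
            continue
        # 4. De-duplication by exact content hash.
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        yield doc
```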
arXiv Detail & Related papers (2025-03-05T17:53:07Z) - The Rise and Down of Babel Tower: Investigating the Evolution Process of Multilingual Code Large Language Model [59.357993924917]
We study the evolution of multilingual capabilities in large language models (LLMs) during the pre-training process.
We propose the Babel Tower Hypothesis, which describes the entire process of LLMs acquiring new language capabilities.
We propose a novel method to construct an optimized pre-training corpus for multilingual code LLMs.
arXiv Detail & Related papers (2024-12-10T08:28:57Z) - Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs [13.558778781305998]
Large Language Models (LLMs) are predominantly designed with English as the primary language.
Even the few that are multilingual tend to exhibit strong English-centric biases.
This paper introduces novel automatic corpus-level metrics to assess the lexical and syntactic naturalness of multilingual outputs.
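To make the idea concrete, the sketch below frames one possible corpus-level lexical measure: compare the word distribution of model outputs with that of native-written text via Jensen-Shannon divergence, where smaller divergence would suggest more natural lexical choices. This is an assumption about the general approach, not the paper's actual metric.

```python
from collections import Counter
import math

def word_distribution(texts):
    """Relative frequency of each whitespace-separated token across a corpus."""
    counts = Counter(w for t in texts for w in t.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two word distributions."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    def kl(a, b):
        return sum(a[w] * math.log2(a[w] / b[w]) for w in vocab if a.get(w, 0.0) > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Placeholder corpora; lower divergence from native text = more natural word choice.
model_outputs = ["the model writes sentences like this"]
native_corpus = ["native speakers write sentences like this"]
naturalness_gap = js_divergence(word_distribution(model_outputs), word_distribution(native_corpus))
```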
arXiv Detail & Related papers (2024-10-21T12:34:17Z) - Converging to a Lingua Franca: Evolution of Linguistic Regions and Semantics Alignment in Multilingual Large Language Models [11.423589362950812]
Large language models (LLMs) have demonstrated remarkable performance, particularly in multilingual contexts.
Recent studies suggest that LLMs can transfer skills learned in one language to others, but the internal mechanisms behind this ability remain unclear.
This paper provides insights into the internal workings of LLMs, offering a foundation for future improvements in their cross-lingual capabilities.
arXiv Detail & Related papers (2024-10-15T15:49:15Z) - SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z) - How Chinese are Chinese Language Models? The Puzzling Lack of Language Policy in China's LLMs [2.9123921488295768]
We evaluate six open-source multilingual LLMs pre-trained by Chinese companies on 18 languages.
Our experiments show that the performance of Chinese LLMs on diverse languages is indistinguishable from that of international LLMs.
We find no sign of any consistent policy, either for or against language diversity, in China's LLM development.
arXiv Detail & Related papers (2024-07-12T19:21:40Z) - Teaching Large Language Models an Unseen Language on the Fly [32.83773919852362]
We introduce DiPMT++, a framework for adapting LLMs to unseen languages by in-context learning.
Using a dictionary and 5K parallel sentences only, DiPMT++ significantly enhances the performance of GPT-4 from 0 to 16 BLEU for Chinese-to-Zhuang translation.
We also validate the effectiveness of our framework on Kalamang, another unseen language.
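The sketch below illustrates the general in-context prompting idea summarized above: pack dictionary glosses and a handful of parallel sentences into the prompt so a model can attempt translation into a language it was not trained on. The prompt layout and helper name are illustrative assumptions, not the DiPMT++ implementation.

```python
def build_translation_prompt(source_sentence, dictionary, parallel_examples):
    """Assemble a prompt from word-level glosses and a few parallel sentence pairs.

    Assumes `source_sentence` is already segmented into whitespace-separated words
    and `dictionary` maps source words to target-language glosses.
    """
    gloss_lines = [
        f"{word} -> {dictionary[word]}"
        for word in source_sentence.split()
        if word in dictionary
    ]
    example_lines = [f"Source: {src}\nTarget: {tgt}" for src, tgt in parallel_examples]
    return (
        "Translate the sentence into the target language.\n\n"
        "Dictionary entries:\n" + "\n".join(gloss_lines) + "\n\n"
        "Parallel examples:\n" + "\n\n".join(example_lines) + "\n\n"
        f"Source: {source_sentence}\nTarget:"
    )
```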
arXiv Detail & Related papers (2024-02-29T13:50:47Z) - Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models [117.20416338476856]
Large language models (LLMs) demonstrate remarkable multilingual capabilities without being pre-trained on specially curated multilingual parallel corpora.
We propose a novel detection method, language activation probability entropy (LAPE), to identify language-specific neurons within LLMs.
Our findings indicate that LLMs' proficiency in processing a particular language is predominantly due to a small subset of neurons.
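As a rough sketch of the quantity behind LAPE, the snippet below computes, for every neuron, the entropy of its empirical activation probabilities across languages; neurons with low entropy fire mostly for one language and are candidates for language-specific neurons. Tensor shapes and the normalization step are assumptions, not the paper's exact procedure.

```python
import torch

def language_activation_entropy(activation_probs: torch.Tensor) -> torch.Tensor:
    """activation_probs: [num_languages, num_neurons]; each entry is the empirical
    probability that a neuron activates on text of that language."""
    # Normalize per neuron so the per-language probabilities form a distribution.
    dist = activation_probs / activation_probs.sum(dim=0, keepdim=True).clamp_min(1e-12)
    # Entropy over languages for each neuron; low entropy => language-specific.
    return -(dist * dist.clamp_min(1e-12).log()).sum(dim=0)

# Example: 4 languages, 8 neurons, random activation probabilities.
probs = torch.rand(4, 8)
entropy = language_activation_entropy(probs)
language_specific = (entropy < entropy.median()).nonzero(as_tuple=True)[0]
```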
arXiv Detail & Related papers (2024-02-26T09:36:05Z) - Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs' general-purpose language understanding and generation abilities are acquired by training billions of model parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z) - SeaLLMs -- Large Language Models for Southeast Asia [76.50157503379086]
We introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages.
SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning.
Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities.
arXiv Detail & Related papers (2023-12-01T17:17:56Z) - On the (In)Effectiveness of Large Language Models for Chinese Text Correction [44.32102000125604]
Large Language Models (LLMs) have amazed the entire Artificial Intelligence community.
This study focuses on Chinese Text Correction, a fundamental and challenging Chinese NLP task.
We empirically find that current LLMs show both impressive performance and unsatisfactory behavior on Chinese Text Correction.
arXiv Detail & Related papers (2023-07-18T06:48:52Z) - Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It spans philosophy, politics, education, culture, lifestyle, and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
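A minimal sketch of that multi-dataset training setup, assuming both corpora are exposed as (audio path, transcript) pairs; the dataset wrapper and manifests below are placeholders, not the released training code.

```python
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class SpeechDataset(Dataset):
    """Placeholder dataset yielding (audio_path, transcript) pairs from a manifest."""
    def __init__(self, manifest):
        self.items = manifest
    def __len__(self):
        return len(self.items)
    def __getitem__(self, idx):
        return self.items[idx]

mdcc = SpeechDataset([("mdcc/0001.wav", "placeholder transcript")])
common_voice = SpeechDataset([("cv_zh_hk/0001.wav", "placeholder transcript")])

# Multi-dataset learning: one ASR model trained on the concatenation of both corpora.
combined = ConcatDataset([mdcc, common_voice])
loader = DataLoader(combined, batch_size=16, shuffle=True)
```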
arXiv Detail & Related papers (2022-01-07T12:09:15Z)