\textsc{CantoNLU}: A benchmark for Cantonese natural language understanding
- URL: http://arxiv.org/abs/2510.20670v1
- Date: Thu, 23 Oct 2025 15:47:27 GMT
- Title: \textsc{CantoNLU}: A benchmark for Cantonese natural language understanding
- Authors: Junghyun Min, York Hay Ng, Sophia Chan, Helena Shunhua Zhao, En-Shiun Annie Lee
- Abstract summary: We introduce \textsc{\textbf{CantoNLU}}, a benchmark for Cantonese natural language understanding (NLU). This novel benchmark spans seven tasks covering syntax and semantics, including word sense disambiguation, linguistic acceptability judgment, language detection, natural language inference, sentiment analysis, part-of-speech tagging, and dependency parsing. Results show that Cantonese-adapted models perform best overall, while monolingual models perform better on syntactic tasks.
- Score: 2.6328168463115684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cantonese, although spoken by millions, remains under-resourced due to policy and diglossia. To address this scarcity of evaluation frameworks for Cantonese, we introduce \textsc{\textbf{CantoNLU}}, a benchmark for Cantonese natural language understanding (NLU). This novel benchmark spans seven tasks covering syntax and semantics, including word sense disambiguation, linguistic acceptability judgment, language detection, natural language inference, sentiment analysis, part-of-speech tagging, and dependency parsing. In addition to the benchmark, we provide model baseline performance across a set of models: a Mandarin model without Cantonese training, two Cantonese-adapted models obtained by continual pre-training a Mandarin model on Cantonese text, and a monolingual Cantonese model trained from scratch. Results show that Cantonese-adapted models perform best overall, while monolingual models perform better on syntactic tasks. Mandarin models remain competitive in certain settings, indicating that direct transfer may be sufficient when Cantonese domain data is scarce. We release all datasets, code, and model weights to facilitate future research in Cantonese NLP.
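As a rough illustration of the continual pre-training recipe behind the Cantonese-adapted baselines, here is a minimal sketch assuming bert-base-chinese as the Mandarin base model and a hypothetical one-sentence-per-line Cantonese corpus at cantonese.txt; neither choice is confirmed by the paper.

```python
# Minimal continual pre-training sketch (assumptions: bert-base-chinese as
# the Mandarin base; cantonese.txt as a plain-text Cantonese corpus).
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "bert-base-chinese"  # Mandarin base model (assumption)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# One sentence per line (hypothetical corpus file).
corpus = load_dataset("text", data_files={"train": "cantonese.txt"})["train"]
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cantonese-adapted",
                           per_device_train_batch_size=32,
                           num_train_epochs=1),
    train_dataset=tokenized,
    # Standard masked-language-modeling objective with 15% masking.
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
```

The monolingual-from-scratch baseline would differ mainly in initializing the model from a config rather than from pre-trained Mandarin weights.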
Related papers
- Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically [58.019484208091534]
Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. It remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech.
arXiv Detail & Related papers (2025-05-26T07:21:20Z)
- Low-Resource NMT: A Case Study on the Written and Spoken Languages in Hong Kong [25.358712649791393]
Spoken Cantonese can be transcribed into Chinese characters, which constitute the so-called written Cantonese. Written Cantonese exhibits significant lexical and grammatical differences from standard written Chinese. This paper describes a transformer-based neural machine translation (NMT) system for written-Chinese-to-written-Cantonese translation.
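A minimal sketch of a text-to-text setup for this translation direction, assuming google/mt5-small as the base model; the parallel pair below is invented for illustration and is not from the paper's data.

```python
# One fine-tuning step for written-Chinese -> written-Cantonese translation
# (assumption: mt5-small as the seq2seq base model).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

src = "他在哪里？"  # standard written Chinese (invented example)
tgt = "佢喺邊度？"  # written Cantonese equivalent

inputs = tokenizer(src, return_tensors="pt")
labels = tokenizer(text_target=tgt, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # training loss for this pair
loss.backward()
```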
arXiv Detail & Related papers (2025-05-23T12:32:01Z)
- HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs [0.0]
The HKCanto-Eval benchmark is designed to evaluate large language models on Cantonese language understanding tasks. It integrates cultural and linguistic nuances intrinsic to Hong Kong, providing a robust framework for assessing language models in realistic scenarios. Our findings indicate that while proprietary models generally outperform open-weight models, significant limitations remain in handling Cantonese-specific linguistic and cultural knowledge.
arXiv Detail & Related papers (2025-03-16T10:26:24Z)
- Developing and Utilizing a Large-Scale Cantonese Dataset for Multi-Tasking in Large Language Models [37.92781445130664]
Despite having more than 85 million native speakers, Cantonese is still considered a low-resource language. We collect Cantonese texts from a variety of sources, including open source corpora, Hong Kong-specific forums, Wikipedia, and Common Crawl data. We conduct rigorous data processing through language filtering, quality filtering, content filtering, and de-duplication steps, successfully constructing a high-quality Cantonese corpus.
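A minimal sketch of the kind of filtering and de-duplication pipeline the abstract describes; the Cantonese-character heuristic, length threshold, and exact-hash de-duplication are illustrative assumptions, not the paper's actual criteria.

```python
# Toy corpus-cleaning pipeline: language filter -> quality filter -> dedup.
import hashlib
import re

# Characters common in written Cantonese but rare in standard written
# Chinese (crude language-filtering heuristic, an assumption).
CANTO_CHARS = re.compile(r"[嘅咗喺嗰啲哋冇乜噉]")

def is_cantonese(text: str) -> bool:
    return bool(CANTO_CHARS.search(text))

def quality_ok(text: str, min_chars: int = 10) -> bool:
    return len(text.strip()) >= min_chars

def dedup(lines):
    seen = set()
    for line in lines:
        digest = hashlib.md5(line.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield line

def build_corpus(raw_lines):
    kept = (l for l in raw_lines if is_cantonese(l) and quality_ok(l))
    return list(dedup(kept))
```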
arXiv Detail & Related papers (2025-03-05T17:53:07Z)
- How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models [42.83419530688604]
Underrepresented languages like Cantonese, spoken by over 85 million people, face significant development gaps. Despite its wide use, Cantonese has scant representation in NLP research, especially compared to other languages from similarly developed regions. We outline current Cantonese NLP methods and introduce new benchmarks designed to evaluate LLM performance in factual generation, mathematical logic, complex reasoning, and general knowledge in Cantonese.
arXiv Detail & Related papers (2024-08-29T17:54:14Z)
- Multilingual Conceptual Coverage in Text-to-Image Models [98.80343331645626]
"Conceptual Coverage Across Languages" (CoCo-CroLa) is a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns.
For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language.
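A minimal sketch of that population comparison, assuming the images for each noun have already been generated and saved, and using mean cross-population CLIP cosine similarity as an illustrative measure rather than the paper's confirmed metric.

```python
# Compare image populations for one noun across two languages via CLIP
# (assumption: images were generated beforehand and saved to disk).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    images = [Image.open(p) for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def conceptual_coverage(source_paths, target_paths):
    src, tgt = embed(source_paths), embed(target_paths)
    # Mean pairwise cosine similarity between the two populations.
    return (src @ tgt.T).mean().item()
```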
arXiv Detail & Related papers (2023-06-02T17:59:09Z)
- Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples [89.16814518860357]
The objective of this work is to explore the learning of visually grounded speech models (VGS) from a multilingual perspective.
Our key contribution in this work is to leverage the power of a high-resource language in a bilingual visually grounded speech model to improve the performance of a low-resource language.
arXiv Detail & Related papers (2023-03-30T16:34:10Z)
- A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis [10.747119651974947]
Declarative questions are commonly used in daily Cantonese conversations.
Vanilla neural text-to-speech (TTS) systems are not capable of synthesizing rising intonation for these sentences.
We propose to complement the Cantonese TTS model with a BERT-based statement/question classifier.
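A minimal sketch of such a classifier, assuming bert-base-chinese as the encoder; the label convention and the example sentence are hypothetical, and a real system would load weights fine-tuned on labeled statement/question pairs.

```python
# Statement/question classifier to drive intonation selection in TTS
# (assumptions: bert-base-chinese encoder; label 1 = question).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(name)
# The classification head here is freshly initialized; in practice one
# would load a checkpoint fine-tuned on statement/question data.
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

def is_question(sentence: str) -> bool:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).item() == 1  # 1 = question (assumed label)

# A TTS front end could switch to a rising pitch contour when this is True.
print(is_question("你食咗飯"))  # declarative question with no final particle
```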
arXiv Detail & Related papers (2022-08-03T16:21:08Z)
- Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It spans the philosophy, politics, education, culture, lifestyle, and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
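A minimal sketch of assembling the two corpora for multi-dataset training; the local MDCC audiofolder layout, its column names, and the shared 16 kHz schema are assumptions for illustration.

```python
# Combine MDCC (hypothetical local audiofolder copy) with Common Voice zh-HK.
from datasets import Audio, concatenate_datasets, load_dataset

cv = load_dataset("mozilla-foundation/common_voice_11_0", "zh-HK",
                  split="train")
mdcc = load_dataset("audiofolder", data_dir="mdcc/", split="train")

# Align both corpora to the same two-column schema (names are assumptions).
cv = cv.select_columns(["audio", "sentence"])
mdcc = mdcc.rename_column("transcription", "sentence")
mdcc = mdcc.select_columns(["audio", "sentence"])

# Resample to a shared rate, then concatenate for joint ASR training.
combined = concatenate_datasets([
    d.cast_column("audio", Audio(sampling_rate=16_000)) for d in (cv, mdcc)
])
```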
arXiv Detail & Related papers (2022-01-07T12:09:15Z)
- Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
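A minimal sketch of that few-shot setup, with English demonstrations followed by a non-English test sample; the stand-in model and the examples are illustrative only.

```python
# Few-shot prompt: English demonstrations, then a Cantonese test sample.
from transformers import pipeline

# Any causal LM works as a stand-in here; the paper evaluates GPT and T5
# variants rather than this model.
generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Review: The movie was fantastic. Sentiment: positive\n"
    "Review: I hated every minute of it. Sentiment: negative\n"
    "Review: 套戲好好睇。 Sentiment:"  # Cantonese: "The movie is great."
)
out = generator(prompt, max_new_tokens=2, do_sample=False)
print(out[0]["generated_text"])
```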
arXiv Detail & Related papers (2021-09-16T03:08:22Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis. We cluster all the target languages into multiple groups and call each group a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
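A minimal sketch of the clustering step described above, assuming mean-pooled xlm-roberta-base embeddings of one probe sentence per language and k-means; both choices are illustrative stand-ins for the paper's method.

```python
# Cluster language representations into candidate "representation sprachbunds".
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

# One probe sentence per language (hypothetical probe data).
sentences = {"en": "The cat is sleeping.", "de": "Die Katze schläft.",
             "zh": "猫在睡觉。", "yue": "隻貓瞓緊覺。"}

def embed(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)  # mean-pooled sentence vector

vectors = torch.stack([embed(s) for s in sentences.values()]).numpy()
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(dict(zip(sentences, labels)))  # each cluster ≈ one sprachbund
```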
arXiv Detail & Related papers (2021-09-01T09:32:06Z)