Low-Resource NMT: A Case Study on the Written and Spoken Languages in Hong Kong
- URL: http://arxiv.org/abs/2505.17816v1
- Date: Fri, 23 May 2025 12:32:01 GMT
- Authors: Hei Yi Mak, Tan Lee
- Abstract summary: Spoken Cantonese can be transcribed into Chinese characters, which constitute the so-called written Cantonese. Written Cantonese exhibits significant lexical and grammatical differences from standard written Chinese. This paper describes a transformer-based neural machine translation (NMT) system for written-Chinese-to-written-Cantonese translation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The majority of inhabitants in Hong Kong are able to read and write standard Chinese but use Cantonese as the primary spoken language in daily life. Spoken Cantonese can be transcribed into Chinese characters, which constitute the so-called written Cantonese. Written Cantonese exhibits significant lexical and grammatical differences from standard written Chinese. The rise of written Cantonese is increasingly evident in the cyber world. The growing interaction between Mandarin speakers and Cantonese speakers is leading to a clear demand for automatic translation between Chinese and Cantonese. This paper describes a transformer-based neural machine translation (NMT) system for written-Chinese-to-written-Cantonese translation. Given that parallel text data for Chinese and Cantonese are extremely scarce, a major focus of this study is on preparing a sufficient amount of training data for NMT. In addition to collecting 28K parallel sentences from previous linguistic studies and scattered internet resources, we devise an effective approach to obtaining 72K parallel sentences by automatically extracting pairs of semantically similar sentences from parallel articles on Chinese Wikipedia and Cantonese Wikipedia. We show that leveraging highly similar sentence pairs mined from Wikipedia improves translation performance on all test sets. Our system outperforms Baidu Fanyi's Chinese-to-Cantonese translation on 6 out of 8 test sets in BLEU score. Translation examples reveal that our system captures important linguistic transformations between standard Chinese and spoken Cantonese.
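The sentence-pair mining step described in the abstract can be illustrated with a minimal sketch. The paper's actual similarity model is not specified in this summary; the sketch below uses character-bigram Jaccard overlap as a simple stand-in, and the function names and threshold are illustrative assumptions.

```python
# Hypothetical sketch of mining semantically similar sentence pairs from
# comparable Chinese/Cantonese Wikipedia articles. Character-bigram Jaccard
# overlap stands in for the paper's (unspecified) similarity measure.

def char_bigrams(sentence: str) -> set:
    """Return the set of character bigrams of a sentence."""
    return {sentence[i:i + 2] for i in range(len(sentence) - 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def mine_pairs(zh_sents, yue_sents, threshold=0.5):
    """Greedily pair each Chinese sentence with its most similar Cantonese
    sentence, keeping only pairs whose similarity meets the threshold."""
    pairs = []
    for zh in zh_sents:
        zh_bi = char_bigrams(zh)
        best, best_sim = None, 0.0
        for yue in yue_sents:
            sim = jaccard(zh_bi, char_bigrams(yue))
            if sim > best_sim:
                best, best_sim = yue, sim
        if best is not None and best_sim >= threshold:
            pairs.append((zh, best, best_sim))
    return pairs
```

Because standard Chinese and written Cantonese share most of their character inventory, near-translations such as 香港是一個國際城市。/ 香港係一個國際城市。 score highly, while unrelated sentences fall below the threshold and are discarded.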
Related papers
- Developing and Utilizing a Large-Scale Cantonese Dataset for Multi-Tasking in Large Language Models [37.92781445130664]
Despite having more than 85 million native speakers, Cantonese is still considered a low-resource language. We collect Cantonese texts from a variety of sources, including open-source corpora, Hong Kong-specific forums, Wikipedia, and Common Crawl data. We conduct rigorous data processing through language filtering, quality filtering, content filtering, and de-duplication steps, successfully constructing a high-quality Cantonese corpus.
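The filtering and de-duplication steps named in this abstract can be sketched as a simple pipeline. The marker characters, length bounds, and hashing choice below are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of corpus cleaning: language filtering, quality
# filtering, and hash-based de-duplication. Thresholds and markers are
# illustrative, not taken from the paper.
import hashlib

# Characters common in written Cantonese but rare in standard written Chinese.
CANTONESE_MARKERS = set("嘅咗喺係唔啲咁嚟佢冇")

def looks_cantonese(text: str) -> bool:
    """Language filter: keep text containing at least one Cantonese marker."""
    return any(ch in CANTONESE_MARKERS for ch in text)

def good_quality(text: str, min_len: int = 5, max_len: int = 500) -> bool:
    """Quality filter: length bounds as a crude proxy for usable text."""
    return min_len <= len(text) <= max_len

def clean_corpus(lines):
    """Apply the language filter, the quality filter, then de-duplicate
    surviving lines via SHA-1 hashes of their content."""
    seen, out = set(), []
    for line in lines:
        line = line.strip()
        if not looks_cantonese(line) or not good_quality(line):
            continue
        h = hashlib.sha1(line.encode("utf-8")).hexdigest()
        if h in seen:
            continue
        seen.add(h)
        out.append(line)
    return out
```

Hashing full lines is a cheap exact-match de-duplicator; a production pipeline would likely add near-duplicate detection (e.g. shingling) on top of these stages.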
arXiv Detail & Related papers (2025-03-05T17:53:07Z) - The Role of Handling Attributive Nouns in Improving Chinese-To-English Machine Translation [5.64086253718739]
We specifically target the translation challenges posed by attributive nouns in Chinese, which frequently cause ambiguities in English translation. By manually inserting the omitted particle 的 ('DE'), we improve how this critical function word is handled.
arXiv Detail & Related papers (2024-12-18T20:37:52Z) - When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z) - How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models [42.83419530688604]
Underrepresented languages like Cantonese, spoken by over 85 million people, face significant development gaps. Despite its wide use, Cantonese has scant representation in NLP research, especially compared to other languages from similarly developed regions. We outline current Cantonese NLP methods and introduce new benchmarks designed to evaluate LLM performance in factual generation, mathematical logic, complex reasoning, and general knowledge in Cantonese.
arXiv Detail & Related papers (2024-08-29T17:54:14Z) - Enhancing Cross-lingual Transfer via Phonemic Transcription Integration [57.109031654219294]
PhoneXL is a framework incorporating phonemic transcriptions as an additional linguistic modality for cross-lingual transfer.
Our pilot study reveals phonemic transcription provides essential information beyond the orthography to enhance cross-lingual transfer.
arXiv Detail & Related papers (2023-07-10T06:17:33Z) - HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation [29.990957948085956]
We introduce HK-LegiCoST, a new three-way parallel corpus of Cantonese-English translations.
We describe the notable challenges in corpus preparation: segmentation, alignment of long audio recordings, and sentence-level alignment with non-verbatim transcripts.
arXiv Detail & Related papers (2023-06-20T03:09:32Z) - Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantics understanding ability of the Chinese PLMs with dictionary knowledge and structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z) - Unsupervised Mandarin-Cantonese Machine Translation [3.1360838651190797]
We explored unsupervised machine translation between Mandarin Chinese and Cantonese.
Despite the vast number of native speakers of Cantonese, there is still no large-scale corpus for the language.
arXiv Detail & Related papers (2023-01-10T14:09:40Z) - A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation [6.090922774386845]
We propose a novel Chinese dialect TTS with a translation module.
It helps to convert Mandarin text into idiomatic expressions with correct orthography and grammar.
It is the first known work to incorporate translation with TTS.
arXiv Detail & Related papers (2022-06-10T07:46:34Z) - Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It spans the philosophy, politics, education, culture, lifestyle, and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
arXiv Detail & Related papers (2022-01-07T12:09:15Z) - SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models.
We propose two kinds of tokenizers: 1) SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers; 2) JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z) - Learning to Pronounce Chinese Without a Pronunciation Dictionary [10.622817647136667]
We demonstrate a program that learns to pronounce Chinese text in Mandarin, without a pronunciation dictionary.
From non-parallel streams of Chinese characters and Chinese pinyin syllables, it establishes a many-to-many mapping between characters and pronunciations.
Its token-level character-to-syllable accuracy is 89%, which significantly exceeds the 22% accuracy of prior work.
arXiv Detail & Related papers (2020-10-09T18:03:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.