Unsupervised Mandarin-Cantonese Machine Translation
- URL: http://arxiv.org/abs/2301.03971v1
- Date: Tue, 10 Jan 2023 14:09:40 GMT
- Title: Unsupervised Mandarin-Cantonese Machine Translation
- Authors: Megan Dare, Valentina Fajardo Diaz, Averie Ho Zoen So, Yifan Wang,
Shibingfeng Zhang
- Abstract summary: We explored unsupervised machine translation between Mandarin Chinese and Cantonese.
Despite the vast number of native speakers of Cantonese, there is still no large-scale corpus for the language.
- Score: 3.1360838651190797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advancements in unsupervised machine translation have enabled the development
of machine translation systems that can translate between languages for which
there is not an abundance of parallel data available. We explored unsupervised
machine translation between Mandarin Chinese and Cantonese. Despite the vast
number of native speakers of Cantonese, there is still no large-scale corpus
for the language, largely because Cantonese is primarily used for oral
communication. The key contributions of our project include: 1. The creation of
a new corpus containing approximately 1 million Cantonese sentences, and 2. A
large-scale comparison across different model architectures, tokenization
schemes, and embedding structures. Our best model trained with character-based
tokenization and a Transformer architecture achieved a character-level BLEU of
25.1 when translating from Mandarin to Cantonese and of 24.4 when translating
from Cantonese to Mandarin. In this paper we discuss our research process,
experiments, and results.
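The character-level BLEU used above can be sketched with a small stdlib implementation (a minimal sketch of standard BLEU computed over character n-grams with a brevity penalty and no smoothing; the paper's exact scoring tool is not specified here, and practical evaluations typically use an established implementation such as sacreBLEU):

```python
import math
from collections import Counter

def char_ngrams(text, n):
    """Return a Counter of character n-grams of order n (whitespace removed)."""
    chars = list(text.replace(" ", ""))
    return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))

def char_bleu(hypothesis, reference, max_n=4):
    """Character-level BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) multiplied by a brevity penalty."""
    hyp, ref = hypothesis.replace(" ", ""), reference.replace(" ", "")
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = char_ngrams(hyp, n)
        ref_ngrams = char_ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        total = sum(hyp_ngrams.values())
        if total == 0 or overlap == 0:
            return 0.0  # unsmoothed: any zero precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)
```

For identical hypothesis and reference this returns 1.0; reported scores like 25.1 correspond to the conventional 0-100 scale (multiply by 100).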
Related papers
- Low-Resource NMT: A Case Study on the Written and Spoken Languages in Hong Kong [25.358712649791393]
Spoken Cantonese can be transcribed into Chinese characters, which constitute the so-called written Cantonese.
Written Cantonese exhibits significant lexical and grammatical differences from standard written Chinese.
This paper describes a transformer-based neural machine translation (NMT) system for written-Chinese-to-written-Cantonese translation.
arXiv Detail & Related papers (2025-05-23T12:32:01Z)
- Developing and Utilizing a Large-Scale Cantonese Dataset for Multi-Tasking in Large Language Models [37.92781445130664]
Despite having more than 85 million native speakers, Cantonese is still considered a low-resource language.
We collect Cantonese texts from a variety of sources, including open source corpora, Hong Kong-specific forums, Wikipedia, and Common Crawl data.
We conduct rigorous data processing through language filtering, quality filtering, content filtering, and de-duplication steps, successfully constructing a high-quality Cantonese corpus.
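The processing steps described above might be sketched as follows (the specific filters, thresholds, and hashing scheme here are illustrative assumptions, not the paper's actual pipeline):

```python
import hashlib
import re
import unicodedata

# Basic CJK Unified Ideographs block; real language filtering for Cantonese
# would need much more than this (e.g. Cantonese-specific characters and words).
CJK = re.compile(r"[\u4e00-\u9fff]")

def passes_quality_filter(line, min_chars=5, min_cjk_ratio=0.6):
    """Toy language/quality filter: keep lines that are long enough and
    mostly CJK. Thresholds are illustrative, not the paper's."""
    line = line.strip()
    if len(line) < min_chars:
        return False
    return len(CJK.findall(line)) / len(line) >= min_cjk_ratio

def dedup(lines):
    """Exact de-duplication on Unicode-normalized text via content hashing."""
    seen, kept = set(), []
    for line in lines:
        key = hashlib.md5(
            unicodedata.normalize("NFKC", line.strip()).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(line.strip())
    return kept

def build_corpus(raw_lines):
    """Deduplicate, then filter: a minimal corpus-cleaning sketch."""
    return [l for l in dedup(raw_lines) if passes_quality_filter(l)]
```

Large-scale corpora would typically also use near-duplicate detection (e.g. MinHash) rather than exact hashing alone.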
arXiv Detail & Related papers (2025-03-05T17:53:07Z)
- The Role of Handling Attributive Nouns in Improving Chinese-To-English Machine Translation [5.64086253718739]
We specifically target the translation challenges posed by attributive nouns in Chinese, which frequently cause ambiguities in English translation.
By manually inserting the omitted particle 的 ('DE'), we improve how this critical function word is handled.
arXiv Detail & Related papers (2024-12-18T20:37:52Z)
- Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages [55.157295899188476]
Neural machine translation systems learn to map sentences of different languages into a common representation space.
In this work, we test this hypothesis by zero-shot translating from unseen languages.
We demonstrate that this setup enables zero-shot translation from entirely unseen languages.
arXiv Detail & Related papers (2024-08-05T07:58:58Z)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z)
- HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation [29.990957948085956]
We introduce HK-LegiCoST, a new three-way parallel corpus of Cantonese-English translations.
We describe the notable challenges in corpus preparation: segmentation, alignment of long audio recordings, and sentence-level alignment with non-verbatim transcripts.
arXiv Detail & Related papers (2023-06-20T03:09:32Z)
- Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models [55.35106713257871]
We introduce DecoMT, a novel approach of few-shot prompting that decomposes the translation process into a sequence of word chunk translations.
We show that DecoMT outperforms the strong few-shot prompting BLOOM model with an average improvement of 8 chrF++ scores across the examined languages.
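The chunk-decomposition idea can be illustrated with a minimal sketch (the chunk size and the `translate_chunk` stub below are hypothetical stand-ins for DecoMT's few-shot LLM prompts; the actual method also performs a further contextual translation pass that this sketch omits):

```python
def chunk(words, size=3):
    """Split a tokenized sentence into fixed-size word chunks
    (the chunk size is an illustrative choice, not DecoMT's setting)."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def decomposed_translate(sentence, translate_chunk):
    """Sketch of decomposed prompting: translate each chunk independently
    with `translate_chunk` (standing in for a few-shot LLM call), then
    concatenate the chunk translations."""
    words = sentence.split()
    pieces = [translate_chunk(" ".join(c)) for c in chunk(words)]
    return " ".join(pieces)
```

Independent chunk translation works best for closely related languages with similar word order, which matches the paper's focus.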
arXiv Detail & Related papers (2023-05-22T14:52:47Z)
- A Bilingual Parallel Corpus with Discourse Annotations [82.07304301996562]
This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set.
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
arXiv Detail & Related papers (2022-10-26T12:33:53Z)
- A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation [6.090922774386845]
We propose a novel Chinese dialect TTS frontend with a translation module.
It helps to convert Mandarin text into idiomatic expressions with correct orthography and grammar.
It is the first known work to incorporate translation with TTS.
arXiv Detail & Related papers (2022-06-10T07:46:34Z)
- Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It spans the philosophy, politics, education, culture, lifestyle, and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
arXiv Detail & Related papers (2022-01-07T12:09:15Z)
- Local Translation Services for Neglected Languages [0.0]
This research illustrates translating two historically interesting, but obfuscated languages: 1) hacker-speak ("l33t") and 2) reverse (or "mirror") writing as practiced by Leonardo da Vinci.
The original contribution highlights a fluent translator of hacker-speak in under 50 megabytes.
The long short-term memory, recurrent neural network (LSTM-RNN) extends previous work demonstrating an English-to-foreign translation service built from as little as 10,000 bilingual sentence pairs.
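Both toy languages above can also be handled by trivial rule-based baselines, which is what makes them useful testbeds; a sketch (this substitution table is one common l33t variant, chosen for illustration, and not the paper's learned LSTM model):

```python
# One common l33t substitution table (many variants exist).
L33T = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "t": "7", "s": "5"})
REVERSE = str.maketrans({"4": "a", "3": "e", "1": "i", "0": "o", "7": "t", "5": "s"})

def to_l33t(text):
    """Encode lowercase text into l33t via character substitution."""
    return text.lower().translate(L33T)

def from_l33t(text):
    """Decode l33t back to plain text (assumes digits only encode letters)."""
    return text.translate(REVERSE)

def mirror(text):
    """Reverse ('mirror') writing, as practiced by Leonardo da Vinci."""
    return text[::-1]
```

The point of the paper is that a compact learned model can recover such mappings from examples alone, without these hand-written rules.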
arXiv Detail & Related papers (2021-01-05T16:25:51Z)
- Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus [0.6445605125467573]
We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji.
We also provide 1,797 and 650 translation pairs in English-Kurmanji and English-Sorani.
arXiv Detail & Related papers (2020-10-04T11:52:50Z)
- Translation Artifacts in Cross-lingual Transfer Learning [51.66536640084888]
We show that machine translation can introduce subtle artifacts that have a notable impact in existing cross-lingual models.
In natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them.
We also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.
arXiv Detail & Related papers (2020-04-09T17:54:30Z)
- g2pM: A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset [14.323478990713477]
We introduce a new benchmark dataset that consists of 99,000+ sentences for Chinese polyphone disambiguation.
We train a simple neural network model on it, and find that it outperforms other preexisting G2P systems.
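The polyphone-disambiguation task can be illustrated with a toy rule-based lookup (the character, readings, and context words below are illustrative examples; g2pM itself replaces such rules with a learned neural model):

```python
# Toy polyphone table: a polyphonic character mapped to context words
# that select each pinyin reading.
POLYPHONES = {
    "行": {"银行": "hang2", "行走": "xing2"},
}
DEFAULT_READING = {"行": "xing2"}

def disambiguate(sentence, index):
    """Pick a pinyin reading for the polyphonic character at `index`
    by matching known context words in the sentence, falling back to
    a default reading when no context word matches."""
    char = sentence[index]
    for word, reading in POLYPHONES.get(char, {}).items():
        if word in sentence:
            return reading
    return DEFAULT_READING.get(char)
```

Hand-written rules like these scale poorly, which is the motivation for training a neural disambiguator on a large benchmark dataset.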
arXiv Detail & Related papers (2020-04-07T05:44:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.