ChrEn: Cherokee-English Machine Translation for Endangered Language
Revitalization
- URL: http://arxiv.org/abs/2010.04791v1
- Date: Fri, 9 Oct 2020 20:28:06 GMT
- Title: ChrEn: Cherokee-English Machine Translation for Endangered Language
Revitalization
- Authors: Shiyue Zhang, Benjamin Frey, Mohit Bansal
- Abstract summary: Cherokee is a highly endangered Native American language spoken by the Cherokee people.
There are only about 2,000 fluent first-language Cherokee speakers remaining in the world.
- Score: 91.96528006301654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cherokee is a highly endangered Native American language spoken by the
Cherokee people. The Cherokee culture is deeply embedded in its language.
However, there are only about 2,000 fluent first-language Cherokee
speakers remaining in the world, and the number is declining every year. To
help save this endangered language, we introduce ChrEn, a Cherokee-English
parallel dataset, to facilitate machine translation research between Cherokee
and English. Compared to some popular machine translation language pairs, ChrEn
is extremely low-resource, only containing 14k sentence pairs in total. We
split our parallel data in ways that facilitate both in-domain and
out-of-domain evaluation. We also collect 5k Cherokee monolingual data to
enable semi-supervised learning. Besides these datasets, we propose several
Cherokee-English and English-Cherokee machine translation systems. We compare
SMT (phrase-based) versus NMT (RNN-based and Transformer-based) systems;
supervised versus semi-supervised (via language model, back-translation, and
BERT/Multilingual-BERT) methods; as well as transfer learning versus
multilingual joint training with 4 other languages. Our best results are
15.8/12.7 BLEU for in-domain and 6.5/5.0 BLEU for out-of-domain Chr-En/En-Chr
translations, respectively, and we hope that our dataset and systems will
encourage future work by the community for Cherokee language revitalization.
Our data, code, and demo will be publicly available at
https://github.com/ZhangShiyue/ChrEn
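
The BLEU figures quoted in the abstract measure n-gram overlap between system output and reference translations. As a rough illustration of the metric only, the following is a minimal sketch of corpus-level BLEU, assuming whitespace tokenization, one reference per hypothesis, and no smoothing. The function names `ngram_counts` and `corpus_bleu` are hypothetical, and this is not the scorer used to produce the paper's results.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale.

    Assumptions: whitespace tokenization, exactly one reference per
    hypothesis, uniform n-gram weights, and no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-grams per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

A hypothesis identical to its reference scores 100 under this sketch, and one sharing no words with its reference scores 0. Published scores depend on tokenization and smoothing choices, so values from this sketch are not directly comparable to the numbers above.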
Related papers
- Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages [55.157295899188476]
Neural machine translation systems learn to map sentences of different languages into a common representation space.
In this work, we test this hypothesis by zero-shot translating from unseen languages.
We demonstrate that this setup enables zero-shot translation from entirely unseen languages.
arXiv Detail & Related papers (2024-08-05T07:58:58Z)
- A Tulu Resource for Machine Translation [3.038642416291856]
We present the first parallel dataset for English-Tulu translation.
Tulu is spoken by approximately 2.5 million individuals in southwestern India.
Our English-Tulu system, trained without using parallel English-Tulu data, outperforms Google Translate by 19 BLEU points.
arXiv Detail & Related papers (2024-03-28T04:30:07Z)
- Ngambay-French Neural Machine Translation (sba-Fr) [16.55378462843573]
In Africa, and the world at large, there is an increasing focus on developing Neural Machine Translation (NMT) systems to overcome language barriers.
In this project, we created the first sba-Fr dataset, which is a corpus of Ngambay-to-French translations.
Our experiments show that the M2M100 model outperforms other models with high BLEU scores on both original and original+synthetic data.
arXiv Detail & Related papers (2023-08-25T17:13:20Z)
- Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages from the Americas are low-resource, having a limited amount of parallel and monolingual data, if any.
We discuss recent advances, findings, and open questions that result from the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
- How can NLP Help Revitalize Endangered Languages? A Case Study and Roadmap for the Cherokee Language [91.79339725967073]
More than 43% of the languages spoken in the world are endangered.
In this work, we focus on discussing how NLP can help revitalize endangered languages.
We take Cherokee, a severely-endangered Native American language, as a case study.
arXiv Detail & Related papers (2022-04-25T18:25:57Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language.
It supports both statistical and neural translation models and provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- Neural Machine Translation for Low-Resourced Indian Languages [4.726777092009554]
Machine translation is an effective approach to convert text to a different language without any human involvement.
In this paper, we apply NMT to language pairs involving two of the most morphologically rich Indian languages, i.e., English-Tamil and English-Malayalam.
We proposed a novel NMT model using Multihead self-attention along with pre-trained Byte-Pair-Encoded (BPE) and MultiBPE embeddings to develop an efficient translation system.
arXiv Detail & Related papers (2020-04-19T17:29:34Z)