Active Learning for Massively Parallel Translation of Constrained Text
into Low Resource Languages
- URL: http://arxiv.org/abs/2108.07127v1
- Date: Mon, 16 Aug 2021 14:49:50 GMT
- Title: Active Learning for Massively Parallel Translation of Constrained Text
into Low Resource Languages
- Authors: Zhong Zhou and Alex Waibel
- Abstract summary: We translate a closed text that is known in advance and available in many languages into a new and severely low resource language.
We compare the portion-based approach, which optimizes coherence of the text locally, with the random sampling approach, which increases coverage of the text globally.
We propose an algorithm for humans and machines to work together seamlessly to translate a closed text into a severely low resource language.
- Score: 26.822210580244885
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We translate a closed text that is known in advance and available in many
languages into a new and severely low resource language. Most human translation
efforts adopt a portion-based approach to translate consecutive pages/chapters
in order, which may not suit machine translation. We compare the portion-based
approach that optimizes coherence of the text locally with the random sampling
approach that increases coverage of the text globally. Our results show that
the random sampling approach performs better. When training on a seed corpus of
~1,000 lines from the Bible and testing on the rest of the Bible (~30,000
lines), random sampling gives a performance gain of +11.0 BLEU using English as
a simulated low resource language, and +4.9 BLEU using Eastern Pokomchi, a
Mayan language. Furthermore, we compare three ways of updating machine
translation models with an increasing amount of human post-edited data through
iterations. We find that adding newly post-edited data to training after a
vocabulary update, without self-supervision, performs the best. We propose an
algorithm for humans and machines to work together seamlessly to translate a
closed text into a severely low resource language.
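To make the comparison concrete, here is a minimal Python sketch of the two seed-selection strategies, assuming the closed text is given as a list of lines; both function names are illustrative, not from the paper:

```python
import random

def portion_based_seed(lines, seed_size):
    """Portion-based selection: consecutive lines (e.g., whole chapters
    in order), optimizing local coherence of the seed corpus."""
    return lines[:seed_size]

def random_sampling_seed(lines, seed_size, rng_seed=0):
    """Random sampling: lines drawn uniformly from the whole text,
    increasing global coverage of its vocabulary."""
    return random.Random(rng_seed).sample(lines, seed_size)

# Illustrative split mirroring the abstract: ~1,000 seed lines for
# training, the remaining ~30,000 lines held out for testing.
lines = [f"verse {i}" for i in range(31_000)]   # placeholder closed text
train_lines = random_sampling_seed(lines, 1_000)
train_set = set(train_lines)
held_out = [l for l in lines if l not in train_set]
```

Under this kind of split, the paper reports that random sampling outperforms the portion-based alternative by +11.0 BLEU with English as the simulated low resource language and +4.9 BLEU with Eastern Pokomchi.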
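The closing claim about humans and machines working together can be read as an iterative loop. Below is a schematic sketch under that reading, reusing `random_sampling_seed` from the sketch above; `train_model`, `human_post_edit`, and `update_vocab` are hypothetical callables (an NMT trainer, human correction of machine drafts returned as a list, and a subword-vocabulary refresh), not APIs from the paper.

```python
def human_machine_loop(lines, seed_size, batch_size, iterations,
                       train_model, human_post_edit, update_vocab):
    """Schematic loop: sample a seed, have humans translate it, then
    repeatedly post-edit machine drafts for a fresh random batch,
    refresh the vocabulary, and retrain on all post-edited data with
    no self-supervision (the best of the three update strategies
    compared in the abstract)."""
    seed = random_sampling_seed(lines, seed_size)
    corpus = human_post_edit(seed, drafts=None)    # seed translated by humans
    model = train_model(corpus)
    seed_set = set(seed)
    remaining = [l for l in lines if l not in seed_set]
    for _ in range(iterations):
        batch = random_sampling_seed(remaining, min(batch_size, len(remaining)))
        drafts = [model.translate(l) for l in batch]
        corpus += human_post_edit(batch, drafts)   # humans fix machine drafts
        update_vocab(corpus)                       # vocabulary update first ...
        model = train_model(corpus)                # ... then retrain on post-edits
        batch_set = set(batch)
        remaining = [l for l in remaining if l not in batch_set]
    return model
```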
Related papers
- Exploring Linguistic Similarity and Zero-Shot Learning for Multilingual
Translation of Dravidian Languages [0.34998703934432673]
We build a single-decoder neural machine translation system for Dravidian-Dravidian multilingual translation.
Our model achieves scores within 3 BLEU of large-scale pivot-based models when it is trained on 50% of the language directions.
arXiv Detail & Related papers (2023-08-10T13:38:09Z)
- Train Global, Tailor Local: Minimalist Multilingual Translation into Endangered Languages [26.159803412486955]
In humanitarian scenarios, translation into severely low resource languages often does not require a universal translation engine.
We attempt to leverage translation resources from many rich resource languages to efficiently produce the best possible translation quality.
We find that adapting large pretrained multilingual models to the domain/text first and then to the severely low resource language works best.
arXiv Detail & Related papers (2023-05-05T23:22:16Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation [53.55009917938002]
We propose to refine the mined bitexts via automatic editing.
Experiments demonstrate that our approach successfully improves the quality of CCMatrix mined bitext for 5 low-resource language pairs and 10 translation directions by up to 8 BLEU points.
arXiv Detail & Related papers (2021-11-12T16:00:39Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language.
It supports both statistical and neural translation models and provides quality estimation to inform users of translation reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance over the mBART baseline.
Our approach also boosts performance on translation pairs where both languages were seen in mBART's original pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)